Re: Tez branch and tez based patches
Commit-then-review, combined with self-commit, destroys the good things we
get from our normal process.

I am most worried about silos of knowledge, lax testing policies, and poor
code quality, all of which I have now seen on several occasions when work is
happening in a branch. (I am not calling out the tez branch in particular.)

On Fri, Aug 16, 2013 at 9:13 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

> I still am not sure we are doing this the ideal way. I am not a believer
> in a commit-then-review branch.
> This issue is an example.
> https://issues.apache.org/jira/browse/HIVE-5108
> I ask myself these questions:
> Does this currently work? Are there tests? If so, which ones are broken?
> How does the patch fix them without tests to validate?
> Having a commit-then-review branch just seems subversive to our normal
> process, and a quick shortcut to avoid being bothered with writing tests
> or involving anyone else.
> On Mon, Aug 5, 2013 at 1:54 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
>> On Jul 29, 2013, at 9:53 PM, Edward Capriolo wrote:
>> > Also watched http://www.ustream.tv/recorded/36323173
>> >
>> > I definitely see the win in being able to stream inter-stage output.
>> >
>> > I see some cases where small intermediate results can be kept "In memory".
>> > But I was somewhat under the impression that the map reduce spill settings
>> > kept stuff in memory, isn't that what spill settings are?
>> No.  MapReduce always writes shuffle data to local disk.  And
>> intermediate results between MR jobs are always persisted to HDFS, as
>> there's no other option.  When we talk of being able to keep intermediate
>> results in memory we mean getting rid of both of these disk writes/reads
>> when appropriate (meaning not always, there's a trade off between speed and
>> error handling to be made here, see below for more details).
>> >
>> > There are a few bullet points that came up repeatedly that I do not follow:
>> >
>> > Something was said to the effect of "Container reuse makes X faster".
>> > Hadoop has jvm reuse. Not following what the difference is here? Not
>> > everyone has a 10K node cluster.
>> Sharing JVMs across users is inherently insecure (we can't guarantee what
>> code the first user left behind that may interfere with later users).  As I
>> understand container re-use in Tez it constrains the re-use to one user for
>> security reasons, but still avoids additional JVM start up costs.  But this
>> is a question that the Tez guys could answer better on the Tez lists (
>> >
>> > "Joins in map reduce are hard" Really? I mean some of them are I guess, but
>> > the typical join is very easy. Just shuffle by the join key. There were not
>> > really enough low-level details here saying why joins are better in tez.
>> Join is not a natural operation in MapReduce.  MR gives you one input and
>> one output.  You end up having to bend the rules to have multiple
>> inputs.  The idea here is that Tez can provide operators that naturally
>> work with joins and other operations that don't fit the one input/one
>> output model (e.g. unions, etc.).
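
For readers following the join exchange above, here is a minimal sketch of what "shuffle by the join key" involves when there is only one input stream per job. It is plain Python that imitates the map, shuffle, and reduce phases in memory; it is not Hive's or Tez's implementation, and every name in it (map_tagged, shuffle, reduce_join, shuffle_join, the "L"/"R" tags) is hypothetical. The tagging step is the "bending the rules" part: both inputs are merged into one stream and each record carries a marker saying which side it came from.

```python
from collections import defaultdict


def map_tagged(records, tag, key_index):
    """Map phase: emit (join_key, (tag, record)) so both inputs share one stream."""
    for rec in records:
        yield rec[key_index], (tag, rec)


def shuffle(pairs):
    """Shuffle phase: group all tagged records by join key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(groups):
    """Reduce phase: split each group by tag, then emit the per-key cross product."""
    out = []
    for key in sorted(groups):
        left = [rec for tag, rec in groups[key] if tag == "L"]
        right = [rec for tag, rec in groups[key] if tag == "R"]
        for l_rec in left:
            for r_rec in right:
                out.append((key, l_rec, r_rec))
    return out


def shuffle_join(left_rows, right_rows, left_key=0, right_key=0):
    """Inner join of two row lists, done the single-input MapReduce way."""
    pairs = list(map_tagged(left_rows, "L", left_key))
    pairs += list(map_tagged(right_rows, "R", right_key))
    return reduce_join(shuffle(pairs))


if __name__ == "__main__":
    users = [(1, "ann"), (2, "bob")]
    orders = [(1, "book"), (1, "pen"), (3, "cup")]
    for row in shuffle_join(users, orders):
        print(row)
```

The reducer has to buffer and re-sort records by tag because the framework only sees one undifferentiated stream. An engine with native multi-input operators would not need that workaround.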
>> >
>> > "Choosing the number of maps and reduces is hard" Really? I do not find it
>> > that hard, I think there are times when it's not perfect but I do not find
>> > it hard. The talk did not really offer anything technical here on how tez
>> > makes this better, other than that it could make it better.
>> Perhaps manual would be a better term here than hard.  In our experience
>> it takes quite a bit of engineer trial and error to determine the optimal
>> numbers.  This may be ok if you're going to invest the time once and then
>> run the same query every day for 6 months.  But obviously it doesn't work
>> for the ad hoc case.  Even in the batch case it's not optimal because every
>> once in a while an engineer has to go back and re-optimize the query to