

Re: Tez branch and tez based patches
Commit-then-review, and self-commit, destroy the good things we get from
our normal system.

http://anna.gs/blog/2013/08/12/code-review-ftw/

I am most worried about knowledge silos, lax testing policies, and code
quality, which I have now seen on several occasions when something is
happening in a branch (not calling out the Tez branch in particular).

On Fri, Aug 16, 2013 at 9:13 AM, Edward Capriolo <[EMAIL PROTECTED]> wrote:

> I am still not sure we are doing this the ideal way. I am not a believer
> in a commit-then-review branch.
>
> This issue is an example.
>
> https://issues.apache.org/jira/browse/HIVE-5108
>
> I ask myself these questions:
> Does this currently work? Are there tests? If so, which ones are broken?
> How does the patch fix them without tests to validate?
>
> Having a commit-then-review branch just seems subversive to our normal
> process, and a quick shortcut to avoid being bothered with writing tests
> or involving anyone else.
>
>
>
> On Mon, Aug 5, 2013 at 1:54 PM, Alan Gates <[EMAIL PROTECTED]> wrote:
>
>>
>> On Jul 29, 2013, at 9:53 PM, Edward Capriolo wrote:
>>
>> > I also watched http://www.ustream.tv/recorded/36323173
>> >
>> > I definitely see the win in being able to stream inter-stage output.
>> >
>> > I see some cases where small intermediate results can be kept
>> > "in memory". But I was somewhat under the impression that the
>> > MapReduce spill settings kept stuff in memory; isn't that what
>> > spill settings are?
>>
>> No.  MapReduce always writes shuffle data to local disk.  And
>> intermediate results between MR jobs are always persisted to HDFS, as
>> there's no other option.  When we talk of being able to keep intermediate
>> results in memory we mean getting rid of both of these disk writes/reads
>> when appropriate (meaning not always, there's a trade off between speed and
>> error handling to be made here, see below for more details).
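
To make the two disk hops concrete, here is a minimal sketch in plain
MapReduce Java (the job names and paths are invented, and the mapper/reducer
classes are left as the identity defaults): stage 1's shuffle still goes
through local disk internally, and its final output must be written to an
HDFS path purely so that stage 2 can read it back. That second hop is the
one Tez aims to skip when it is safe to do so.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class TwoStagePipeline {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path input = new Path("/user/demo/input");         // invented path
      Path intermediate = new Path("/tmp/demo/stage1");  // must live on HDFS
      Path output = new Path("/user/demo/output");       // invented path

      // Stage 1: its output is persisted to HDFS, whether or not anything
      // other than stage 2 will ever read it.
      Job stage1 = Job.getInstance(conf, "stage1");
      FileInputFormat.addInputPath(stage1, input);
      FileOutputFormat.setOutputPath(stage1, intermediate);
      // mapper/reducer classes omitted; the defaults are enough for the sketch
      if (!stage1.waitForCompletion(true)) System.exit(1);

      // Stage 2: reads the intermediate data back off HDFS before doing its work.
      Job stage2 = Job.getInstance(conf, "stage2");
      FileInputFormat.addInputPath(stage2, intermediate);
      FileOutputFormat.setOutputPath(stage2, output);
      System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
  }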
>>
>> >
>> > There are a few bullet points that came up repeatedly that I do
>> > not follow:
>> >
>> > Something was said to the effect of "Container reuse makes X faster".
>> > Hadoop has JVM reuse; I am not following what the difference is here.
>> > Not everyone has a 10K-node cluster.
>>
>> Sharing JVMs across users is inherently insecure (we can't guarantee what
>> code the first user left behind that may interfere with later users).  As I
>> understand container re-use in Tez, it constrains the re-use to one user for
>> security reasons, but still avoids additional JVM startup costs.  But this
>> is a question that the Tez guys could answer better on the Tez lists
>> ([EMAIL PROTECTED]).
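
For comparison, a rough sketch of where the two knobs live, using the
property names as I understand them (check the MapReduce and Tez
documentation for your versions):

  import org.apache.hadoop.conf.Configuration;

  public class ReuseSettings {
    public static void main(String[] args) {
      // Classic MapReduce JVM reuse: tasks of the *same job* may share a JVM.
      Configuration mrConf = new Configuration();
      mrConf.set("mapred.job.reuse.jvm.num.tasks", "-1");  // -1 = no limit on reuse

      // Tez container reuse: YARN containers are held and reused across the
      // tasks of one application (one user's DAG), so the JVM startup cost is
      // avoided without ever sharing a JVM across users.
      Configuration tezConf = new Configuration();
      tezConf.setBoolean("tez.am.container.reuse.enabled", true);
    }
  }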
>>
>> >
>> > "Joins in map reduce are hard" Really? I mean some of them are I guess,
>> but
>> > the typical join is very easy. Just shuffle by the join key. There was
>> not
>> > really enough low level details here saying why joins are better in tez.
>>
>> Join is not a natural operation in MapReduce.  MR gives you one input and
>> one output.  You end up having to bend the rules to have multiple
>> inputs.  The idea here is that Tez can provide operators that naturally
>> work with joins and other operations that don't fit the one input/one
>> output model (e.g. unions).
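
The rule-bending is easy to see in a classic reduce-side join: MapReduce
only models one logical input, so MultipleInputs is bolted on, each mapper
tags its records with their source, and the reducer untangles the tags
again. A minimal sketch follows; the paths, the table layout (tab-separated
with the join key first), and the class names are all invented for
illustration.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class ReduceSideJoin {
    // Tags each "orders" record with O and keys it by the first field (the join key).
    public static class OrdersMapper extends Mapper<LongWritable, Text, Text, Text> {
      protected void map(LongWritable k, Text v, Context ctx)
          throws IOException, InterruptedException {
        String[] f = v.toString().split("\t", 2);
        ctx.write(new Text(f[0]), new Text("O\t" + (f.length > 1 ? f[1] : "")));
      }
    }

    // Same idea for the "customers" side, tagged with C.
    public static class CustomersMapper extends Mapper<LongWritable, Text, Text, Text> {
      protected void map(LongWritable k, Text v, Context ctx)
          throws IOException, InterruptedException {
        String[] f = v.toString().split("\t", 2);
        ctx.write(new Text(f[0]), new Text("C\t" + (f.length > 1 ? f[1] : "")));
      }
    }

    // Sees all tagged records for one join key and pairs the two sides up
    // (buffering one side in memory -- another rule being bent).
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
      protected void reduce(Text key, Iterable<Text> vals, Context ctx)
          throws IOException, InterruptedException {
        List<String> orders = new ArrayList<String>();
        List<String> customers = new ArrayList<String>();
        for (Text v : vals) {
          String s = v.toString();
          if (s.startsWith("O\t")) orders.add(s.substring(2));
          else customers.add(s.substring(2));
        }
        for (String o : orders)
          for (String c : customers)
            ctx.write(key, new Text(o + "\t" + c));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "reduce-side join");
      job.setJarByClass(ReduceSideJoin.class);
      MultipleInputs.addInputPath(job, new Path("/data/orders"),
          TextInputFormat.class, OrdersMapper.class);
      MultipleInputs.addInputPath(job, new Path("/data/customers"),
          TextInputFormat.class, CustomersMapper.class);
      job.setReducerClass(JoinReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileOutputFormat.setOutputPath(job, new Path("/data/joined"));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

As I understand the Tez model, the same join becomes a single vertex with
two incoming edges, which is what "operators that naturally work with
joins" points at; the tagging and in-reducer buffering above go away.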
>>
>> >
>> > "Chosing the number of maps and reduces is hard" Really? I do not find
>> it
>> > that hard, I think there are times when it's not perfect but I do not
>> find
>> > it hard. The talk did not really offer anything here technical on how
>> tez
>> > makes this better other then it could make it better.
>>
>> Perhaps manual would be a better term here than hard.  In our experience
>> it takes quite a bit of engineer trial and error to determine the optimal
>> numbers.  This may be ok if you're going to invest the time once and then
>> run the same query every day for 6 months.  But obviously it doesn't work
>> for the ad hoc case.  Even in the batch case it's not optimal because every
>> once in a while an engineer has to go back and re-optimize the query to
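
On the "manual" point, a rough sketch of the knobs that end up being turned
by hand at the MapReduce layer (the numbers below are invented; finding
workable ones per query and per dataset is the trial and error described
above):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  public class ParallelismTuning {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "hand-tuned job");

      // Reduce-side parallelism is fixed up front by whoever tuned the job...
      job.setNumReduceTasks(32);  // guessed, measured, and re-guessed over time

      // ...and map-side parallelism is steered indirectly through split sizes.
      FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
      FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB
    }
  }

The argument being made for Tez, as I read it, is that the framework should
be able to pick or adjust such values itself rather than leaving them to
the engineer.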