Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hive, mail # dev - Tez branch and tez based patches


Copy link to this message
-
Re: Tez branch and tez based patches
Edward Capriolo 2013-08-16, 13:13
I still am not sure we are doing this the ideal way. I am not a believer in
a commit-then-review branch.

This issue is an example.

https://issues.apache.org/jira/browse/HIVE-5108

I ask myself these questions:
Does this currently work? Are their tests? If so which ones are broken? How
does the patch fix them without tests to validate?

Having a commit-then-review branch just seems subversive to our normal
process, and a quick short cut to not have to be bothered by writing tests
or involving anyone else.

On Mon, Aug 5, 2013 at 1:54 PM, Alan Gates <[EMAIL PROTECTED]> wrote:

>
> On Jul 29, 2013, at 9:53 PM, Edward Capriolo wrote:
>
> > Also watched http://www.ustream.tv/recorded/36323173
> >
> > I definitely see the win in being able to stream inter-stage output.
> >
> > I see some cases where small intermediate results can be kept "In
> memory".
> > But I was somewhat under the impression that the map reduce spill
> settings
> > kept stuff in memory, isn't that what spill settings are?
>
> No.  MapReduce always writes shuffle data to local disk.  And intermediate
> results between MR jobs are always persisted to HDFS, as there's no other
> option.  When we talk of being able to keep intermediate results in memory
> we mean getting rid of both of these disk writes/reads when appropriate
> (meaning not always, there's a trade off between speed and error handling
> to be made here, see below for more details).
>
> >
> > There is a few bullet points that came up repeatedly that I do not
> follow:
> >
> > Something was said to the effect of "Container reuse makes X faster".
> > Hadoop has jvm reuse. Not following what the difference is here? Not
> > everyone has a 10K node cluster.
>
> Sharing JVMs across users is inherently insecure (we can't guarantee what
> code the first user left behind that may interfere with later users).  As I
> understand container re-use in Tez it constrains the re-use to one user for
> security reasons, but still avoids additional JVM start up costs.  But this
> is a question that the Tez guys could answer better on the Tez lists (
> [EMAIL PROTECTED])
>
> >
> > "Joins in map reduce are hard" Really? I mean some of them are I guess,
> but
> > the typical join is very easy. Just shuffle by the join key. There was
> not
> > really enough low level details here saying why joins are better in tez.
>
> Join is not a natural operation in MapReduce.  MR gives you one input and
> one output.  You end up having to bend the rules to do have multiple
> inputs.  The idea here is that Tez can provide operators that naturally
> work with joins and other operations that don't fit the one input/one
> output model (eg unions, etc.).
>
> >
> > "Chosing the number of maps and reduces is hard" Really? I do not find it
> > that hard, I think there are times when it's not perfect but I do not
> find
> > it hard. The talk did not really offer anything here technical on how tez
> > makes this better other then it could make it better.
>
> Perhaps manual would be a better term here than hard.  In our experience
> it takes quite a bit of engineer trial and error to determine the optimal
> numbers.  This may be ok if you're going to invest the time once and then
> run the same query every day for 6 months.  But obviously it doesn't work
> for the ad hoc case.  Even in the batch case it's not optimal because every
> once and a while an engineer has to go back and re-optimize the query to
> deal with changing data sizes, data characteristics, etc.  We want the
> optimizer to handle this without human intervention.
>
> >
> > The presentations mentioned streaming data, how do two nodes stream data
> > between a tasks and how it it reliable? If the sender or receiver dies
> does
> > the entire process have to start again?
>
> If the sender or receiver dies then the query has to be restarted from
> some previous point where data was persisted to disk.  The idea here is
> that speed vs error recovery trade offs should be made by the optimizer.