Re: About full pipeline between pig jobs
W W 2012-10-22, 15:31
Thanks Alan for your nice explanation.
I am not quite familiar with YARN, but it seems to me that the M/R
architecture does not fully support a pipelined data-flow paradigm at its
core. Unless there is a strong resource negotiator that can coordinate M/R
jobs and control the flow, a pipeline-style execution of consecutive jobs
is impossible.
I think a multi-channel parallel pipeline paradigm could greatly improve
the performance of Pig on ETL tasks.
And I hope that can be realized with YARN.
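To make my concern concrete, here is a minimal sketch of the kind of ETL
script I mean (the file paths and field names are hypothetical). Pig
typically compiles a script like this into two consecutive M/R jobs, because
the join and the group use different keys; job 2 cannot start consuming
until job 1 has written its full join result to HDFS:

```pig
-- Hypothetical ETL script; 'input1', 'input2', 'output' are placeholder paths.
A = LOAD 'input1' AS (k:chararray, v:int);
B = LOAD 'input2' AS (k:chararray, w:int);

-- Join keyed on k: compiled into M/R job 1, result materialized to HDFS.
J = JOIN A BY k, B BY k;

-- Group keyed on v (a different key): forces a second M/R job that must
-- re-read job 1's output from HDFS.
G = GROUP J BY A::v;
S = FOREACH G GENERATE group, COUNT(J);

STORE S INTO 'output';
```

If job 2's map tasks could instead stream job 1's reduce output directly,
the intermediate write and re-read of the full join result would be avoided.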
2012/10/22 Alan Gates <[EMAIL PROTECTED]>
> At this point, no. In the current MapReduce infrastructure it would take
> a lot of hackery that breaks the MR abstraction to make this work. This
> is one thing we'd like to do as we move Pig to work on Hadoop 2.0 (aka
> YARN) where it is easier for applications to build these types of features.
>  Details on why this is so: Assume you want to pipeline two jobs.
> When job 1 gets to its reduce phase, it has to pause until job 2 starts,
> because it can't know where job 2's map tasks will run a priori. Job 1's
> reducer has to be able to handle the case where job 2's map task fails and
> it needs to restart the streaming, which means it has to spool to HDFS
> anyway. In the same way job 2's map tasks need to be able to handle
> failure and restart of job 1's reducer (which is easier, they could just
> die).  Plus you need to handle the possibility of deadlocks (i.e., so much
> of your cluster or your user's quota may be taken up by job 1 that job 2
> will never start or get enough map tasks until job 1 ends). Current
> MapReduce strongly discourages inter-task communication for exactly these
> reasons.
> On Oct 22, 2012, at 3:34 AM, W W wrote:
> > Hello,
> > I wonder if M/R jobs compiled from a Pig script support pipelining
> > between jobs.
> > For example, let's assume there are 5 independent consecutive M/R jobs
> > doing some joining and aggregating task.
> > My question is: can one job start before its previous job has finished,
> > so that the previous job doesn't need to write all of its output data
> > to HDFS?  I just can't find any material talking about this.
> > I think Ab Initio is a good example of a full-pipeline architecture.
> > Thanks & Regards
> > Stephen