Pig, mail # user - About full pipeline between pig jobs

Re: About full pipeline between pig jobs
W W 2012-10-22, 15:31
Thanks Alan for your nice explanation.

I am not very familiar with YARN, but it seems to me that the M/R architecture
does not fully support the pipeline data-flow paradigm at its core.  Unless
there is a strong Resource Navigator that can navigate between M/R jobs
and control the flow, it's impossible to have pipeline-style execution of
a Pig script.

I think a multi-channel parallel pipeline paradigm could greatly improve the
performance of Pig on ETL tasks,
and I hope that can be realized with YARN.
2012/10/22 Alan Gates <[EMAIL PROTECTED]>

> At this point, no.  In the current MapReduce infrastructure it would take
> a lot of hackery that breaks the MR abstraction to make this work[1].  This
> is one thing we'd like to do as we move Pig to work on Hadoop 2.0 (aka
> YARN) where it is easier for applications to build these types of features.
> [1]  Details on why this is so:  Assume you want to pipeline two jobs.
>  When job 1 gets to its reduces, it has to pause until job 2 starts,
> because it can't know where job 2's map tasks will run a priori.  Job 1's
> reducer has to be able to handle the case where job 2's map task fails and
> it needs to restart the streaming, which means it has to spool to HDFS
> anyway.  In the same way job 2's map tasks need to be able to handle
> failure and restart of job 1's reducer (which is easier, they could just
> die).  Plus you need to handle the possibility of deadlocks (i.e., so much
> of your cluster or your user's quota may be taken up by job 1 that job 2
> will never start or get enough map tasks until job 1 ends).  Current
> MapReduce strongly discourages intertask communication for exactly these
> reasons.
> Alan.
> On Oct 22, 2012, at 3:34 AM, W W wrote:
> > Hello,
> >
> > I wonder if M/R jobs compiled from a Pig script support pipelining between
> > jobs.
> >
> > For example, let's assume there are 5 independent consecutive M/R jobs
> > doing some joining and aggregating tasks.
> > My question is: can one job be started before its previous job has
> > finished, so that the previous job doesn't need to write all the output
> > data from reduce to HDFS? I just can't find any material talking about this.
> >
> > I think Ab Initio is a good example of the full pipeline architecture.
> >
> > Thanks & Regards
> > Stephen
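
The deadlock Alan describes in [1] can be illustrated with a toy slot-counting model. This is only a sketch under simplifying assumptions: the function name `run_chain`, its parameters, and the fixed-slot cluster are all hypothetical, and nothing here reflects the actual Hadoop or Pig scheduler. It only shows why reducers that hold their slots while streaming can starve the next job's map tasks.

```python
def run_chain(total_slots: int, job1_reducers: int,
              job2_min_mappers: int, pipelined: bool) -> str:
    """Toy model of two chained jobs sharing a fixed pool of task slots.

    All names are hypothetical. job2_min_mappers is the number of job 2
    map tasks that must run at the same time as job 1's reducers for
    streaming to make progress (an assumption of this model).
    """
    if job1_reducers > total_slots:
        raise ValueError("job 1 cannot get more reducers than slots")

    if pipelined:
        # Job 1's reducers keep their slots while they wait for job 2's
        # map tasks to start reading the stream.
        free = total_slots - job1_reducers
        return "completed" if free >= job2_min_mappers else "deadlock"

    # Materialized (classic MapReduce): job 1 writes its output to HDFS
    # and exits, releasing every slot before job 2 starts. Job 2 can then
    # run its map tasks in waves, so a single free slot is enough.
    free = total_slots
    return "completed" if free >= 1 else "deadlock"


if __name__ == "__main__":
    # Job 1 fills the whole cluster, so pipelined job 2 can never start.
    print(run_chain(10, 10, 1, pipelined=True))
    # The same jobs finish when job 1 spools to HDFS first.
    print(run_chain(10, 10, 1, pipelined=False))
```

In this model the pipelined case only completes when enough slots are left over for job 2, which is the quota and capacity condition mentioned in the quoted explanation.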