Pig >> mail # user >> About full pipeline between pig jobs


Re: About full pipeline between pig jobs
Thanks Alan for your nice explanation.

I am not very familiar with YARN, but it seems to me that the M/R
architecture does not fully support a pipelined data-flow paradigm at its
core.  Unless there is a strong resource negotiator that can coordinate
M/R jobs and control the flow between them, a pipelined execution of a Pig
script is impossible.

I think a multi-channel parallel pipeline paradigm could greatly improve
Pig's performance on ETL tasks, and I hope it can be realized with YARN.
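To make the current behavior concrete, here is a toy simulation (my own sketch, not Pig or Hadoop internals; `run_job` and the `hdfs` dict are invented for illustration) of what a multi-job Pig script compiles to today: every M/R job writes its complete reduce output to HDFS before the next job's map tasks may read it, so the HDFS write is a hard barrier between jobs.

```python
# Toy model of Pig's current execution: consecutive M/R jobs with a full
# HDFS materialization between them.  Names here are illustrative only.

def run_job(map_fn, reduce_fn, input_path, output_path, hdfs):
    """Run one simulated M/R job: read input, map, shuffle, reduce, spool."""
    records = hdfs[input_path]
    mapped = [kv for rec in records for kv in map_fn(rec)]
    # shuffle: group values by key
    groups = {}
    for k, v in mapped:
        groups.setdefault(k, []).append(v)
    # reduce, then write the *entire* output to HDFS -- this write is the
    # synchronization barrier between consecutive jobs
    hdfs[output_path] = [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

# Two consecutive jobs: word count, then keep words with count > 1.
hdfs = {"/input": ["a b", "b c", "b"]}
run_job(lambda line: [(w, 1) for w in line.split()],
        lambda k, vs: (k, sum(vs)),
        "/input", "/tmp/job1", hdfs)
# Job 2 cannot start until /tmp/job1 is fully written.
run_job(lambda kv: [kv] if kv[1] > 1 else [],
        lambda k, vs: (k, vs[0]),
        "/tmp/job1", "/tmp/job2", hdfs)

print(hdfs["/tmp/job2"])  # → [('b', 3)]
```

A pipelined runtime would stream job 1's reduce output straight into job 2's maps instead of landing it in `/tmp/job1` first, which is exactly what the current MR model rules out.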
2012/10/22 Alan Gates <[EMAIL PROTECTED]>

> At this point, no.  In the current MapReduce infrastructure it would take
> a lot of hackery that breaks the MR abstraction to make this work[1].  This
> is one thing we'd like to do as we move Pig to work on Hadoop 2.0 (aka
> YARN) where it is easier for applications to build these types of features.
>
> [1]  Details on why this is so:  Assume you want to pipeline two jobs.
>  When job 1 gets to its reducers, it has to pause until job 2 starts,
> because it can't know where job 2's map tasks will run a priori.  Job 1's
> reducer has to be able to handle the case where job 2's map task fails and
> it needs to restart the streaming, which means it has to spool to HDFS
> anyway.  In the same way job 2's map tasks need to be able to handle
> failure and restart of job 1's reducer (which is easier, they could just
> die).  Plus you need to handle the possibility of deadlocks (i.e., so much
> of your cluster or your user's quota may be taken up by job 1 that job 2
> will never start or get enough map tasks until job 1 ends).  Current
> MapReduce strongly discourages intertask communication for exactly these
> reasons.
>
> Alan.
>
> On Oct 22, 2012, at 3:34 AM, W W wrote:
>
> > Hello,
> >
> > I wonder if the M/R jobs compiled from a Pig script support pipelining
> > between jobs.
> >
> > For example, let's assume there are 5 independent consecutive M/R jobs
> > doing some joining and aggregating tasks.
> > My question is: can one job be started before its previous job has
> > finished, so that the previous job doesn't need to write all of its
> > reduce output to HDFS?  I just can't find any material discussing this.
> >
> > I think Ab Initio is a good example of a fully pipelined architecture.
> >
> > Thanks & Regards
> > Stephen
>
>
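Alan's deadlock point in footnote [1] can be sketched with a toy slot-allocation model (my own illustration, not the Hadoop scheduler): under pipelining, job 1's reducers can only finish once job 2's maps are consuming their output, so if job 1 already holds every slot, job 2's maps never get scheduled and neither job can make progress.

```python
# Toy model of the pipelining deadlock: job 1 occupies the whole cluster,
# so job 2's map tasks cannot be scheduled.  Slot counts are invented.

TOTAL_SLOTS = 10
job1_tasks = 10                       # job 1's tasks hold every slot
free_slots = TOTAL_SLOTS - job1_tasks
job2_maps_can_start = free_slots > 0

# With pipelining, job 1's reducers block until job 2's maps are running;
# job 2's maps block until slots free up.  Circular wait => deadlock.
deadlocked = (not job2_maps_can_start) and job1_tasks > 0
print(deadlocked)  # → True
```

Without pipelining there is no circular wait: job 1 spools to HDFS, releases its slots, and only then does job 2 start, which is why current MapReduce sidesteps the problem entirely.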