-
About full pipeline between pig jobs
W W 2012-10-22, 10:34
Hello,
I wonder whether the M/R jobs compiled from a Pig script support pipelining between jobs.
For example, let's assume there are 5 independent consecutive M/R jobs doing some joining and aggregating tasks. My question is: can one job start before its previous job has finished, so that the previous job doesn't need to write all of its reduce output to HDFS? I can't find any material that discusses this.
I think Ab Initio is a good example of a fully pipelined architecture.
Thanks & Regards Stephen
-
Re: About full pipeline between pig jobs
Alan Gates 2012-10-22, 13:16
At this point, no. In the current MapReduce infrastructure it would take a lot of hackery that breaks the MR abstraction to make this work[1]. This is one thing we'd like to do as we move Pig to work on Hadoop 2.0 (aka YARN) where it is easier for applications to build these types of features.
[1] Details on why this is so: Assume you want to pipeline two jobs. When job 1 gets to its reduces, it has to pause until job 2 starts, because it can't know a priori where job 2's map tasks will run. Job 1's reducer has to be able to handle the case where job 2's map task fails and it needs to restart the streaming, which means it has to spool to HDFS anyway. In the same way, job 2's map tasks need to be able to handle failure and restart of job 1's reducer (which is easier; they could just die). Plus you need to handle the possibility of deadlocks (i.e., so much of your cluster or your user's quota may be taken up by job 1 that job 2 will never start or get enough map tasks until job 1 ends). Current MapReduce strongly discourages intertask communication for exactly these reasons.
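The deadlock case can be illustrated with a toy slot-accounting model. This is a minimal sketch, not Hadoop or Pig code: the function name `can_pipeline`, its parameters, and the assumption that a pipelined reducer keeps its task slot until enough downstream map tasks are running are all hypothetical simplifications.

```python
def can_pipeline(total_slots: int, job1_reducers: int, job2_min_maps: int) -> bool:
    """Return True if job 2 can get its map slots while job 1 still holds its own.

    Toy model (an assumption, not the real scheduler): a pipelined reducer
    keeps its slot until downstream maps consume its stream, so job 1's
    reducers and job 2's minimum set of maps must fit in the cluster together.
    """
    if total_slots <= 0:
        raise ValueError("total_slots must be positive")
    # Job 1 can never hold more slots than the cluster (or quota) provides.
    held_by_job1 = min(job1_reducers, total_slots)
    free_slots = total_slots - held_by_job1
    # If job 2 cannot start enough maps, job 1 never drains: a deadlock.
    return free_slots >= job2_min_maps


if __name__ == "__main__":
    # Job 1 fills the whole cluster, so job 2 can never start.
    print(can_pipeline(total_slots=10, job1_reducers=10, job2_min_maps=1))
    # Job 1 leaves room for the maps that job 2 needs.
    print(can_pipeline(total_slots=10, job1_reducers=6, job2_min_maps=4))
```

In the staged execution that MapReduce uses today, job 1 writes its output to HDFS and releases its slots before job 2 starts, so this particular circular wait cannot arise.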
Alan.
-
Re: About full pipeline between pig jobs
W W 2012-10-22, 15:31
Thanks Alan for your nice explanation.
I am not very familiar with YARN, but it seems to me that the M/R architecture does not fully support a pipelined data flow paradigm at its core. Unless there is a strong Resource Navigator that can navigate between M/R jobs and control the flow, it's impossible to have pipeline-style execution of a Pig script.
I think a multi-channel parallel pipeline paradigm could greatly improve the performance of Pig on ETL tasks, and I hope that can be realized with YARN.