W W 2012-10-22, 10:34
Re: About full pipeline between pig jobs
Alan Gates 2012-10-22, 13:16
At this point, no. In the current MapReduce infrastructure it would take a lot of hackery that breaks the MR abstraction to make this work. This is one thing we'd like to do as we move Pig to work on Hadoop 2.0 (aka YARN) where it is easier for applications to build these types of features.
Details on why this is so: Assume you want to pipeline two jobs. When job 1 reaches its reduce phase, it has to pause until job 2 starts, because it can't know a priori where job 2's map tasks will run. Job 1's reducer has to be able to handle the case where job 2's map task fails and it needs to restart the streaming, which means it has to spool to HDFS anyway. In the same way, job 2's map tasks need to be able to handle failure and restart of job 1's reducer (which is easier, since they could just die). Plus you need to handle the possibility of deadlocks (i.e., so much of your cluster or your user's quota may be taken up by job 1 that job 2 will never start, or never get enough map tasks, until job 1 ends). Current MapReduce strongly discourages intertask communication for exactly these reasons.
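The deadlock case can be sketched with a toy model in Python. This is not Hadoop or Pig code: the function name `both_jobs_finish`, the slot counts, and the rule that a pipelined reducer holds its slot until a job 2 map task reads from it are all assumptions made for illustration.

```python
def both_jobs_finish(total_slots: int, job1_reducers: int, pipelined: bool) -> bool:
    """Toy model (hypothetical): can job 1 and job 2 both complete?

    total_slots   -- task slots in the cluster (or in the user's quota)
    job1_reducers -- job 1 reducers currently holding a slot
    pipelined     -- True if reducers stream straight to job 2's maps
    """
    if not 0 < job1_reducers <= total_slots:
        raise ValueError("job 1 reducers must fit in the available slots")

    if not pipelined:
        # Reducers write their output to HDFS and exit, which releases
        # their slots, so job 2's maps can always be scheduled afterwards.
        return True

    # Pipelined (assumed behaviour): each reducer keeps its slot while it
    # waits for a job 2 map task to consume its output.
    free_slots = total_slots - job1_reducers

    # With no free slot, no job 2 map can start, so no reducer is ever
    # drained and neither job makes progress.
    return free_slots > 0


if __name__ == "__main__":
    print(both_jobs_finish(10, 10, pipelined=True))
    print(both_jobs_finish(10, 10, pipelined=False))
    print(both_jobs_finish(10, 6, pipelined=True))
```

In this model the only way out of the full-cluster case is for job 1 to finish and give its slots back, which is what spooling to HDFS already provides.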
On Oct 22, 2012, at 3:34 AM, W W wrote:
> I wonder whether M/R jobs compiled from a Pig script support pipelining between jobs.
> For example, let's assume there are 5 independent consecutive M/R jobs
> doing some joining and aggregating tasks.
> My question is: can one job be started before its previous job has finished, so
> that the previous job doesn't need to write all the output data from reduce
> to HDFS? I just can't find any material talking about this.
> I think Ab Initio is a good example of a full pipeline architecture.
> Thanks & Regards
W W 2012-10-22, 15:31