"The temporary file solution will work in a single node configuration, but
I'm not sure about an MPP config.
Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
both jobs run on all 4 nodes - will HDFS be able to redistribute the
records between nodes automagically, or does this need to be coded?"
Correct me if I misunderstood your problem, but there shouldn't be any
concern about this point. The output of Job 1 will be on HDFS, the same
file system which your Job 2 will use to read its input from. This file
system hides from you which node in the cluster the data actually resides
on. Your Job 2 should read the output of Job 1 from an HDFS path. Also, you
can make your second Job 2 dependent on the completion of Job 1; you can do
that in the driver code. That way your Job 2 will only run if Job 1 has
completed successfully.
Maybe I am missing something here, e.g. why are you using ChainReducer in
Job A?
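A minimal driver sketch of the dependency idea described above: run Job 1, block on its completion, then launch Job 2 reading Job 1's HDFS output path. The mapper/reducer class names (StepOneMapper, etc.) are hypothetical placeholders, not anything from the original post.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // Job 1's output, Job 2's input (HDFS)
        Path output = new Path(args[2]);

        Job job1 = Job.getInstance(conf, "job 1");
        job1.setJarByClass(ChainDriver.class);
        job1.setMapperClass(StepOneMapper.class);   // hypothetical class
        job1.setReducerClass(StepOneReducer.class); // hypothetical class
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);

        // Block until Job 1 finishes; abort the driver if it failed,
        // so Job 2 only runs after a successful Job 1.
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        Job job2 = Job.getInstance(conf, "job 2");
        job2.setJarByClass(ChainDriver.class);
        job2.setMapperClass(StepTwoMapper.class);   // hypothetical class
        job2.setReducerClass(StepTwoReducer.class); // hypothetical class
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        // Job 2 reads Job 1's output from HDFS; the framework, not your
        // code, decides which nodes serve the blocks.
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);

        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```

Since `waitForCompletion` returns false on failure, the intermediate path is only ever consumed after Job 1 has written it completely.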
"Is there a better way to pass the output from Job A as input to Job B
(e.g. using network streams or some built in java classes that don't do
disk i/o)? "
Maybe Hadoop streaming? But then you have to construct and design your jobs
in such a way, which seems to me an overhead. Maybe experts can help.
On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER
> Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
> prefer to keep everything as close to the hadoop libraries as possible.
> I am sure I am overlooking something basic, as repartitioning is a fairly
> common operation in MPP environments.
> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <[EMAIL PROTECTED]>wrote:
>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>> think there are others (Spring Integration maybe?) but I'm not familiar
>> enough with them to make a recommendation.
>> Note that with Cascading and Pig you don't write 'map reduce' you write
>> logic and they map it to the various mapper/reducer steps automatically.
>> Hope this helps,
>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <[EMAIL PROTECTED]
>> > wrote:
>>> My application requires 2 distinct processing steps (reducers) to be
>>> performed on the input data. The first operation changes the key
>>> values, and records that had different keys in step 1 can end up having the
>>> same key in step 2.
>>> The heavy lifting of the operation is in step 1, and step 2 only combines
>>> records whose keys were changed.
>>> In short the overview is:
>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>> To implement this in hadoop, it seems that I need to create a separate
>>> job for each step.
>>> Now I assumed there would be some sort of job management under hadoop to
>>> link Job 1 and 2, but the only thing I could find was related to job
>>> scheduling, and nothing on how to synchronize the input/output of the linked
>>> jobs.
>>> The only crude solution that I can think of is to use a temporary file
>>> under HDFS, but even so I'm not sure if this will work.
>>> The overview of the process would be:
>>> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
>>> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
>>> (key2, value3)] => output.
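The Job A stage described above could be wired up with the ChainMapper/ChainReducer helpers from the `mapreduce.lib.chain` package, roughly as follows. This is a sketch of one possible reading of that pipeline; the mapper/reducer class names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

// Job A: a single MapReduce job. The chain helpers pass records
// between the chained stages in memory; only the final (key2, value2)
// pairs are written to the temporary HDFS directory that Job B reads.
Job jobA = Job.getInstance(new Configuration(), "job A");

// Map phase: (line offset, line) -> (key1, value1). Class is hypothetical.
ChainMapper.addMapper(jobA, StepOneMapper.class,
    LongWritable.class, Text.class, // input key/value types
    Text.class, Text.class,         // output key/value types
    new Configuration(false));

// Reduce phase: (key1, [value1]) -> (key2, value2). Class is hypothetical.
ChainReducer.setReducer(jobA, StepOneReducer.class,
    Text.class, Text.class,
    Text.class, Text.class,
    new Configuration(false));
```

Note that ChainReducer only chains extra mappers after a single reducer within one job; it does not remove the need for a second job (and a second shuffle) when records must be regrouped under new keys.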
>>> Is there a better way to pass the output from Job A as input to Job B
>>> (e.g. using network streams or some built in java classes that don't do
>>> disk i/o)?
>>> The temporary file solution will work in a single node configuration,
>>> but I'm not sure about an MPP config.
>>> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
>>> both jobs run on all 4 nodes - will HDFS be able to redistribute the
>>> records between nodes automagically, or does this need to be coded?