MapReduce, mail # user - chaining (the output of) jobs/ reducers


Re: chaining (the output of) jobs/ reducers
Adrian CAPDEFIER 2013-09-12, 16:35
Thank you, Chris. I will look at Cascading and Pig, but for starters I'd
prefer to keep everything as close to the Hadoop libraries as possible.

I am sure I am overlooking something basic, as repartitioning is a fairly
common operation in MPP environments.
On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <[EMAIL PROTECTED]> wrote:

> If you want to stay in Java, look at Cascading. Pig is also helpful. I
> think there are others (Spring Integration, maybe?), but I'm not familiar
> enough with them to make a recommendation.
>
> Note that with Cascading and Pig you don't write 'map reduce'; you write
> logic, and they map it to the various mapper/reducer steps automatically.
>
> Hope this helps,
>
> Chris
>
>
> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <[EMAIL PROTECTED]> wrote:
>
>> Howdy,
>>
>> My application requires 2 distinct processing steps (reducers) to be
>> performed on the input data. The first operation changes the key values,
>> and records that had different keys in step 1 can end up having the same
>> key in step 2.
>>
>> The heavy lifting of the operation is in step 1, and step 2 only combines
>> records whose keys were changed.
>>
>> In short the overview is:
>> Sequential file -> Step 1 -> Step 2 -> Output.
>>
>>
>> To implement this in Hadoop, it seems that I need to create a separate
>> job for each step.
>>
>> Now, I assumed there would be some sort of job management under Hadoop to
>> link Jobs 1 and 2, but the only thing I could find was related to job
>> scheduling, and nothing on how to synchronize the input/output of the
>> linked jobs.
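>>
>> (I am guessing the job-scheduling piece is JobControl / ControlledJob in
>> org.apache.hadoop.mapreduce.lib.jobcontrol; as far as I can tell it only
>> sequences the jobs and still leaves the data hand-off to a shared HDFS
>> path.) A rough, untested sketch, assuming jobA and jobB are the two
>> already-configured Job objects:
>>
>>     // Wrap both jobs and declare that B must wait for A to finish.
>>     ControlledJob cjA = new ControlledJob(jobA, null);
>>     ControlledJob cjB = new ControlledJob(jobB, null);
>>     cjB.addDependingJob(cjA);
>>
>>     JobControl control = new JobControl("step1-then-step2");
>>     control.addJob(cjA);
>>     control.addJob(cjB);
>>
>>     // JobControl is a Runnable: run it in a thread and poll until done.
>>     new Thread(control).start();
>>     while (!control.allFinished()) {
>>         Thread.sleep(500);
>>     }
>>     control.stop();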
>>
>>
>>
>> The only crude solution that I can think of is to use a temporary file
>> under HDFS, but even so I'm not sure if this will work.
>>
>> The overview of the process would be:
>> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
>> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
>> (key2, value3)] => output.
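>>
>> A rough sketch of the temporary-file approach, assuming the newer
>> org.apache.hadoop.mapreduce API (Hadoop 2.x); the Step*Mapper/Step*Reducer
>> class names are placeholders, and this is untested:
>>
>>     import org.apache.hadoop.conf.Configuration;
>>     import org.apache.hadoop.fs.FileSystem;
>>     import org.apache.hadoop.fs.Path;
>>     import org.apache.hadoop.io.Text;
>>     import org.apache.hadoop.mapreduce.Job;
>>     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>     import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
>>     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>     import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
>>
>>     public class TwoStepDriver {
>>       public static void main(String[] args) throws Exception {
>>         Configuration conf = new Configuration();
>>         Path input  = new Path("/data/input");
>>         Path temp   = new Path("/tmp/step1-output");   // intermediate data on HDFS
>>         Path output = new Path("/data/output");
>>
>>         // Job A: mapper emits (key1, value1), reducer re-keys to (key2, value2).
>>         Job jobA = Job.getInstance(conf, "step 1");
>>         jobA.setJarByClass(TwoStepDriver.class);
>>         jobA.setMapperClass(StepOneMapper.class);       // placeholder
>>         jobA.setReducerClass(StepOneReducer.class);     // placeholder
>>         jobA.setOutputKeyClass(Text.class);
>>         jobA.setOutputValueClass(Text.class);
>>         jobA.setOutputFormatClass(SequenceFileOutputFormat.class);
>>         FileInputFormat.addInputPath(jobA, input);
>>         FileOutputFormat.setOutputPath(jobA, temp);
>>         if (!jobA.waitForCompletion(true)) {            // blocks until step 1 is done
>>           System.exit(1);
>>         }
>>
>>         // Job B: reads the SequenceFile written by Job A and groups on key2.
>>         Job jobB = Job.getInstance(conf, "step 2");
>>         jobB.setJarByClass(TwoStepDriver.class);
>>         jobB.setMapperClass(StepTwoMapper.class);       // often just an identity mapper
>>         jobB.setReducerClass(StepTwoReducer.class);     // placeholder
>>         jobB.setOutputKeyClass(Text.class);
>>         jobB.setOutputValueClass(Text.class);
>>         jobB.setInputFormatClass(SequenceFileInputFormat.class);
>>         FileInputFormat.addInputPath(jobB, temp);
>>         FileOutputFormat.setOutputPath(jobB, output);
>>         boolean ok = jobB.waitForCompletion(true);
>>
>>         FileSystem.get(conf).delete(temp, true);        // clean up the intermediate data
>>         System.exit(ok ? 0 : 1);
>>       }
>>     }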
>>
>> Is there a better way to pass the output from Job A as input to Job B
>> (e.g. using network streams or some built-in Java classes that don't do
>> disk I/O)?
>>
>>
>>
>> The temporary file solution will work in a single-node configuration, but
>> I'm not sure about an MPP config.
>>
>> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
>> both jobs run on all 4 nodes - will HDFS be able to redistribute the
>> records between nodes automagically, or does this need to be coded
>> somehow?
>>
>
>