MapReduce, mail # user - chaining (the output of) jobs/ reducers


Re: chaining (the output of) jobs/ reducers
Adrian CAPDEFIER 2013-09-17, 13:23
I've just seen your email, Vinod. This is the behaviour that I'd expect, and
it's similar to other data integration tools; I will keep an eye out for it
as a long-term option.
On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli
<[EMAIL PROTECTED]> wrote:

>
> Other than the short-term solutions that others have proposed, Apache Tez
> solves this exact problem. It can run M-M-R-R-R chains, multi-way mappers
> and reducers, and your own custom processors - all without persisting the
> intermediate outputs to HDFS.
>
> It works on top of YARN, though the first release of Tez is yet to happen.
>
> You can learn about it more here: http://tez.incubator.apache.org/
>
> HTH,
> +Vinod
>
> On Sep 12, 2013, at 6:36 AM, Adrian CAPDEFIER wrote:
>
> Howdy,
>
> My application requires 2 distinct processing steps (reducers) to be
> performed on the input data. The first operation changes the key values,
> and records that had different keys in step 1 can end up having the same
> key in step 2.
>
> The heavy lifting of the operation is in step 1; step 2 only combines
> records whose keys were changed.
>
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
>
>
> To implement this in Hadoop, it seems that I need to create a separate job
> for each step.
>
> Now, I assumed there would be some sort of job management under Hadoop to
> link Job 1 and Job 2, but the only thing I could find was related to job
> scheduling, and nothing on how to synchronize the input/output of the
> linked jobs.
>
>
>
> The only crude solution that I can think of is to use a temporary file
> under HDFS, but even so I'm not sure if this will work.
>
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
> (key2, value3)] => output.
>
> Is there a better way to pass the output from Job A as input to Job B
> (e.g. using network streams or some built-in Java classes that don't do
> disk I/O)?
>
>
>
> The temporary file solution will work in a single-node configuration, but
> I'm not sure about an MPP config.
>
> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
> both jobs run on all 4 nodes - will HDFS be able to redistribute the
> records between nodes automagically, or does this need to be coded
> somehow?
>
>
>
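A minimal sketch of the temporary-file approach described in the quoted mail
(sequential input => Job A => temporary HDFS directory => Job B => output),
assuming Hadoop 2.x and the mapreduce API; the Step1*/Step2* mapper and
reducer classes and the /tmp hand-off path are hypothetical placeholders,
not something from the original mail:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainSequentially {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path temp = new Path("/tmp/step1-output");  // hypothetical hand-off directory
    Path output = new Path(args[1]);

    // Job A: mapper emits (key1, value1), reducer re-keys to (key2, value2).
    Job jobA = Job.getInstance(conf, "step1");
    jobA.setJarByClass(ChainSequentially.class);
    jobA.setMapperClass(Step1Mapper.class);      // hypothetical
    jobA.setReducerClass(Step1Reducer.class);    // hypothetical
    jobA.setOutputKeyClass(Text.class);
    jobA.setOutputValueClass(Text.class);
    jobA.setOutputFormatClass(SequenceFileOutputFormat.class); // binary hand-off
    FileInputFormat.addInputPath(jobA, input);
    FileOutputFormat.setOutputPath(jobA, temp);

    if (!jobA.waitForCompletion(true)) {         // block until step 1 finishes
      System.exit(1);
    }

    // Job B: reads (key2, value2) from the temp directory, combines to
    // (key2, value3). Job B's own shuffle groups records by key2, so rows
    // produced by different Job A reducers still meet in one Job B reducer.
    Job jobB = Job.getInstance(conf, "step2");
    jobB.setJarByClass(ChainSequentially.class);
    jobB.setMapperClass(Step2Mapper.class);      // hypothetical (can be identity)
    jobB.setReducerClass(Step2Reducer.class);    // hypothetical
    jobB.setOutputKeyClass(Text.class);
    jobB.setOutputValueClass(Text.class);
    jobB.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(jobB, temp);
    FileOutputFormat.setOutputPath(jobB, output);

    boolean ok = jobB.waitForCompletion(true);

    // Clean up the intermediate directory once step 2 has consumed it.
    FileSystem.get(conf).delete(temp, true);
    System.exit(ok ? 0 : 1);
  }
}

Because the hand-off lives in HDFS, Job B's input splits are computed from
whatever the temp directory contains, so the two jobs do not have to run on
the same nodes.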
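For the job-management side of the question (linking Job 1 and Job 2 so that
the second is submitted only after the first succeeds), a sketch of the same
chain using the JobControl/ControlledJob classes from
org.apache.hadoop.mapreduce.lib.jobcontrol; it is a fragment that assumes
jobA and jobB are configured exactly as in the previous sketch, sharing the
temp path:

import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// jobA and jobB configured as in the previous sketch.
ControlledJob stepA = new ControlledJob(jobA, null);
ControlledJob stepB = new ControlledJob(jobB, null);
stepB.addDependingJob(stepA);   // stepB waits until stepA completes successfully

JobControl control = new JobControl("step1-then-step2");
control.addJob(stepA);
control.addJob(stepB);

// JobControl is a Runnable that tracks job states and submits jobs whose
// dependencies have been satisfied.
Thread driver = new Thread(control);
driver.start();
while (!control.allFinished()) {
  Thread.sleep(1000);
}
control.stop();

Either way the hand-off still goes through HDFS; avoiding that intermediate
write is exactly what the Tez DAG mentioned above is meant to address.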