MapReduce user mailing list - chaining (the output of) jobs/ reducers


Adrian CAPDEFIER 2013-09-12, 13:36
Vinod Kumar Vavilapalli 2013-09-13, 04:26
Re: chaining (the output of) jobs/ reducers
I've just seen your email, Vinod. This is the behaviour that I'd expect, and
it is similar to other data integration tools; I will keep an eye out for it
as a long-term option.
On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli <[EMAIL PROTECTED]> wrote:

>
> Other than the short-term solutions that others have proposed, Apache Tez
> solves this exact problem. It can run M-M-R-R-R chains, multi-way mappers
> and reducers, and your own custom processors - all without persisting the
> intermediate outputs to HDFS.
>
> It works on top of YARN, though the first release of Tez is yet to happen.
>
> You can learn about it more here: http://tez.incubator.apache.org/
>
> HTH,
> +Vinod
>
> On Sep 12, 2013, at 6:36 AM, Adrian CAPDEFIER wrote:
>
> Howdy,
>
> My application requires 2 distinct processing steps (reducers) to be
> performed on the input data. The first operation changes the key values,
> and records that had different keys in step 1 can end up having the same
> key in step 2.
>
> The heavy lifting of the operation is in step 1, and step 2 only combines
> records whose keys were changed.
>
> In short the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
>
>
> To implement this in hadoop, it seems that I need to create a separate job
> for each step.
>
> Now, I assumed there would be some sort of job management under Hadoop to
> link Job 1 and Job 2, but the only thing I could find was related to job
> scheduling, and nothing on how to synchronize the input/output of the
> linked jobs.
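One way to express exactly this kind of link is the JobControl / ControlledJob
pair in org.apache.hadoop.mapreduce.lib.jobcontrol: it only sequences the jobs,
so the data still has to travel between them through a path that Job A writes
and Job B reads. A minimal sketch, assuming jobA and jobB are already-configured
Job instances wired to the same intermediate directory (runChained is just a
hypothetical helper name):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    // jobA and jobB are assumed to be fully configured Job instances,
    // with jobA writing to the HDFS directory that jobB reads as input.
    public static void runChained(Job jobA, Job jobB) throws Exception {
        ControlledJob cJobA = new ControlledJob(jobA.getConfiguration());
        cJobA.setJob(jobA);
        ControlledJob cJobB = new ControlledJob(jobB.getConfiguration());
        cJobB.setJob(jobB);
        cJobB.addDependingJob(cJobA);   // jobB is held back until jobA succeeds

        JobControl control = new JobControl("step1-then-step2");
        control.addJob(cJobA);
        control.addJob(cJobB);

        new Thread(control).start();    // JobControl implements Runnable
        while (!control.allFinished()) {
            Thread.sleep(500);          // poll until both jobs have completed
        }
        control.stop();
    }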
>
>
>
> The only crude solution that I can think of is to use a temporary file
> under HDFS, but even so I'm not sure if this will work.
>
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
> (key2, value3)] => output.
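A rough driver sketch of that crude approach, assuming the Hadoop 2 mapreduce
API; StepOneMapper, StepOneReducer, StepTwoMapper and StepTwoReducer are
placeholder class names standing in for the real step 1 and step 2 code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStepDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path temp = new Path(args[1]);     // intermediate HDFS directory between the jobs
        Path output = new Path(args[2]);

        // Job A: the heavy lifting; re-keys the records (placeholder classes).
        Job jobA = Job.getInstance(conf, "step 1");
        jobA.setJarByClass(TwoStepDriver.class);
        jobA.setMapperClass(StepOneMapper.class);
        jobA.setReducerClass(StepOneReducer.class);
        jobA.setOutputKeyClass(Text.class);
        jobA.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(jobA, input);
        FileOutputFormat.setOutputPath(jobA, temp);
        if (!jobA.waitForCompletion(true)) {
          System.exit(1);                  // stop if step 1 failed
        }

        // Job B: combines the records that now share a key (placeholder classes).
        Job jobB = Job.getInstance(conf, "step 2");
        jobB.setJarByClass(TwoStepDriver.class);
        jobB.setMapperClass(StepTwoMapper.class);
        jobB.setReducerClass(StepTwoReducer.class);
        jobB.setOutputKeyClass(Text.class);
        jobB.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(jobB, temp);
        FileOutputFormat.setOutputPath(jobB, output);
        boolean ok = jobB.waitForCompletion(true);

        FileSystem.get(conf).delete(temp, true);   // clean up the intermediate data
        System.exit(ok ? 0 : 1);
      }
    }

Note that the intermediate directory must not already exist when Job A starts,
or FileOutputFormat will refuse to run the job; the sketch deletes it only
after Job B has finished with it.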
>
> Is there a better way to pass the output from Job A as input to Job B
> (e.g. using network streams or some built-in Java classes that don't do
> disk I/O)?
>
>
>
> The temporary file solution will work in a single-node configuration, but
> I'm not sure about an MPP config.
>
> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
> both jobs run on all 4 nodes - will HDFS be able to automagically
> redistribute the records between nodes, or does this need to be coded
> somehow?
>
>
>
Adrian CAPDEFIER 2013-09-12, 16:35
Bryan Beaudreault 2013-09-12, 17:38
Adrian CAPDEFIER 2013-09-12, 19:02
Bryan Beaudreault 2013-09-12, 19:49
Venkata K Pisupat 2013-09-12, 20:07
Shahab Yunus 2013-09-12, 17:33