Hadoop >> mail # user >> Can MapReduce run simultaneous producer/consumer processes?


Re: Can MapReduce run simultaneous producer/consumer processes?
If the producer and consumer don't need a sort between them, take a look
at ChainMapper:

http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/ChainMapper.html

If you do want the consumer to run after the sort, take a look at ChainReducer:

http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/lib/chain/ChainReducer.html

More esoteric cases will require separate MapReduce jobs, at least
with the current Hadoop framework.
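
To make the ChainMapper idea concrete, here is a rough sketch in plain Java. It
does not use the Hadoop API: Stage, ChainSketch, TOKENIZE and KEEP_LONG are
made-up names for illustration only. It just shows the piping pattern, where
each record emitted by the first map step is handed straight to the next one
in the same task, with no sort and no intermediate output directory in between.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a map step: takes one record, emits zero or more.
interface Stage {
    void map(String key, String value, BiConsumer<String, String> out);
}

final class ChainSketch {
    // "Producer" step A: split a line into words, emit (word, "1").
    static final Stage TOKENIZE = (key, line, out) -> {
        for (String w : line.split("\\s+")) {
            if (!w.isEmpty()) out.accept(w, "1");
        }
    };

    // "Consumer" step B: keep words of 3+ characters, upper-cased.
    static final Stage KEEP_LONG = (word, count, out) -> {
        if (word.length() >= 3) out.accept(word.toUpperCase(), count);
    };

    // Push one record through stages[from..]; whatever the last stage emits
    // lands in sink. Each emit is forwarded immediately, so stage B starts
    // consuming before stage A has finished its input.
    static void feed(List<Stage> stages, int from, String key, String value,
                     List<Map.Entry<String, String>> sink) {
        if (from == stages.size()) {
            sink.add(new SimpleEntry<>(key, value));
            return;
        }
        stages.get(from).map(key, value,
                (k, v) -> feed(stages, from + 1, k, v, sink));
    }

    // Run every input record through the whole chain, in order.
    static List<Map.Entry<String, String>> runChain(
            List<Stage> stages, List<Map.Entry<String, String>> input) {
        List<Map.Entry<String, String>> sink = new ArrayList<>();
        for (Map.Entry<String, String> rec : input) {
            feed(stages, 0, rec.getKey(), rec.getValue(), sink);
        }
        return sink;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = new ArrayList<>();
        input.add(new SimpleEntry<>("0", "map reduce is fun"));
        // "is" is dropped by KEEP_LONG; the other three words pass through.
        System.out.println(runChain(Arrays.asList(TOKENIZE, KEEP_LONG), input));
    }
}
```

This only works record by record, which is why it fits the "no sorting needed"
case. Anything that needs all of A's output grouped by key before B sees it
still has to go through a reduce.
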
On Thu, Jan 6, 2011 at 2:27 PM, W.P. McNeill <[EMAIL PROTECTED]> wrote:

> Say I have two MapReduce processes, A and B.  The two are algorithmically
> dissimilar, so they have to be implemented as separate MapReduce processes.
> The output of A is used as the input of B, so A has to run first.  However,
> B doesn't need to take all of A's output as input, only a partition of it.
> So in theory A and B could run at the same time in a producer/consumer
> arrangement, where B would start to work as soon as A had produced some
> output but before A had completed.  Obviously, this could be a big
> parallelization win.
>
> Is this possible in MapReduce?  I know at the most basic level it is
> not–there is no synchronization mechanism that allows the same HDFS
> directory to be used for both input and output–but is there some abstraction
> layer on top that allows it?  I've been digging around, and I think the
> answer is "No" but I want to be sure.
>
> More specifically, the only abstraction layer I'm aware of that chains
> together MapReduce processes is Cascade, and I think it requires the reduce
> steps to be serialized, but again I'm not sure because I've only read the
> documentation and haven't actually played with it.
>