Chaining Multiple Reducers: Reduce -> Reduce -> Reduce


Jim Twensky 2012-10-05, 16:31
Harsh J 2012-10-05, 17:18
Jim Twensky 2012-10-05, 17:43
Harsh J 2012-10-05, 17:54
Jim Twensky 2012-10-05, 18:02

Re: Chaining Multiple Reducers: Reduce -> Reduce -> Reduce
Have you looked at graph processing for Hadoop? Like Hama (
http://hama.apache.org/) or Giraph (http://incubator.apache.org/giraph/).
I can't say for sure it would help you but it seems to be in the same
problem domain.

With regard to the reducer-chaining issue, this is indeed a general
implementation decision of Hadoop 1.
From a purely functional point of view, regardless of performance, I guess
it could be shown that a map/reduce/map can be done with a reduce only, and
that a sequence of maps can be done with a single map. Of course, with
Hadoop the picture is a bit more complex due to the sort phase.

map -> sort -> reduce : operations in the map cannot generally be moved
past the sort when they involve the sort key, because the sort 'blocks'
them
reduce -> map : all operations of the map can be performed inside the reduce
So
map -> sort -> reduce -> map -> sort -> reduce -> map -> sort -> reduce
can generally be implemented as
map -> sort -> reduce -> sort -> reduce -> sort -> reduce
if you are willing to give up the possibility of having different scaling
options for maps and reduces.
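The equivalence above can be sketched outside Hadoop in plain Java. The toy functions below (`mapFn`, `sortFn`, `reduceFn`, `trailingMap` are all illustrative names, not Hadoop API) check that a reduce followed by a map produces exactly the same output as a single reduce with the map fused into it:

```java
import java.util.*;
import java.util.stream.*;

// Toy model of one MapReduce stage followed by a map, versus the same
// stage with the trailing map fused into the reduce. All names here are
// illustrative, not Hadoop classes.
public class FusedReduce {

    // map: token -> (token, 1), word-count style
    static List<Map.Entry<String, Integer>> mapFn(List<String> tokens) {
        return tokens.stream()
                .map(t -> Map.entry(t, 1))
                .collect(Collectors.toList());
    }

    // "sort" phase: group values by key, in key order, as Hadoop would
    static SortedMap<String, List<Integer>> sortFn(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // reduce: sum the counts per key
    static Map<String, Integer> reduceFn(SortedMap<String, List<Integer>> groups) {
        Map<String, Integer> out = new TreeMap<>();
        groups.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    // trailing map applied to the reduce output: rekey by count parity
    static Map.Entry<String, Integer> trailingMap(String key, int count) {
        return Map.entry((count % 2 == 0 ? "even:" : "odd:") + key, count);
    }

    // reduce -> map as two separate steps
    static Map<String, Integer> reduceThenMap(SortedMap<String, List<Integer>> groups) {
        Map<String, Integer> out = new TreeMap<>();
        reduceFn(groups).forEach((k, v) -> {
            Map.Entry<String, Integer> e = trailingMap(k, v);
            out.put(e.getKey(), e.getValue());
        });
        return out;
    }

    // the same computation with the map fused inside the reduce
    static Map<String, Integer> fusedReduce(SortedMap<String, List<Integer>> groups) {
        Map<String, Integer> out = new TreeMap<>();
        groups.forEach((k, vs) -> {
            Map.Entry<String, Integer> e =
                    trailingMap(k, vs.stream().mapToInt(Integer::intValue).sum());
            out.put(e.getKey(), e.getValue());
        });
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("a", "b", "a", "c", "b", "a");
        SortedMap<String, List<Integer>> groups = sortFn(mapFn(tokens));
        System.out.println(reduceThenMap(groups));
        System.out.println(fusedReduce(groups));
    }
}
```

The caveat from above still applies: the fused variant may emit keys the following sort has to re-partition, which is exactly where the shuffle comes back in.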

And that's what you are asking. But with Hadoop 1, skipping the map phase
is not an option (you could use the identity mapper, but as you said that's
not wise with regard to performance). The picture might be changing with
Hadoop 2/YARN. I can't provide the details but it may be worth it to
look at it.
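For concreteness, the Hadoop 1 workaround looks like back-to-back jobs, each fronted by an identity mapper. A minimal driver sketch with the old `mapred` API, where `SumReducer` and the paths are assumptions for illustration (only `IdentityMapper` is a real Hadoop class here):

```java
// Driver sketch for chaining reduce-heavy jobs on Hadoop 1.
// SumReducer and the paths are assumed, not real; this is not a
// runnable example outside a Hadoop 1 installation.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ChainedDriver {
    static JobConf stage(String in, String out) {
        JobConf conf = new JobConf(ChainedDriver.class);
        conf.setMapperClass(IdentityMapper.class); // pure pass-through: map cost paid for nothing
        conf.setReducerClass(SumReducer.class);    // assumed user reducer
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(in));
        FileOutputFormat.setOutputPath(conf, new Path(out));
        return conf;
    }

    public static void main(String[] args) throws Exception {
        // Each runJob blocks, so stage k+1 reads stage k's output:
        // reduce -> (identity map, sort) -> reduce -> (identity map, sort) -> reduce
        JobClient.runJob(stage("input", "stage1"));
        JobClient.runJob(stage("stage1", "stage2"));
        JobClient.runJob(stage("stage2", "final"));
    }
}
```

Each stage also round-trips through HDFS between jobs, which is the overhead being discussed in this thread.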

Regards

Bertrand

On Fri, Oct 5, 2012 at 8:02 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:

> Hi Harsh,
>
> The hidden map operation which is applied to the reduced partition at
> one stage can generate keys that are outside of the range covered by
> that particular reducer. I still need to have the many-to-many
> communication from reduce step k to reduce step k+1. Otherwise, I
> think the ChainReducer would do the job and apply multiple maps to
> each isolated partition produced by the reducer.
>
> Jim
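Jim's point, that keys generated while reducing one partition can belong to a different partition of the next stage, can be illustrated with the formula Hadoop's default HashPartitioner uses, `(hash & Integer.MAX_VALUE) % numPartitions`. The `rekey` function below is a hypothetical stand-in for the hidden map:

```java
import java.util.*;

// Toy illustration: keys generated while reducing partition p can hash
// to a different partition, so the next reduce step still needs a
// shuffle. Partitioning mimics Hadoop's default HashPartitioner.
public class RekeyedPartitions {
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // hypothetical stand-in for the hidden map applied during reduce
    static String rekey(String key) {
        return key + "#next";
    }

    public static void main(String[] args) {
        int numPartitions = 4;
        boolean escapes = false;
        for (String key : List.of("a", "b", "c", "d", "e", "f")) {
            int before = partition(key, numPartitions);
            int after = partition(rekey(key), numPartitions);
            if (before != after) escapes = true;
            System.out.printf("%s: partition %d -> %d%n", key, before, after);
        }
        // With any realistic rekeying, some keys leave their original
        // partition, which is why the next stage must re-shuffle rather
        // than reduce locally.
        System.out.println("some key changed partition: " + escapes);
    }
}
```

This is the many-to-many communication between reduce step k and reduce step k+1 that rules out a purely per-partition chain.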
>
> On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> > Would it then be right to assume that the keys produced by the reduced
> > partition at one stage would be isolated to its partition alone and
> > not occur in any of the other partition outputs? I'm guessing not,
> > based on the nature of your data?
> >
> > I'm trying to understand why the shuffle is best avoided here, and
> > whether it can be in some way, given the data. As I see it, you need a
> > re-sort based on the new key per partition, but not the shuffle? Or am
> > I wrong?
> >
> > On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <[EMAIL PROTECTED]>
> wrote:
> >> Hi Harsh,
> >>
> >> Yes, there is actually a "hidden" map stage, that generates new
> >> <key,value> pairs based on the last reduce output but I can create
> >> those records during the reduce step instead and get rid of the
> >> intermediate map computation completely. The idea is to apply the map
> >> function to each output of the reduce inside the reduce class and emit
> >> the result as the output of the reducer.
> >>
> >> Jim
> >>
> >> On Fri, Oct 5, 2012 at 12:18 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >>> Hey Jim,
> >>>
> >>> Are you looking to re-sort or re-partition your data by a different
> >>> key or key combo after each output from reduce?
> >>>
> >>> On Fri, Oct 5, 2012 at 10:01 PM, Jim Twensky <[EMAIL PROTECTED]>
> wrote:
> >>>> Hi,
> >>>>
> >>>> I have a complex Hadoop job that iterates over large graph data
> >>>> multiple times until some convergence condition is met. I know that
> >>>> the map output goes to the local disk of each particular mapper first,
> >>>> and is then fetched by the reducers before the reduce tasks start. I
> >>>> can see that this is an overhead, and in theory we could ship the data
> >>>> directly from mappers to reducers, without serializing to the local
> >>>> disk first. I understand that this step is necessary for fault
> >>>> tolerance and that it is an essential building block of MapReduce.
> >>>>
> >>>> In my application, the map process consists of identity mappers which

Bertrand Dechoux
Fabio Pitzolu 2012-10-08, 10:44
Bertrand Dechoux 2012-10-08, 10:51
Jim Twensky 2012-10-08, 19:09
Michael Segel 2012-10-08, 19:19