Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
MapReduce >> mail # user >> Chaning Multiple Reducers: Reduce -> Reduce -> Reduce


+
Jim Twensky 2012-10-05, 16:31
+
Harsh J 2012-10-05, 17:18
+
Jim Twensky 2012-10-05, 17:43
+
Harsh J 2012-10-05, 17:54
+
Jim Twensky 2012-10-05, 18:02
+
Bertrand Dechoux 2012-10-08, 10:39
+
Fabio Pitzolu 2012-10-08, 10:44
Copy link to this message
-
Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce
The question is not how to sequence all. Cascading could indeed help in
that case.

But how to skip the map phase and do the split/local sort directly at the
end of the reduce so that the next reduce need only to do a merge on the
sorted files obtained from the previous reduce. This is basically a
performance optimization (avoid unnecessary network/disk transfers).
Cascading is not equipped to do it, it will only compile the flow into a
sequence of map-reduce.

Regards

Bertrand

On Mon, Oct 8, 2012 at 12:44 PM, Fabio Pitzolu <[EMAIL PROTECTED]>wrote:

> Isn't also of some help using Cascading (http://www.cascading.org/) ?
>
> *Fabio Pitzolu*
> Consultant - BI & Infrastructure
>
> Mob. +39 3356033776
> Telefono 02 87157239
> Fax. 02 93664786
>
> *Gruppo Consulenza Innovazione - http://www.gr-ci.com*
>
>
>
>
> 2012/10/8 Bertrand Dechoux <[EMAIL PROTECTED]>
>
>> Have you looked at graph processing for Hadoop? Like Hama (
>> http://hama.apache.org/) or Giraph (http://incubator.apache.org/giraph/).
>> I can't say for sure it would help you but it seems to be in the same
>> problem domain.
>>
>> With regard to the chaining reducer issue this is indeed a general
>> implementation decision of Hadoop 1.
>> From a purely functional point of view, regardless of performance, I
>> guess it could be shown that a map/reduce/map can be done with a reduce
>> only and that a sequence of map can be done with a single map. Of course,
>> with Hadoop the picture is bit more complex due to the sort phase.
>>
>> map -> sort -> reduce : operations in map/reduce can not generally be
>> transferred due to the sort 'blocking' them when they are related to the
>> sort key
>> reduce -> map : all operations can be performed in the reduce
>> So
>> map -> sort -> reduce -> map -> sort -> reduce -> map -> sort -> reduce
>> can generally be implemented as
>> map -> sort -> reduce -> sort -> reduce -> sort -> reduce
>> if you are willing to let the possibility of having different scaling
>> options for maps and reduces
>>
>> And that's what you are asking. But with hadoop 1 the map phase is not an
>> option (even though you could use the identify but that's not a wise option
>> with regards to performance like you said). The picture might be changing
>> with Hadoop 2/YARN. I can't provide the details but it may be worth it to
>> look at it.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Fri, Oct 5, 2012 at 8:02 PM, Jim Twensky <[EMAIL PROTECTED]>wrote:
>>
>>> Hi Harsh,
>>>
>>> The hidden map operation which is applied to the reduced partition at
>>> one stage can generate keys that are outside of the range covered by
>>> that particular reducer. I still need to have the many-to-many
>>> communication from reduce step k to reduce step k+1. Otherwise, I
>>> think the ChainReducer would do the job and apply multiple maps to
>>> each isolated partition produced by the reducer.
>>>
>>> Jim
>>>
>>> On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>> > Would it then be right to assume that the keys produced by the reduced
>>> > partition at one stage would be isolated to its partition alone and
>>> > not occur in any of the other partition outputs? I'm guessing not,
>>> > based on the nature of your data?
>>> >
>>> > I'm trying to understand why shuffling is good to be avoided here, and
>>> > if it can be in some ways, given the data. As I see it, you need
>>> > re-sort based on the new key per partition, but not the shuffle? Or am
>>> > I wrong?
>>> >
>>> > On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <[EMAIL PROTECTED]>
>>> wrote:
>>> >> Hi Harsh,
>>> >>
>>> >> Yes, there is actually a "hidden" map stage, that generates new
>>> >> <key,value> pairs based on the last reduce output but I can create
>>> >> those records during the reduce step instead and get rid of the
>>> >> intermediate map computation completely. The idea is to apply the map
>>> >> function to each output of the reduce inside the reduce class and emit
>>> >> the result as the output of the reducer.
Bertrand Dechoux
+
Jim Twensky 2012-10-08, 19:09
+
Michael Segel 2012-10-08, 19:19
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB