Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> Chaning Multiple Reducers: Reduce -> Reduce -> Reduce


Copy link to this message
-
Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce
Isn't also of some help using Cascading (http://www.cascading.org/) ?

*Fabio Pitzolu*
Consultant - BI & Infrastructure

Mob. +39 3356033776
Telefono 02 87157239
Fax. 02 93664786

*Gruppo Consulenza Innovazione - http://www.gr-ci.com*

2012/10/8 Bertrand Dechoux <[EMAIL PROTECTED]>

> Have you looked at graph processing for Hadoop? Like Hama (
> http://hama.apache.org/) or Giraph (http://incubator.apache.org/giraph/).
> I can't say for sure it would help you but it seems to be in the same
> problem domain.
>
> With regard to the chaining reducer issue this is indeed a general
> implementation decision of Hadoop 1.
> From a purely functional point of view, regardless of performance, I guess
> it could be shown that a map/reduce/map can be done with a reduce only and
> that a sequence of map can be done with a single map. Of course, with
> Hadoop the picture is bit more complex due to the sort phase.
>
> map -> sort -> reduce : operations in map/reduce can not generally be
> transferred due to the sort 'blocking' them when they are related to the
> sort key
> reduce -> map : all operations can be performed in the reduce
> So
> map -> sort -> reduce -> map -> sort -> reduce -> map -> sort -> reduce
> can generally be implemented as
> map -> sort -> reduce -> sort -> reduce -> sort -> reduce
> if you are willing to let the possibility of having different scaling
> options for maps and reduces
>
> And that's what you are asking. But with hadoop 1 the map phase is not an
> option (even though you could use the identify but that's not a wise option
> with regards to performance like you said). The picture might be changing
> with Hadoop 2/YARN. I can't provide the details but it may be worth it to
> look at it.
>
> Regards
>
> Bertrand
>
>
> On Fri, Oct 5, 2012 at 8:02 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
>
>> Hi Harsh,
>>
>> The hidden map operation which is applied to the reduced partition at
>> one stage can generate keys that are outside of the range covered by
>> that particular reducer. I still need to have the many-to-many
>> communication from reduce step k to reduce step k+1. Otherwise, I
>> think the ChainReducer would do the job and apply multiple maps to
>> each isolated partition produced by the reducer.
>>
>> Jim
>>
>> On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> > Would it then be right to assume that the keys produced by the reduced
>> > partition at one stage would be isolated to its partition alone and
>> > not occur in any of the other partition outputs? I'm guessing not,
>> > based on the nature of your data?
>> >
>> > I'm trying to understand why shuffling is good to be avoided here, and
>> > if it can be in some ways, given the data. As I see it, you need
>> > re-sort based on the new key per partition, but not the shuffle? Or am
>> > I wrong?
>> >
>> > On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <[EMAIL PROTECTED]>
>> wrote:
>> >> Hi Harsh,
>> >>
>> >> Yes, there is actually a "hidden" map stage, that generates new
>> >> <key,value> pairs based on the last reduce output but I can create
>> >> those records during the reduce step instead and get rid of the
>> >> intermediate map computation completely. The idea is to apply the map
>> >> function to each output of the reduce inside the reduce class and emit
>> >> the result as the output of the reducer.
>> >>
>> >> Jim
>> >>
>> >> On Fri, Oct 5, 2012 at 12:18 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> >>> Hey Jim,
>> >>>
>> >>> Are you looking to re-sort or re-partition your data by a different
>> >>> key or key combo after each output from reduce?
>> >>>
>> >>> On Fri, Oct 5, 2012 at 10:01 PM, Jim Twensky <[EMAIL PROTECTED]>
>> wrote:
>> >>>> Hi,
>> >>>>
>> >>>> I have a complex Hadoop job that iterates over  large graph data
>> >>>> multiple times until some convergence condition is met. I know that
>> >>>> the map output goes to the local disk of each particular mapper
>> first,
>> >>>> and then fetched by the reducers before the reduce tasks start. I can
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB