The question is not how to sequence the jobs; Cascading could indeed help with
that. The question is how to skip the map phase and do the split/local sort
directly at the end of the reduce, so that the next reduce only needs to merge
the sorted files produced by the previous reduce. This is basically a
performance optimization (it avoids unnecessary network/disk transfers).
Cascading is not equipped to do it; it will only compile the flow into a
sequence of map-reduce jobs.
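To make the intended optimization concrete, here is a toy simulation in plain Python (not Hadoop; the function names and the sum-based reduce are placeholders of mine): the reducer partitions and locally sorts its own output, so the next reduce stage only needs to merge already-sorted runs instead of running a map and sort phase.

```python
import heapq

def partition(key, num_reducers):
    # Stand-in for Hadoop's HashPartitioner; integer keys keep it deterministic.
    return key % num_reducers

def reduce_and_presort(grouped, num_reducers):
    """Reduce, then split and locally sort the output per downstream partition,
    so the next stage can skip its map/sort and go straight to a merge."""
    runs = [[] for _ in range(num_reducers)]
    for key, values in grouped.items():
        new_key, new_value = key, sum(values)            # toy reduce logic
        runs[partition(new_key, num_reducers)].append((new_key, new_value))
    return [sorted(run) for run in runs]                 # local sort per partition

def merge_sorted_runs(runs):
    """What the next reducer does instead of a full sort: merge sorted runs."""
    return list(heapq.merge(*runs))

# One upstream reducer pre-partitions and pre-sorts its output:
runs = reduce_and_presort({3: [1, 2], 1: [5], 2: [4, 4]}, num_reducers=2)
# A downstream reducer only merges the sorted runs it receives:
merged = merge_sorted_runs(runs)
```

In real Hadoop the merge step would still be fed over the network, but no map task and no map-side sort would be needed.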
On Mon, Oct 8, 2012 at 12:44 PM, Fabio Pitzolu <[EMAIL PROTECTED]>wrote:
> Wouldn't Cascading (http://www.cascading.org/) also be of some help?
> *Fabio Pitzolu*
> Consultant - BI & Infrastructure
> Mob. +39 3356033776
> Telefono 02 87157239
> Fax. 02 93664786
> *Gruppo Consulenza Innovazione - http://www.gr-ci.com*
> 2012/10/8 Bertrand Dechoux <[EMAIL PROTECTED]>
>> Have you looked at graph processing for Hadoop? Like Hama (
>> http://hama.apache.org/) or Giraph (http://incubator.apache.org/giraph/).
>> I can't say for sure it would help you but it seems to be in the same
>> problem domain.
>> With regard to the reducer-chaining issue, this is indeed a general
>> implementation decision of Hadoop 1.
>> From a purely functional point of view, regardless of performance, I
>> guess it could be shown that a map/reduce/map can be done with a reduce
>> only, and that a sequence of maps can be done with a single map. Of course,
>> with Hadoop the picture is a bit more complex due to the sort phase.
>> map -> sort -> reduce : operations in the map/reduce cannot generally be
>> moved across the sort, which 'blocks' them whenever they are related to the
>> sort key
>> reduce -> map : all operations can be performed in the reduce
>> map -> sort -> reduce -> map -> sort -> reduce -> map -> sort -> reduce
>> can generally be implemented as
>> map -> sort -> reduce -> sort -> reduce -> sort -> reduce
>> if you are willing to give up the possibility of having different scaling
>> options for maps and reduces.
>> And that's what you are asking for. But with Hadoop 1 the map phase is not
>> optional (you could use the identity mapper, but as you said, that is not a
>> wise option performance-wise). The picture might be changing with Hadoop
>> 2/YARN. I can't provide the details, but it may be worth looking into.
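The two equivalences above can be sketched in plain Python (a toy illustration, not Hadoop code; the function names are mine): a chain of maps collapses into one map, and a reduce followed by a map collapses into a single reduce that applies the map to each of its outputs.

```python
def compose_maps(*map_fns):
    """A sequence of maps is equivalent to a single map."""
    def single_map(record):
        for fn in map_fns:
            record = fn(record)
        return record
    return single_map

def fold_map_into_reduce(reduce_fn, map_fn):
    """reduce -> map is equivalent to a reduce that maps its own output."""
    def fused_reduce(key, values):
        return map_fn(reduce_fn(key, values))
    return fused_reduce

# Two chained maps become one map:
one_map = compose_maps(lambda x: x + 1, lambda x: x * 2)

# A reduce followed by a map becomes one reduce:
fused = fold_map_into_reduce(
    lambda k, vs: (k, sum(vs)),          # toy reduce: sum the values
    lambda kv: (kv[0], kv[1] * 10),      # toy trailing map
)
```

What this sketch cannot show is the sort: in Hadoop, a map that changes the key cannot be folded backwards across the sort phase, only forwards into the preceding reduce.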
>> On Fri, Oct 5, 2012 at 8:02 PM, Jim Twensky <[EMAIL PROTECTED]>wrote:
>>> Hi Harsh,
>>> The hidden map operation that is applied to the reduced partition at
>>> one stage can generate keys outside the range covered by that
>>> particular reducer, so I still need the many-to-many communication
>>> from reduce step k to reduce step k+1. Otherwise, I think the
>>> ChainReducer would do the job, applying multiple maps to each isolated
>>> partition produced by the reducer.
>>> On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>> > Would it then be right to assume that the keys produced by the reduced
>>> > partition at one stage would be isolated to its partition alone and
>>> > not occur in any of the other partition outputs? I'm guessing not,
>>> > based on the nature of your data?
>>> > I'm trying to understand why shuffling is good to avoid here, and
>>> > whether it can be avoided in some way, given the data. As I see it, you
>>> > need a re-sort based on the new key per partition, but not the shuffle?
>>> > Or am I wrong?
>>> > On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
>>> >> Hi Harsh,
>>> >> Yes, there is actually a "hidden" map stage that generates new
>>> >> <key,value> pairs based on the last reduce output, but I can create
>>> >> those records during the reduce step instead and get rid of the
>>> >> intermediate map computation completely. The idea is to apply the map
>>> >> function to each output of the reduce inside the reduce class, and emit
>>> >> the result as the output of the reducer.
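As a toy sketch of that idea (plain Python, not Hadoop; the names and the sum-based reduce are placeholders of mine), the reduce applies the hidden map inline and emits the mapped record directly; note the new key may belong to another partition, which is why the shuffle into the next reduce step is still needed.

```python
def hidden_map(key, value):
    # Placeholder for the real record-generating map logic.
    return key + 1, value * 2

def reduce_with_inline_map(key, values):
    """A reducer that applies the hidden map to each record it would have
    emitted, removing the need for a separate intermediate map job."""
    reduced = sum(values)              # the actual reduce logic (a toy sum)
    yield hidden_map(key, reduced)     # emit the mapped record as reduce output
```

In Hadoop terms this is the body of the `reduce()` method calling the map function on each output pair before writing it to the context.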