Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HDFS >> mail # user >> Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce

Copy link to this message
Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce
Mike, just FYI, it's my 08's approach[1].

1. https://blogs.apache.org/hama/entry/how_will_hama_bsp_different

On Tue, Oct 9, 2012 at 7:50 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
> Jim,
> You can use the combiner as a reducer albeit you won't get down to a single reduce file output. But you don't need that.
> As long as the output from the combiner matches the input to the next reducer you should be ok.
> Without knowing the specifics, all I can say is TANSTAAFL that is to say that in a map/reduce paradigm you need to have some sort of mapper.
> I would also point you to look at using HBase and temp tables. While the writes have more overhead than writing directly to HDFS, it may make things a bit  more interesting.
> Again, the usual caveats about YMMV and things.
> -Mike
> On Oct 8, 2012, at 3:53 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
>> Hi Mike,
>> I'm already doing that but the output of the reduce goes straight back
>> to HDFS to be consumed by the next Identity Mapper. Combiners just
>> reduce the amount of data between map and reduce whereas I'm looking
>> for an optimization between reduce and map.
>> Jim
>> On Mon, Oct 8, 2012 at 2:19 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>>> Well I was thinking ...
>>> Map -> Combiner -> Reducer -> Identity Mapper -> combiner -> reducer -> Identity Mapper -> combiner -> reducer...
>>> May make things easier.
>>> HTH
>>> 0Mike
>>> On Oct 8, 2012, at 2:09 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
>>>> Thank you for the comments. Some similar frameworks I looked at
>>>> include Haloop, Twister, Hama, Giraph and Cascading. I am also doing
>>>> large scale graph processing so I assumed one of them could serve the
>>>> purpose. Here is a summary of what I found out about them that is
>>>> relevant:
>>>> 1) Haloop and Twister: They cache static data among a chain of
>>>> MapReduce jobs. The main contribution is to reduce the intermediate
>>>> data shipped from mappers to reducers. Still, the output of each
>>>> reduce goes to the file system.
>>>> 2) Cascading: A higher level API to create MapReduce workflows.
>>>> Anything you can do with Cascading can be done practically by more
>>>> programing effort and using Hadoop only. Bypassing map and running a
>>>> chain of sort->reduce->sort->reduce jobs is not possible. Please
>>>> correct me if I'm wrong.
>>>> 3) Giraph: Built on the BSP model and is very similar to Pregel. I
>>>> couldn't find a detailed overview of their architecture but my
>>>> understanding is that your data needs to fit in distributed memory,
>>>> which is also true for Pregel.
>>>> 4) Hama: Also follows the BSP model. I don't know how the intermediate
>>>> data is serialized and passed to the next set of nodes and whether it
>>>> is possible to do a performance optimization similar to what I am
>>>> asking for. If anyone who used Hama can point a few articles about how
>>>> the framework actually works and handles the messages passed between
>>>> vertices, I'd really appreciate that.
>>>> Conclusion: None of the above tools can bypass the map step or do a
>>>> similar performance optimization. Of course Giraph and Hama are built
>>>> on a different model - not really MapReduce - so it is not very
>>>> accurate to say that they don't have the required functionality.
>>>> If I'm missing anything and.or if there are folks who used Giraph or
>>>> Hama and think that they might serve the purpose, I'd be glad to hear
>>>> more.
>>>> Jim
>>>> On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>>>>> I don't believe that Hama would suffice.
>>>>> In terms of M/R where you want to chain reducers...
>>>>> Can you chain combiners? (I don't think so, but you never know)
>>>>> If not, you end up with a series of M/R jobs and the Mappers are just identity mappers.
>>>>> Or you could use HBase, with a small caveat... you have to be careful not to use speculative execution and that if a task fails, that the results of the task won't be affected if they are run a second time. Meaning that they will just overwrite the data in a column with a second cell and that you don't care about the number of versions.

Best Regards, Edward J. Yoon