Re: Chaining Multiple Reducers: Reduce -> Reduce -> Reduce
Michael Segel 2012-10-08, 22:50
Jim,

You can use the combiner as a reducer, albeit you won't get down to a single reduce file output. But you don't need that.
As long as the output from the combiner matches the input to the next reducer, you should be OK.
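
A minimal sketch of that, assuming the new org.apache.hadoop.mapreduce API; the SumReducer name and the Text/LongWritable types are made up for illustration. Reusing the reducer as the combiner only works when its output key/value types equal its input types:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input and output are both <Text, LongWritable>, so the same class can run
// as the combiner and as the reducer that feeds the next stage.
public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        ctx.write(key, new LongWritable(sum));
    }
}

// In the driver:
//   job.setCombinerClass(SumReducer.class);  // combine output == reduce input
//   job.setReducerClass(SumReducer.class);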

Without knowing the specifics, all I can say is TANSTAAFL (there ain't no such thing as a free lunch); that is to say, in a map/reduce paradigm you need to have some sort of mapper.
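
For what such a chain can look like in practice, here is a rough sketch of two back-to-back jobs, assuming your own RealMapper, StageOneReduce, and StageTwoReduce classes with matching types; the ChainDriver name and the paths are placeholders as well. With the new API the base Mapper class already acts as an identity mapper, so the second job only needs its reducer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Stage 1: real map + reduce, written as a sequence file so the next
        // stage can read the keys/values back through the identity mapper.
        Job stage1 = new Job(conf, "stage-1");
        stage1.setJarByClass(ChainDriver.class);
        stage1.setMapperClass(RealMapper.class);
        stage1.setReducerClass(StageOneReduce.class);
        stage1.setOutputKeyClass(Text.class);
        stage1.setOutputValueClass(LongWritable.class);
        stage1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(stage1, new Path("/data/in"));
        FileOutputFormat.setOutputPath(stage1, new Path("/tmp/stage1"));
        stage1.waitForCompletion(true);

        // Stage 2: identity map (the base Mapper class) + the next reduce.
        Job stage2 = new Job(conf, "stage-2");
        stage2.setJarByClass(ChainDriver.class);
        stage2.setMapperClass(Mapper.class);              // identity mapper
        stage2.setReducerClass(StageTwoReduce.class);
        stage2.setOutputKeyClass(Text.class);
        stage2.setOutputValueClass(LongWritable.class);
        stage2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(stage2, new Path("/tmp/stage1"));
        FileOutputFormat.setOutputPath(stage2, new Path("/data/out"));
        stage2.waitForCompletion(true);
    }
}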

I would also point you toward HBase and temp tables. While the writes have more overhead than writing directly to HDFS, it may make things a bit more interesting.
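
As a rough sketch of that HBase route, here is a reducer that lands its output in a temp table, written against the 0.92-era org.apache.hadoop.hbase.mapreduce API; the table name "tmp_stage1", column family "d", and qualifier "sum" are made-up placeholders:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class TempTableReducer
        extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        // A re-run of a failed task writes the same row/column again, so the
        // second attempt simply adds a newer cell version.
        byte[] row = Bytes.toBytes(key.toString());
        Put put = new Put(row);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
        ctx.write(new ImmutableBytesWritable(row), put);
    }
}

// In the driver, instead of a FileOutputFormat:
//   TableMapReduceUtil.initTableReducerJob("tmp_stage1", TempTableReducer.class, job);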

Again, the usual caveats about YMMV and things.

-Mike

On Oct 8, 2012, at 3:53 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:

> Hi Mike,
>
> I'm already doing that but the output of the reduce goes straight back
> to HDFS to be consumed by the next Identity Mapper. Combiners just
> reduce the amount of data between map and reduce whereas I'm looking
> for an optimization between reduce and map.
>
> Jim
>
> On Mon, Oct 8, 2012 at 2:19 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>> Well I was thinking ...
>>
>> Map -> Combiner -> Reducer -> Identity Mapper -> combiner -> reducer -> Identity Mapper -> combiner -> reducer...
>>
>> May make things easier.
>>
>> HTH
>>
>> -Mike
>>
>> On Oct 8, 2012, at 2:09 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
>>
>>> Thank you for the comments. Some similar frameworks I looked at
>>> include Haloop, Twister, Hama, Giraph and Cascading. I am also doing
>>> large scale graph processing so I assumed one of them could serve the
>>> purpose. Here is a summary of what I found out about them that is
>>> relevant:
>>>
>>> 1) Haloop and Twister: They cache static data among a chain of
>>> MapReduce jobs. The main contribution is to reduce the intermediate
>>> data shipped from mappers to reducers. Still, the output of each
>>> reduce goes to the file system.
>>>
>>> 2) Cascading: A higher level API to create MapReduce workflows.
>>> Anything you can do with Cascading can practically be done with more
>>> programming effort using Hadoop only. Bypassing map and running a
>>> chain of sort->reduce->sort->reduce jobs is not possible. Please
>>> correct me if I'm wrong.
>>>
>>> 3) Giraph: Built on the BSP model and is very similar to Pregel. I
>>> couldn't find a detailed overview of their architecture but my
>>> understanding is that your data needs to fit in distributed memory,
>>> which is also true for Pregel.
>>>
>>> 4) Hama: Also follows the BSP model. I don't know how the intermediate
>>> data is serialized and passed to the next set of nodes and whether it
>>> is possible to do a performance optimization similar to what I am
>>> asking for. If anyone who used Hama can point a few articles about how
>>> the framework actually works and handles the messages passed between
>>> vertices, I'd really appreciate that.
>>>
>>> Conclusion: None of the above tools can bypass the map step or do a
>>> similar performance optimization. Of course Giraph and Hama are built
>>> on a different model - not really MapReduce - so it is not very
>>> accurate to say that they don't have the required functionality.
>>>
>>> If I'm missing anything and/or if there are folks who used Giraph or
>>> Hama and think that they might serve the purpose, I'd be glad to hear
>>> more.
>>>
>>> Jim
>>>
>>> On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>>>> I don't believe that Hama would suffice.
>>>>
>>>> In terms of M/R where you want to chain reducers...
>>>> Can you chain combiners? (I don't think so, but you never know)
>>>>
>>>> If not, you end up with a series of M/R jobs and the Mappers are just identity mappers.
>>>>
>>>> Or you could use HBase, with a small caveat... you have to be careful not to use speculative execution, and to make sure that if a task fails, the results won't be affected when it is run a second time. Meaning that the re-run will just overwrite the data in a column with a second cell, and that you don't care about the number of versions.
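
For the speculative-execution part of that caveat, a small sketch of the setup it implies, assuming the Hadoop 1.x "mapred.*" property names and a placeholder table "tmp_stage1" with column family "d":

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

public class SafeHBaseWriteSetup {

    // Turn speculative execution off so two attempts of the same task don't
    // race each other writing the same rows.
    static Configuration nonSpeculativeConf() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        return conf;
    }

    // Keep the temp table's column family at a single version so a re-run of
    // a failed task simply overwrites the earlier attempt's cell.
    static HTableDescriptor tempTableDescriptor() {
        HTableDescriptor table = new HTableDescriptor("tmp_stage1");
        HColumnDescriptor family = new HColumnDescriptor("d");
        family.setMaxVersions(1);
        table.addFamily(family);
        return table;
    }
}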