Re: Chaining Multiple Reducers: Reduce -> Reduce -> Reduce
Edward J. Yoon 2012-10-08, 23:03
Mike, just FYI, this is the approach I took back in '08 [1].
1. https://blogs.apache.org/hama/entry/how_will_hama_bsp_different

On Tue, Oct 9, 2012 at 7:50 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
> Jim,
>
> You can use the combiner as a reducer, albeit you won't get down to a single reduce file output. But you don't need that.
> As long as the output from the combiner matches the input to the next reducer you should be ok.
>
> Without knowing the specifics, all I can say is TANSTAAFL, that is to say that in a map/reduce paradigm you need to have some sort of mapper.
>
> I would also point you to look at using HBase and temp tables. While the writes have more overhead than writing directly to HDFS, it may make things a bit more interesting.
>
> Again, the usual caveats about YMMV and things.
>
> -Mike
>
> On Oct 8, 2012, at 3:53 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
>
>> Hi Mike,
>>
>> I'm already doing that but the output of the reduce goes straight back
>> to HDFS to be consumed by the next Identity Mapper. Combiners just
>> reduce the amount of data between map and reduce whereas I'm looking
>> for an optimization between reduce and map.
>>
>> Jim
>>
>> On Mon, Oct 8, 2012 at 2:19 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>>> Well I was thinking ...
>>>
>>> Map -> Combiner -> Reducer -> Identity Mapper -> Combiner -> Reducer -> Identity Mapper -> Combiner -> Reducer...
>>>
>>> May make things easier.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Oct 8, 2012, at 2:09 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
>>>
>>>> Thank you for the comments. Some similar frameworks I looked at
>>>> include Haloop, Twister, Hama, Giraph and Cascading. I am also doing
>>>> large scale graph processing so I assumed one of them could serve the
>>>> purpose. Here is a summary of what I found out about them that is
>>>> relevant:
>>>>
>>>> 1) Haloop and Twister: They cache static data among a chain of
>>>> MapReduce jobs. The main contribution is to reduce the intermediate
>>>> data shipped from mappers to reducers. Still, the output of each
>>>> reduce goes to the file system.
>>>>
>>>> 2) Cascading: A higher level API to create MapReduce workflows.
>>>> Anything you can do with Cascading can be done practically by more
>>>> programming effort and using Hadoop only. Bypassing map and running a
>>>> chain of sort->reduce->sort->reduce jobs is not possible. Please
>>>> correct me if I'm wrong.
>>>>
>>>> 3) Giraph: Built on the BSP model and is very similar to Pregel. I
>>>> couldn't find a detailed overview of their architecture but my
>>>> understanding is that your data needs to fit in distributed memory,
>>>> which is also true for Pregel.
>>>>
>>>> 4) Hama: Also follows the BSP model. I don't know how the intermediate
>>>> data is serialized and passed to the next set of nodes and whether it
>>>> is possible to do a performance optimization similar to what I am
>>>> asking for. If anyone who used Hama can point to a few articles about how
>>>> the framework actually works and handles the messages passed between
>>>> vertices, I'd really appreciate that.
>>>>
>>>> Conclusion: None of the above tools can bypass the map step or do a
>>>> similar performance optimization. Of course Giraph and Hama are built
>>>> on a different model - not really MapReduce - so it is not very
>>>> accurate to say that they don't have the required functionality.
>>>>
>>>> If I'm missing anything and/or if there are folks who used Giraph or
>>>> Hama and think that they might serve the purpose, I'd be glad to hear
>>>> more.
>>>>
>>>> Jim
>>>>
>>>> On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>>>>> I don't believe that Hama would suffice.
>>>>>
>>>>> In terms of M/R where you want to chain reducers...
>>>>> Can you chain combiners? (I don't think so, but you never know)
>>>>>
>>>>> If not, you end up with a series of M/R jobs and the Mappers are just identity mappers.
>>>>>
>>>>> Or you could use HBase, with a small caveat... you have to be careful not to use speculative execution, and to make sure that if a task fails, the results won't be affected when it is run a second time. Meaning that it will just overwrite the data in a column with a second cell and that you don't care about the number of versions.

Best Regards,
Edward J. Yoon
@eddieyoon
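
For anyone reading this thread later: the "series of M/R jobs where the Mappers are just identity mappers" that Mike describes can be sketched in a few lines. This is only an in-memory illustration in plain Python, not the Hadoop API. The names run_job, run_chain, identity_mapper and the sample word-count data are all made up for the example. In a real cluster the list handed from one job to the next would be a round trip through HDFS, which is exactly the cost Jim is asking about.

```python
from collections import defaultdict


def identity_mapper(key, value):
    """Pass each record through unchanged (hypothetical helper)."""
    yield key, value


def run_job(records, mapper, reducer):
    """Simulate one map -> shuffle/sort -> reduce job in memory."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):  # stand-in for the sort phase
        output.extend(reducer(key, groups[key]))
    return output


def run_chain(records, first_mapper, reducers):
    """Run reducers back to back; every job after the first uses an
    identity mapper, so it only re-shuffles the previous reduce output."""
    output = run_job(records, first_mapper, reducers[0])
    for reducer in reducers[1:]:
        # In Hadoop this hand-off would be written to and read from HDFS.
        output = run_job(output, identity_mapper, reducer)
    return output


# Sample stages, assumed purely for illustration.
def tokenize_mapper(_line_id, line):
    for word in line.split():
        yield word, 1


def rekey_by_first_letter_reducer(word, counts):
    # Sum per word, then emit under a new key for the next reduce stage.
    yield word[0], sum(counts)


def sum_reducer(key, values):
    yield key, sum(values)


lines = [("line1", "apple avocado banana"),
         ("line2", "apple blueberry banana apple")]
result = run_chain(lines, tokenize_mapper,
                   [rekey_by_first_letter_reducer, sum_reducer])
print(result)  # [('a', 4), ('b', 3)]
```

The second stage does no map work at all; it exists only so the framework will shuffle and sort the first reducer's output by its new key.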