|
|
-
Re: Chaning Multiple Reducers: Reduce -> Reduce -> ReduceEdward J. Yoon 2012-10-08, 22:47
P.S., giraph is different in the sense that it runs as a map-only job.
On Tue, Oct 9, 2012 at 7:45 AM, Edward J. Yoon <[EMAIL PROTECTED]> wrote: >> asking for. If anyone who used Hama can point a few articles about how >> the framework actually works and handles the messages passed between >> vertices, I'd really appreciate that. > > Hama Architecture: > https://issues.apache.org/jira/secure/attachment/12528219/ApacheHamaDesign.pdf > > Hama BSP programming model: > https://issues.apache.org/jira/secure/attachment/12528218/ApacheHamaBSPProgrammingmodel.pdf > > On Tue, Oct 9, 2012 at 4:09 AM, Jim Twensky <[EMAIL PROTECTED]> wrote: >> Thank you for the comments. Some similar frameworks I looked at >> include Haloop, Twister, Hama, Giraph and Cascading. I am also doing >> large scale graph processing so I assumed one of them could serve the >> purpose. Here is a summary of what I found out about them that is >> relevant: >> >> 1) Haloop and Twister: They cache static data among a chain of >> MapReduce jobs. The main contribution is to reduce the intermediate >> data shipped from mappers to reducers. Still, the output of each >> reduce goes to the file system. >> >> 2) Cascading: A higher level API to create MapReduce workflows. >> Anything you can do with Cascading can be done practically by more >> programing effort and using Hadoop only. Bypassing map and running a >> chain of sort->reduce->sort->reduce jobs is not possible. Please >> correct me if I'm wrong. >> >> 3) Giraph: Built on the BSP model and is very similar to Pregel. I >> couldn't find a detailed overview of their architecture but my >> understanding is that your data needs to fit in distributed memory, >> which is also true for Pregel. >> >> 4) Hama: Also follows the BSP model. I don't know how the intermediate >> data is serialized and passed to the next set of nodes and whether it >> is possible to do a performance optimization similar to what I am >> asking for. If anyone who used Hama can point a few articles about how >> the framework actually works and handles the messages passed between >> vertices, I'd really appreciate that. >> >> Conclusion: None of the above tools can bypass the map step or do a >> similar performance optimization. Of course Giraph and Hama are built >> on a different model - not really MapReduce - so it is not very >> accurate to say that they don't have the required functionality. >> >> If I'm missing anything and.or if there are folks who used Giraph or >> Hama and think that they might serve the purpose, I'd be glad to hear >> more. >> >> Jim >> >> On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel <[EMAIL PROTECTED]> wrote: >>> I don't believe that Hama would suffice. >>> >>> In terms of M/R where you want to chain reducers... >>> Can you chain combiners? (I don't think so, but you never know) >>> >>> If not, you end up with a series of M/R jobs and the Mappers are just identity mappers. >>> >>> Or you could use HBase, with a small caveat... you have to be careful not to use speculative execution and that if a task fails, that the results of the task won't be affected if they are run a second time. Meaning that they will just overwrite the data in a column with a second cell and that you don't care about the number of versions. >>> >>> Note: HBase doesn't have transactions, so you would have to think about how to tag cells so that if a task dies, upon restart, you can remove the affected cells. Along with some post job synchronization... >>> >>> Again HBase may work, but there may also be additional problems that could impact your results. It will have to be evaluated on a case by case basis. >>> >>> >>> JMHO >>> >>> -Mike >>> >>> On Oct 8, 2012, at 6:35 AM, Edward J. Yoon <[EMAIL PROTECTED]> wrote: >>> >>>>> call context.write() in my mapper class)? If not, are there any other >>>>> MR platforms that can do this? I've been searching around and couldn't >>>> >>>> You can use Hama BSP[1] instead of Map/Reduce. >>>> >> Best Regards, Edward J. Yoon @eddieyoon |