HDFS, mail # user - Re: Chaining Multiple Reducers: Reduce -> Reduce -> Reduce


Re: Chaining Multiple Reducers: Reduce -> Reduce -> Reduce
Edward J. Yoon 2012-10-08, 22:45
> asking for. If anyone who has used Hama can point to a few articles
> about how the framework actually works and handles the messages passed
> between vertices, I'd really appreciate that.

Hama Architecture:
https://issues.apache.org/jira/secure/attachment/12528219/ApacheHamaDesign.pdf

Hama BSP programming model:
https://issues.apache.org/jira/secure/attachment/12528218/ApacheHamaBSPProgrammingmodel.pdf
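
Roughly, the model in those two documents boils down to peers exchanging
messages and meeting at a barrier each superstep. Below is a rough sketch
against the Hama BSP API; the class name and the message contents are made
up for illustration, not taken from Hama's examples:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Type parameters: input key/value, output key/value, message type.
public class HelloBSP extends BSP<NullWritable, NullWritable, Text, LongWritable, Text> {

  @Override
  public void bsp(BSPPeer<NullWritable, NullWritable, Text, LongWritable, Text> peer)
      throws IOException, SyncException, InterruptedException {
    // Superstep 1: every peer sends a small message to every other peer.
    for (String other : peer.getAllPeerNames()) {
      peer.send(other, new Text("hello from " + peer.getPeerName()));
    }
    peer.sync(); // barrier; messages are delivered once the superstep ends

    // Superstep 2: drain the delivered messages and emit a count per peer.
    long received = 0;
    while (peer.getCurrentMessage() != null) {
      received++;
    }
    peer.write(new Text(peer.getPeerName()), new LongWritable(received));
  }
}

The point relevant to this thread is that messages move through the
framework's queues between supersteps instead of being written back to the
file system between iterations.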

On Tue, Oct 9, 2012 at 4:09 AM, Jim Twensky <[EMAIL PROTECTED]> wrote:
> Thank you for the comments. Some similar frameworks I looked at
> include HaLoop, Twister, Hama, Giraph and Cascading. I am also doing
> large-scale graph processing, so I assumed one of them could serve the
> purpose. Here is a summary of what I found out about them that is
> relevant:
>
> 1) HaLoop and Twister: They cache static data across a chain of
> MapReduce jobs. The main contribution is to reduce the intermediate
> data shipped from mappers to reducers. Still, the output of each
> reduce goes to the file system.
>
> 2) Cascading: A higher-level API to create MapReduce workflows.
> Anything you can do with Cascading can practically be done with more
> programming effort using Hadoop only. Bypassing the map phase and
> running a chain of sort->reduce->sort->reduce jobs is not possible
> (see the plain-Hadoop sketch below). Please correct me if I'm wrong.
>
> 3) Giraph: Built on the BSP model and very similar to Pregel. I
> couldn't find a detailed overview of its architecture, but my
> understanding is that your data needs to fit in distributed memory,
> which is also true for Pregel.
>
> 4) Hama: Also follows the BSP model. I don't know how the intermediate
> data is serialized and passed to the next set of nodes, and whether it
> is possible to do a performance optimization similar to what I am
> asking for. If anyone who has used Hama can point to a few articles
> about how the framework actually works and handles the messages passed
> between vertices, I'd really appreciate that.
>
> Conclusion: None of the above tools can bypass the map step or do a
> similar performance optimization. Of course, Giraph and Hama are built
> on a different model - not really MapReduce - so it is not entirely
> accurate to say that they lack the required functionality.
>
> If I'm missing anything and/or if there are folks who have used Giraph
> or Hama and think that they might serve the purpose, I'd be glad to
> hear more.
>
> Jim
>
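
For reference, the chain described in (1) and (2) above looks roughly like
the driver below in plain Hadoop: a series of jobs where each later stage
uses the built-in identity Mapper and re-reads the previous reduce output
from HDFS. The mapper/reducer classes and the paths are placeholders, not
anything from an actual job in this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainedReduceDriver {

  // Placeholder stage-1 mapper: emits (token, 1) per whitespace-separated token.
  public static class TokenMapper extends Mapper<Object, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          ctx.write(new Text(token), ONE);
        }
      }
    }
  }

  // Placeholder reducer used by both stages: sums the counts per key.
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path in = new Path(args[0]);
    Path mid = new Path(args[1]);  // intermediate output, written to HDFS
    Path out = new Path(args[2]);

    // Stage 1: a real map + reduce; the reduce output lands on HDFS.
    Job first = new Job(conf, "stage-1");
    first.setJarByClass(ChainedReduceDriver.class);
    first.setMapperClass(TokenMapper.class);
    first.setReducerClass(SumReducer.class);
    first.setOutputKeyClass(Text.class);
    first.setOutputValueClass(LongWritable.class);
    first.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(first, in);
    FileOutputFormat.setOutputPath(first, mid);
    if (!first.waitForCompletion(true)) {
      System.exit(1);
    }

    // Stage 2: the identity Mapper only re-reads and re-shuffles stage 1's
    // output; this extra read, sort and shuffle is the overhead being
    // discussed in this thread.
    Job second = new Job(conf, "stage-2");
    second.setJarByClass(ChainedReduceDriver.class);
    second.setMapperClass(Mapper.class); // identity mapper
    second.setReducerClass(SumReducer.class);
    second.setOutputKeyClass(Text.class);
    second.setOutputValueClass(LongWritable.class);
    second.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(second, mid);
    FileOutputFormat.setOutputPath(second, out);
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}
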
> On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>> I don't believe that Hama would suffice.
>>
>> In terms of M/R, where you want to chain reducers...
>> Can you chain combiners? (I don't think so, but you never know.)
>>
>> If not, you end up with a series of M/R jobs where the mappers are just identity mappers.
>>
>> Or you could use HBase, with a small caveat... you have to be careful not to use speculative execution, and if a task fails, its results must not be affected when it is run a second time. That means a re-run will just overwrite the data in a column with a second cell, and you don't care about the number of versions.
>>
>> Note: HBase doesn't have transactions, so you would have to think about how to tag cells so that if a task dies, you can remove the affected cells upon restart, along with some post-job synchronization...
>>
>> Again, HBase may work, but there may also be additional problems that could impact your results. It will have to be evaluated on a case-by-case basis (a rough sketch of this route follows below).
>>
>>
>> JMHO
>>
>> -Mike
>>
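
For what it's worth, here is a rough sketch of that HBase route, using
2012-era HBase/Hadoop APIs; the table name, column family and reducer logic
are illustrative assumptions, not anything from this thread.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// A reducer that writes its output into an HBase table instead of HDFS,
// so the next job can read it back with a table input format.
public class HBaseSinkReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    // Idempotent write: a re-run of the task overwrites the same cell with
    // another version, so the job must not care about version counts.
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
    context.write(new ImmutableBytesWritable(put.getRow()), put);
  }

  // Job wiring; the mapper and input path setup are omitted. Speculative
  // execution is switched off, per the caveat above (Hadoop 1.x property).
  public static Job createJob(Configuration base) throws IOException {
    Configuration conf = HBaseConfiguration.create(base);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    Job job = new Job(conf, "reduce-into-hbase");
    job.setJarByClass(HBaseSinkReducer.class);
    TableMapReduceUtil.initTableReducerJob("intermediate_table", HBaseSinkReducer.class, job);
    return job;
  }
}

Cleaning up after a failed task (the cell tagging mentioned above) would
still have to be handled separately.
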
>> On Oct 8, 2012, at 6:35 AM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
>>
>>>> call context.write() in my mapper class)? If not, are there any other
>>>> MR platforms that can do this? I've been searching around and couldn't
>>>
>>> You can use Hama BSP[1] instead of Map/Reduce.
>>>
>>> There is no stable release yet, but I have confirmed that a large graph
>>> with billions of nodes and edges can be crunched in a few minutes[2].
>>>
>>> 1. http://hama.apache.org
>>> 2. http://wiki.apache.org/hama/Benchmarks
>>>
>>> On Sat, Oct 6, 2012 at 1:31 AM, Jim Twensky <[EMAIL PROTECTED]> wrote:

Best Regards, Edward J. Yoon
@eddieyoon