Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce, mail # user - Accumulo and Mapreduce


Copy link to this message
-
Re: Accumulo and Mapreduce
Ted Dunning 2013-03-04, 19:43
Chaining the jobs is a fantastically inefficient solution.  If you use Pig
or Cascading, the optimizer will glue all of your map functions into a
single mapper.  The result is something like:

    (mapper1 -> mapper2 -> mapper3) => reducer

Here the parentheses indicate that all of the map functions are executed as
a single function formed by composing mapper1, mapper2, and mapper3.
 Writing multiple jobs to do this forces *lots* of unnecessary traffic to
your persistent store and lots of unnecessary synchronization.

You can do this optimization by hand, but using a higher level language is
often better for maintenance.
On Mon, Mar 4, 2013 at 1:52 PM, Russell Jurney <[EMAIL PROTECTED]>wrote:

> You can chain MR jobs with Oozie, but would suggest using Cascading, Pig
> or Hive. You can do this is a couple lines of code, I suspect. Two map
> reduce jobs should not pose any kind of challenge with the right tools.
>
>
> On Monday, March 4, 2013, Sandy Ryza wrote:
>
>> Hi Aji,
>>
>> Oozie is a mature project for managing MapReduce workflows.
>> http://oozie.apache.org/
>>
>> -Sandy
>>
>>
>> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <[EMAIL PROTECTED]>wrote:
>>
>>> Aji,
>>>
>>> Why don't you just chain the jobs together?
>>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>>
>>> Justin
>>>
>>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <[EMAIL PROTECTED]> wrote:
>>> > Russell thanks for the link.
>>> >
>>> > I am interested in finding a solution (if out there) where Mapper1
>>> outputs a
>>> > custom object and Mapper 2 can use that as input. One way to do this
>>> > obviously by writing to Accumulo, in my case. But, is there another
>>> solution
>>> > for this:
>>> >
>>> > List<MyObject> ----> Input to Job
>>> >
>>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>>> <MyObjectId,
>>> > MyObject>
>>> >
>>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>>> >
>>> >
>>> >
>>> > Ideas?
>>> >
>>> >
>>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney <
>>> [EMAIL PROTECTED]>
>>> > wrote:
>>> >>
>>> >>
>>> >>
>>> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>>> >>
>>> >> AccumuloStorage for Pig comes with Accumulo. Easiest way would be to
>>> try
>>> >> it.
>>> >>
>>> >> Russell Jurney http://datasyndrome.com
>>> >>
>>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <[EMAIL PROTECTED]> wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >>  I have a MR job design with a flow like this: Mapper1 -> Mapper2 ->
>>> >> Mapper3 -> Reducer1. Mapper1's input is an accumulo table. M1's
>>> output goes
>>> >> to M2.. and so on. Finally the Reducer writes output to Accumulo.
>>> >>
>>> >> Questions:
>>> >>
>>> >> 1) Has any one tried something like this before? Are there any
>>> workflow
>>> >> control apis (in or outside of Hadoop) that can help me set up the
>>> job like
>>> >> this. Or am I limited to use Quartz for this?
>>> >> 2) If both M2 and M3 needed to write some data to two same tables in
>>> >> Accumulo, is it possible to do so? Are there any good accumulo
>>> mapreduce
>>> >> jobs you can point me to? blogs/pages that I can use for reference
>>> (starting
>>> >> point/best practices).
>>> >>
>>> >> Thank you in advance for any suggestions!
>>> >>
>>> >> Aji
>>> >>
>>> >
>>>
>>
>>
>
> --
> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.
> com
>