MapReduce >> mail # user >> Side-loading output from one MR into another?


Michael Parker 2012-08-23, 04:42
Harsh J 2012-08-23, 05:27
Michael Parker 2012-08-23, 06:57
Michael Parker 2012-08-23, 22:57

Re: Side-loading output from one MR into another?
I have a map-side join example here:

http://askhadoop.blogspot.com/2011/12/map-side-join_27.html

It is a great way to load data into memory on multiple machines.

Regards,
Serge
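
A minimal sketch of the load-into-memory idea, for the case where the user-to-cohort mapping is small enough to hold in each task. This is plain Java rather than Hadoop API code: `loadCohorts` stands in for what a mapper would do once in its setup step with a side file shipped via the distributed cache, and `tagEvents` stands in for the per-record map call. The class name, method names, and the tab-separated record layout are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; no Hadoop classes are used and all
// names below are hypothetical.
class CohortLookupSketch {

    // Stand-in for the mapper's one-time setup: parse "user<TAB>cohort"
    // lines (the output of the first job) into an in-memory map.
    static Map<String, String> loadCohorts(Iterable<String> lines) {
        Map<String, String> cohorts = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                cohorts.put(parts[0], parts[1]);
            }
        }
        return cohorts;
    }

    // Stand-in for the per-record map call: replace the user in each
    // "user<TAB>event" record with that user's cohort. Records whose
    // user has no cohort are dropped, which makes this an inner join.
    static List<String> tagEvents(List<String> events, Map<String, String> cohorts) {
        List<String> out = new ArrayList<>();
        for (String event : events) {
            String[] parts = event.split("\t", 2);
            if (parts.length < 2) {
                continue; // skip malformed records
            }
            String cohort = cohorts.get(parts[0]);
            if (cohort != null) {
                out.add(cohort + "\t" + parts[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> cohorts =
                loadCohorts(Arrays.asList("alice\t2012-07", "bob\t2012-08"));
        List<String> events =
                Arrays.asList("alice\tlogin", "carol\tlogin", "bob\tpurchase");
        for (String row : tagEvents(events, cohorts)) {
            System.out.println(row);
        }
    }
}
```

Every task holds its own full copy of the map, so this only works while the mapping fits comfortably in a task's heap.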

On 8/23/12 3:57 PM, "Michael Parker" <[EMAIL PROTECTED]> wrote:

>Actually, I was able to do some tricks and reduce the size to
>something that can be held in memory.
>
>Nonetheless, if anyone has an example of or more information about a
>map-side join, I would love to see it.
>
>Thanks!
>
>- Mike
>
>
>On Wed, Aug 22, 2012 at 11:57 PM, Michael Parker
><[EMAIL PROTECTED]> wrote:
>> Thanks for the prompt reply!
>>
>> Unfortunately, it's not that small.
>>
>> I'm using the new API; are map side joins accomplished using
>> http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/contrib/utils/join/package-summary.html?
>> Are there any examples which use this package or map side joins?
>>
>> The way I was thinking of doing it was to output the user-to-cohort
>> mapping from the first MR as a SequenceFile, and then each mapper in
>> the second MR could use a SequenceFile.Reader to find the cohort for a
>> user. It seems reasonable, but is this actually doable? It's like a
>> manual map-side join, I suppose, although likely not as elegant as
>> what you were proposing.
>>
>> Thanks,
>> Mike
>>
>> On Wed, Aug 22, 2012 at 10:27 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>> If it is a small set, you can load it onto distributed cache and then
>>> onto the task's memory, or if it's pretty big, perhaps you can do a
>>> map-side join?
>>>
>>> On Thu, Aug 23, 2012 at 10:12 AM, Michael Parker
>>> <[EMAIL PROTECTED]> wrote:
>>>> Hi all,
>>>>
>>>> Is it possible to take a collection of sorted key-value pairs,
>>>> generated from one MapReduce, and side-load them into another
>>>> MapReduce, i.e. as it runs, the second MapReduce can look up the value
>>>> for a given key computed by the first MapReduce?
>>>>
>>>> I need this for a cohort study -- one MR puts users into cohorts, and
>>>> the second MR needs that user-to-cohort mapping to see how cohorts
>>>> behave over time.
>>>>
>>>> Any help would be greatly appreciated. Thanks!
>>>>
>>>> - Mike
>>>
>>>
>>>
>>> --
>>> Harsh J
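
For the case where the mapping is too big for memory, the idea behind a map-side join over sorted data can be sketched as a single-pass merge. Hadoop's own map-side join support (for example `CompositeInputFormat` in `org.apache.hadoop.mapred.join`) expects both inputs to be sorted by the same key and partitioned identically; the sketch below only shows the merge step that such a setup makes possible, in plain Java with no Hadoop classes. The class name, the `kv` helper, and the output layout are assumptions for illustration, and the code assumes each user appears at most once on the cohort side.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;

// Plain-Java illustration only; all names below are hypothetical.
class SortedMergeJoinSketch {

    // Small helper to build a key-value pair.
    static Entry<String, String> kv(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // Joins two inputs that are both sorted by user key in one pass.
    // Only a cursor into the cohort side is kept, so neither input
    // needs to be indexed or hashed. Events may repeat a user; cohort
    // keys are assumed to be unique.
    static List<String> join(List<Entry<String, String>> cohorts,
                             List<Entry<String, String>> events) {
        List<String> out = new ArrayList<>();
        int c = 0;
        for (Entry<String, String> event : events) {
            // Advance the cohort cursor until it reaches or passes this user.
            while (c < cohorts.size()
                    && cohorts.get(c).getKey().compareTo(event.getKey()) < 0) {
                c++;
            }
            // Emit a row only when both sides have the same user (inner join).
            if (c < cohorts.size()
                    && cohorts.get(c).getKey().equals(event.getKey())) {
                out.add(event.getKey() + "\t"
                        + cohorts.get(c).getValue() + "\t"
                        + event.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry<String, String>> cohorts = Arrays.asList(
                kv("alice", "2012-07"), kv("bob", "2012-08"), kv("dave", "2012-08"));
        List<Entry<String, String>> events = Arrays.asList(
                kv("alice", "login"), kv("alice", "purchase"),
                kv("carol", "login"), kv("dave", "login"));
        for (String row : join(cohorts, events)) {
            System.out.println(row);
        }
    }
}
```

The merge relies entirely on both sides sharing one sort order. If the first job's output is not already sorted and partitioned the same way as the second job's input, an extra sorting pass or a reduce-side join would be needed instead.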