MapReduce, mail # user - Newbie: Inner join - reduce side


Re: Newbie: Inner join - reduce side
Tim Robertson 2009-11-12, 15:19
OK, I missed the org.apache.hadoop.contrib.utils.join package, which
does exactly this...

Sorry, answering my own question
Tim
On Thu, Nov 12, 2009 at 4:14 PM, Tim Robertson
<[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I have 2 KVP files of 200 million+ rows each, and plan to do a
> reduce-side join (my first...).
>
> Input 1
> ----------
> KEY  TC_ID
>
> Input 2
> ----------
> KEY  OCC_ID
>
> I aim to produce an output of:
>
> Output
> ----------
> OCC_ID  TC_ID       (if there are any many-to-many matches I would flag an error)
>
>
> My plan was to indicate in the map which source each ID came from
> (e.g. emit tc-123 or occ-234 depending on the input source), and then
> in the reduce pull out the records.
>
> Can someone please sanity check if this approach is sound?  I am
> pretty sure there should be something existing I can use, but I can't
> find it.
>
> Can I determine in the Map which input file the record is coming from
> or do I need multiple jobs?
>
> Many thanks,
> Tim
>
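
For readers following the thread, the tagging plan described above can be sketched in plain Java, without any Hadoop dependency, so the join logic can be checked on its own. This is only an illustration under stated assumptions: the class name ReduceSideJoinSketch, the methods tag, reduce and join, and the sample keys and IDs are all hypothetical, and the in-memory TreeMap merely stands in for the shuffle phase. In a real job the tag would usually be derived from the input split in the mapper (for example via FileSplit.getPath()); that part is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {

    // Map step: prefix each ID with a tag naming its source ("tc" or "occ").
    static String tag(String source, String id) {
        return source + "-" + id;
    }

    // Reduce step: split the values for one key by tag, then pair them up.
    static List<String> reduce(String key, List<String> taggedValues) {
        List<String> tcIds = new ArrayList<>();
        List<String> occIds = new ArrayList<>();
        for (String v : taggedValues) {
            if (v.startsWith("tc-")) {
                tcIds.add(v.substring("tc-".length()));
            } else if (v.startsWith("occ-")) {
                occIds.add(v.substring("occ-".length()));
            } else {
                throw new IllegalArgumentException(
                        "Untagged value for key " + key + ": " + v);
            }
        }
        // Flag many-to-many matches as an error, as the original plan asks.
        if (tcIds.size() > 1 && occIds.size() > 1) {
            throw new IllegalStateException("Many-to-many join on key " + key);
        }
        // Inner join: a key present in only one input produces no output.
        List<String> out = new ArrayList<>();
        for (String occ : occIds) {
            for (String tc : tcIds) {
                out.add(occ + "\t" + tc);
            }
        }
        return out;
    }

    // Stand-in for map + shuffle: group tagged values by key, then reduce.
    // Each row is {KEY, ID}; input1 carries TC_IDs, input2 carries OCC_IDs.
    static List<String> join(String[][] input1, String[][] input2) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] row : input1) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>())
                   .add(tag("tc", row[1]));
        }
        for (String[] row : input2) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>())
                   .add(tag("occ", row[1]));
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] input1 = {{"k1", "123"}, {"k2", "456"}}; // KEY, TC_ID
        String[][] input2 = {{"k1", "234"}, {"k3", "789"}}; // KEY, OCC_ID
        for (String line : join(input1, input2)) {
            System.out.println(line); // OCC_ID <tab> TC_ID
        }
    }
}
```

Note that this sketch buffers every value for a key in memory inside reduce. That is fine when each key has only a handful of matches, but with skewed keys a real job would probably need a secondary sort or a similar trick so that one side arrives first.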