MapReduce >> mail # user >> Newbie: Inner join - reduce side


Re: Newbie: Inner join - reduce side
OK, I missed the org.apache.hadoop.contrib.utils.join package, which obviously
does exactly this...

Sorry for answering my own question.
Tim
On Thu, Nov 12, 2009 at 4:14 PM, Tim Robertson
<[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I have 2 key-value pair (KVP) files of 200 million+ rows, and plan to do a
> reduce-side join (my first...).
>
> Input 1
> ----------
> KEY  TC_ID
>
> Input 2
> ----------
> KEY  OCC_ID
>
> I aim to produce an output of:
>
> Output
> ----------
> OCC_ID  TC_ID       (if there are any many-to-many matches I would flag an error)
>
>
> My plan was to indicate in the map which source each ID came from
> (e.g. emit tc-123 or occ-234 depending on the input source), and then
> in the reduce pull out the records.
>
> Can someone please sanity check if this approach is sound?  I am
> pretty sure there should be something existing I can use, but I can't
> find it.
>
> Can I determine in the Map which input file the record is coming from
> or do I need multiple jobs?
>
> Many thanks,
> Tim
>
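
For readers following the thread, the tagging plan in the quoted message can be sketched in plain Java, without any Hadoop dependency. This is only an illustration of the idea, not the API of the contrib join package mentioned in the reply. The class name, method names, and the "tc-" / "occ-" tag constants are all hypothetical, and the grouping map stands in for the shuffle phase. In a real job the mapper would choose the tag from its input file, which I believe is available through the input split (for example FileSplit.getPath() in the newer API), so a single job should be enough.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a reduce-side inner join; not Hadoop code.
final class ReduceSideJoinSketch {

    // Assumed tags marking which input a value came from.
    static final String TC_TAG = "tc-";
    static final String OCC_TAG = "occ-";

    // "Map" step: split "KEY VALUE" and prefix the value with its source tag.
    static String[] map(String line, String tag) {
        String[] parts = line.trim().split("\\s+", 2);
        return new String[] { parts[0], tag + parts[1] };
    }

    // "Reduce" step: separate the tagged values for one key, then pair them up.
    static List<String> reduce(String key, List<String> taggedValues) {
        List<String> tcIds = new ArrayList<>();
        List<String> occIds = new ArrayList<>();
        for (String value : taggedValues) {
            if (value.startsWith(TC_TAG)) {
                tcIds.add(value.substring(TC_TAG.length()));
            } else if (value.startsWith(OCC_TAG)) {
                occIds.add(value.substring(OCC_TAG.length()));
            }
        }
        // Flag a many-to-many match as an error, as the original post intends.
        if (tcIds.size() > 1 && occIds.size() > 1) {
            throw new IllegalStateException("many-to-many join on key " + key);
        }
        // Inner join: a key present in only one input yields no output.
        List<String> joined = new ArrayList<>();
        for (String occId : occIds) {
            for (String tcId : tcIds) {
                joined.add(occId + "\t" + tcId);
            }
        }
        return joined;
    }

    // Stand-in for the whole job: map both inputs, group by key, reduce.
    static List<String> join(List<String> tcLines, List<String> occLines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : tcLines) {
            String[] kv = map(line, TC_TAG);
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        for (String line : occLines) {
            String[] kv = map(line, OCC_TAG);
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            output.addAll(reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }
}
```

Calling ReduceSideJoinSketch.join with one list of "KEY TC_ID" lines and one list of "KEY OCC_ID" lines returns tab-separated "OCC_ID TC_ID" rows for keys present in both inputs. With 200 million+ rows the per-key lists would need care in a real reducer, since this sketch buffers every value for a key in memory.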