Tim Robertson 2009-11-12, 15:14
OK, I missed the org.apache.hadoop.contrib.utils.join package, which
obviously does this exact thing...
Sorry, answering my own question.
On Thu, Nov 12, 2009 at 4:14 PM, Tim Robertson
<[EMAIL PROTECTED]> wrote:
> Hi all,
> I have 2 KVP files of 200 million+ rows, and plan to do a reduce-side
> join (my first...).
> Input 1
> KEY TC_ID
> Input 2
> KEY OCC_ID
> I aim to produce an output of:
> OCC_ID TC_ID (if there are any many-to-many matches I would flag an error)
> My plan was to indicate in the map which source each ID came from
> (e.g. emit tc-123 or occ-234 depending on the input source), and then
> in the reduce pull out the records.
> Can someone please sanity check if this approach is sound? I am
> pretty sure there should be something existing I can use, but I can't
> find it.
> Can I determine in the map which input file each record is coming from,
> or do I need multiple jobs?
> Many thanks,
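
For anyone finding this thread later, the tagging plan in the quoted
message can be sketched roughly as below. This is plain Java with no
Hadoop dependency. It only imitates the map, shuffle and reduce steps in
memory and is not the contrib join package's API. The class name, the
method names, the "input1"/"input2" source labels and the inner-join
behaviour (keys present in only one input are dropped) are all
assumptions made for illustration. In a real job the source label would
have to come from the input split, for example the map.input.file job
property in the old mapred API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of a reduce-side join; not Hadoop code.
class ReduceSideJoinSketch {

    // Map step: prefix each ID with a tag naming its source ("tc-" or "occ-").
    // Rows are {KEY, ID}; the result is {KEY, taggedId}.
    static String[] mapRecord(String source, String[] row) {
        String tag = source.equals("input1") ? "tc-" : "occ-";
        return new String[] { row[0], tag + row[1] };
    }

    // Stand-in for the shuffle: group all tagged values by join key.
    static Map<String, List<String>> shuffle(List<String[]> mapped) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : mapped) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    // Reduce step: split values by tag and emit "OCC_ID<TAB>TC_ID" pairs.
    // A key with several IDs on both sides is treated as a many-to-many error.
    static List<String> reduce(String key, List<String> values) {
        List<String> tcIds = new ArrayList<>();
        List<String> occIds = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("tc-")) {
                tcIds.add(v.substring("tc-".length()));
            } else if (v.startsWith("occ-")) {
                occIds.add(v.substring("occ-".length()));
            }
        }
        if (tcIds.size() > 1 && occIds.size() > 1) {
            throw new IllegalStateException("many-to-many join on key " + key);
        }
        List<String> out = new ArrayList<>();
        for (String occ : occIds) {
            for (String tc : tcIds) {
                out.add(occ + "\t" + tc);
            }
        }
        return out;
    }

    // Runs the whole pipeline: input1 holds {KEY, TC_ID}, input2 holds {KEY, OCC_ID}.
    static List<String> join(List<String[]> input1, List<String[]> input2) {
        List<String[]> mapped = new ArrayList<>();
        for (String[] row : input1) {
            mapped.add(mapRecord("input1", row));
        }
        for (String[] row : input2) {
            mapped.add(mapRecord("input2", row));
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle(mapped).entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }
}
```

Calling ReduceSideJoinSketch.join(input1, input2) returns one
"OCC_ID<TAB>TC_ID" line per match. Note that the reducer buffers every
ID for a key in memory, which is fine for a sketch but would need care
for heavily skewed keys at 200 million+ rows.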