You're not missing anything obvious... what you're trying to do, on face
value, is not an easy thing to do. In M/R, joining is done based on
partitioning to the same reducer...how can you do that if you have a case
and foo is sent to reducer 1, bar to reducer 2? There's no way to know
where keys should be sent.
That said, there are options.
Option 1: a cross. Undesirable because of data explosion.
Option 2: If one of the data sets is large enough to fit in memory, you can
make a UDF that brings it in, and does the join for you. This is
essentially option 1.
Option 3: Less generically, exploit the join you're actually doing. In the
dummy example, it looks like you're checking if a token is contained in
another string. You could convert this into a join by tokenizing,
flattening, doing the join, etc. I don't know how close your real use case
is to what you posted.
2012/8/29 Mat Kelcey <[EMAIL PROTECTED]>
> Considering the following two relations...
> grunt> querys = load 'query' as (id:int, token:chararray);
> grunt> dump querys
> grunt> documents = load 'document' as (id:int, text:chararray);
> grunt> dump documents;
> (21,foo bar frog)
> (22,hello frog)
> Is is possible to do a join where the query:token is not equal to but
> contained in documents:text ?
> (11,foo,21,foo bar frog)
> (12,bar,21,foo bar frog)
> (13,frog,21,foo bar frog)
> (13,frog,22,hello frog)
> I can certainly do this in Java map/reduce (as we all had to in the
> dark days days before pig) but is there a way to hack this together
> with a custom udf or some other weird join backdoor (customer
> partitioner for a group or something whacky) ???
> It's been a long day, maybe I'm just missing some super obvious..