Pig >> mail # user >> Custom joins

Mat Kelcey 2012-08-29, 23:55
Mat Kelcey 2012-08-30, 00:08
Re: Custom joins
You're not missing anything obvious... what you're trying to do is, at face
value, not an easy thing to do. In M/R, a join is done by partitioning
matching keys to the same reducer... how can you do that if you have a case
like

foo bar

and foo is sent to reducer 1, bar to reducer 2? When the match is
containment rather than equality, there's no way to know where keys should
be sent.

That said, there are options.

Option 1: a cross. Undesirable because of data explosion.
Option 2: If one of the data sets is small enough to fit in memory, you can
make a UDF that brings it in and does the join for you. This is
essentially option 1 done in memory.
Option 3: Less generically, exploit the join you're actually doing. In the
dummy example, it looks like you're checking if a token is contained in
another string. You could convert this into a join by tokenizing,
flattening, doing the join, etc. I don't know how close your real use case
is to what you posted.
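
As a rough, untested sketch of option 3 against the two relations from the original post: the aliases doc_tokens, joined and result are made up for illustration, and it assumes that the built-in TOKENIZE's default delimiters (whitespace, commas and a few others) split your text the way you want.

```pig
-- Same loads as in the original post (tab-delimited input assumed).
querys    = LOAD 'query'    AS (id:int, token:chararray);
documents = LOAD 'document' AS (id:int, text:chararray);

-- Tokenize each document and flatten the bag: one row per (document, token).
doc_tokens = FOREACH documents
             GENERATE id, text, FLATTEN(TOKENIZE(text)) AS token;

-- Drop repeats so a token occurring twice in a document joins only once.
doc_tokens = DISTINCT doc_tokens;

-- Containment is now an ordinary equi-join on token.
joined = JOIN querys BY token, doc_tokens BY token;

result = FOREACH joined
         GENERATE querys::id, querys::token, doc_tokens::id, doc_tokens::text;

DUMP result;
```

Note that this matches whole tokens only, so frog would not match a document containing frogs. If you really need substring containment, this rewrite does not apply and you are back to options 1 or 2.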

2012/8/29 Mat Kelcey <[EMAIL PROTECTED]>

> Hello!
> Considering the following two relations...
> grunt> querys = load 'query' as (id:int, token:chararray);
> grunt> dump querys;
> (11,foo)
> (12,bar)
> (13,frog)
> and
> grunt> documents = load 'document' as (id:int, text:chararray);
> grunt> dump documents;
> (21,foo bar frog)
> (22,hello frog)
> Is it possible to do a join where the query:token is not equal to but
> contained in documents:text ?
> eg
> (11,foo,21,foo bar frog)
> (12,bar,21,foo bar frog)
> (13,frog,21,foo bar frog)
> (13,frog,22,hello frog)
> I can certainly do this in Java map/reduce (as we all had to in the
> dark days before pig) but is there a way to hack this together
> with a custom udf or some other weird join backdoor (custom
> partitioner for a group or something whacky) ???
> It's been a long day, maybe I'm just missing something super obvious..
> Cheers!
> Mat
Mat Kelcey 2012-08-30, 00:14
Mat Kelcey 2012-08-30, 00:29
Mat Kelcey 2012-08-30, 00:48
Russell Jurney 2012-08-30, 00:04