Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig >> mail # user >> Custom joins


+
Mat Kelcey 2012-08-29, 23:55
+
Mat Kelcey 2012-08-30, 00:08
Copy link to this message
-
Re: Custom joins
You're not missing anything obvious... what you're trying to do, on face
value, is not an easy thing to do. In M/R, joining is done based on
partitioning to the same reducer...how can you do that if you have a case

foo
bar

foo bar

and foo is sent to reducer 1, bar to reducer 2? There's no way to know
where keys should be sent.

That said, there are options.

Option 1: a cross. Undesirable because of data explosion.
Option 2: If one of the data sets is large enough to fit in memory, you can
make a UDF that brings it in, and does the join for you. This is
essentially option 1.
Option 3: Less generically, exploit the join you're actually doing. In the
dummy example, it looks like you're checking if a token is contained in
another string. You could convert this into a join by tokenizing,
flattening, doing the join, etc. I don't know how close your real use case
is to what you posted.

Jon
2012/8/29 Mat Kelcey <[EMAIL PROTECTED]>

> Hello!
>
> Considering the following two relations...
>
> grunt> querys = load 'query' as (id:int, token:chararray);
> grunt> dump querys
> (11,foo)
> (12,bar)
> (13,frog)
>
> and
>
> grunt> documents = load 'document' as (id:int, text:chararray);
> grunt> dump documents;
> (21,foo bar frog)
> (22,hello frog)
>
> Is is possible to do a join where the query:token is not equal to but
> contained in documents:text ?
>
> eg
> (11,foo,21,foo bar frog)
> (12,bar,21,foo bar frog)
> (13,frog,21,foo bar frog)
> (13,frog,22,hello frog)
>
> I can certainly do this in Java map/reduce (as we all had to in the
> dark days days before pig) but is there a way to hack this together
> with a custom udf or some other weird join backdoor (customer
> partitioner for a group or something whacky) ???
>
> It's been a long day, maybe I'm just missing some super obvious..
>
> Cheers!
> Mat
>
+
Mat Kelcey 2012-08-30, 00:14
+
Mat Kelcey 2012-08-30, 00:29
+
Mat Kelcey 2012-08-30, 00:48
+
Russell Jurney 2012-08-30, 00:04