Pig, mail # user - Efficient ways to do non-equijoins?

Jonathan Coveney 2011-01-27, 01:09
Jonathan Coveney 2011-01-27, 01:15
Alan Gates 2011-01-27, 16:48
Alan Gates 2011-01-27, 16:52
Jonathan Coveney 2011-01-27, 17:01
Re: Efficient ways to do non-equijoins?
Thejas M Nair 2011-01-27, 16:57
If B is small enough to fit in the memory of a map task, you can use a replicated join to simulate the cross, which would be much faster:
    cro = join A by 1, B by 1 using 'replicated';

If the ranges in B have a fixed size and are contiguous (e.g. 1-10, 11-20, 21-30...), you can use a UDF to map A.val to the matching B.min value and then do an ordinary equijoin.

  J = join A by getMin(val), B by min ;
  C = foreach J generate val, thing;
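
A minimal sketch of the mapping that a UDF like getMin would have to perform, written as plain Python for illustration. The name get_min and the parameters first_min and width are hypothetical; the defaults assume the 1-10, 11-20, 21-30 layout from the example above. Registering this as an actual Pig UDF is not shown here.

```python
def get_min(val, first_min=1, width=10):
    """Return the lower bound of the fixed-width range that contains val.

    Assumes ranges are contiguous, all `width` wide, and that the first
    range starts at `first_min` (1-10, 11-20, 21-30, ... by default).
    """
    # Offset so the first range starts at 0, floor to a multiple of
    # width, then shift back.
    return (val - first_min) // width * width + first_min


if __name__ == "__main__":
    for v in (1, 10, 11, 25, 30):
        print(v, "->", get_min(v))
```

Because every val maps to exactly one min, the range lookup becomes a plain equality join on that key.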


On 1/26/11 5:15 PM, "Jonathan Coveney" <[EMAIL PROTECTED]> wrote:

Also, it'd be worth thinking about this for the case where the min and maxes
are arbitrary, and also the case where they aren't overlapping. That is to
say, there is only one thing for a given value.
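
For the arbitrary but non-overlapping case described above, one possible approach (a sketch, not something proposed in this thread) is to sort the ranges by min and binary-search them for each value. This assumes B fits in memory, as in the replicated join case. The function names build_index and lookup are hypothetical, and B rows are assumed to be (thing, min, max) tuples.

```python
from bisect import bisect_right


def build_index(ranges):
    """Sort (thing, min, max) rows by min and keep the mins in a parallel list."""
    ordered = sorted(ranges, key=lambda r: r[1])
    mins = [r[1] for r in ordered]
    return mins, ordered


def lookup(index, val):
    """Return the thing whose [min, max] contains val, or None if no range does.

    Correct only when the ranges do not overlap, so that at most one
    range can contain any given value.
    """
    mins, ordered = index
    # Rightmost range whose min is <= val.
    pos = bisect_right(mins, val) - 1
    if pos < 0:
        return None
    thing, _lo, hi = ordered[pos]
    return thing if val <= hi else None


if __name__ == "__main__":
    idx = build_index([("b", 20, 45), ("a", 3, 7), ("c", 100, 100)])
    for v in (2, 5, 8, 20, 100):
        print(v, "->", lookup(idx, v))
```

Each lookup costs O(log n) in the number of ranges, instead of comparing every value against every range as the CROSS plus FILTER version does.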

2011/1/26 Jonathan Coveney <[EMAIL PROTECTED]>

> A is (val:int)
> B is (thing:chararray, min:int, max:int)
> Basically what I want is C = (val, thing) where val is between min and max
> for that thing. In sql the syntax for this would not be hard, in pig the
> naive solution I have is..
> cro = CROSS A,B;
> fil = FILTER cro BY val >= min AND val <= max;
> C = FOREACH fil GENERATE val,thing;
> I am wondering what the most efficient way of doing this sort of operation
> is. I imagine with some sort of indexing you could ideally speed things up?
> Not sure. But this is important enough that I'd be willing to do some
> legwork.
> As always, thanks for your help.