Re: Parallel Join with Pig
Dmitriy Ryaboy 2012-10-12, 20:17
The default partitioning algorithm is basically this:
reducer_id = key.hashCode() % num_reducers
Join keys that map to the same reducer_id under this function all go to
the same reducer, so a key set that happens to collapse onto one
reducer_id will overload that reducer. But if you have a reasonable hash
code distribution and a decent volume of unique keys, the records should
be evenly distributed.
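To make the formula concrete, here is a minimal sketch in Python that mimics it for integer join keys. It assumes an integer key hashes to its own value (as java.lang.Integer.hashCode() does) and that the sign bit is masked off before the modulus, as Hadoop's HashPartitioner does. Whether Pig's key wrapper hashes exactly this way is an assumption here, and the names reducer_for and load_per_reducer are made up for illustration.

```python
from collections import Counter

INT_MAX = 0x7FFFFFFF  # same value as Java's Integer.MAX_VALUE


def reducer_for(key_hash, num_reducers):
    """Mimic hash partitioning: mask off the sign bit, then take the modulus."""
    return (key_hash & INT_MAX) % num_reducers


def load_per_reducer(keys, num_reducers):
    """Count how many records land on each reducer for integer join keys."""
    # Assumption: an integer key hashes to its own value.
    return Counter(reducer_for(k, num_reducers) for k in keys)


# Join keys (field b) from the example relation in the question.
spread = load_per_reducer([1, 2, 1, 4], 20)

# Hypothetical keys that are all multiples of 20: every one maps to reducer 0.
skewed = load_per_reducer([20, 40, 60, 80], 20)

print(dict(spread))
print(dict(skewed))
```

Under these assumptions the keys 1, 2 and 4 land on three different reducers, while keys that share the same remainder modulo the reducer count pile up on one. If the real keys behave like the second case, trying a different PARALLEL value is a cheap way to check.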
On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli
<[EMAIL PROTECTED]> wrote:
> Hi all,
> I have a simple question about join in Pig.
> I want to do a simple self join on a relation in Pig. So I load two
> instances of the same relation in this way:
> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
> Then, I join by the second field:
> x = JOIN I1 BY b, I2 BY b PARALLEL 20;
> I would expect Pig to send tuples with the same join key to the same reducer.
> For example if my relation I1=I2 is:
> x 1 10
> x 2 15
> y 1 4
> y 4 7
> I expect one reducer to join the first and third tuples and another to
> handle the other two.
> What happens instead is that a single reducer does the join for all the
> tuples. This results in 19 useless reducers and 1 overloaded one.
> Can someone explain to me why this happens? Doesn't the standard Pig
> join parallelize the work by join key?
> Alberto Cordioli