-Re: Parallel Join with Pig
Alberto Cordioli 2012-10-13, 11:32
With 'key' you mean the join key, right?
If so, something doesn't work. I have 24 distinct values for my join
key, but with a parallel degree of 20, all the values go to the same
On 12 October 2012 22:17, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> The default partitioning algorithm is basically this:
> reducer_id = key.hashCode() % num_reducers
> If you are joining on values that all map to the same reducer_id using
> this function, they will go to the same reducer. But if you have a
> reasonable hash code distribution and a decent volume of unique keys,
> they should be evenly distributed.
> On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli
> <[EMAIL PROTECTED]> wrote:
>> Hi all,
>> I have a simple question about join in Pig.
>> I want to do a simple self join on a relation in Pig. So I load two
>> instances of the same relation in this way:
>> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
>> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
>> Then, I join by the second field:
>> x = JOIN I1 BY b, I2 BY b PARALLEL 20;
>> I would expect that Pig assigns to a reducer tuples with the same join key.
>> For example if my relation I1=I2 is:
>> x 1 10
>> x 2 15
>> y 1 4
>> y 4 7
>> I expect one reducer join first and third tuples and another the other two.
>> What happens instead is that a single reducer do the join for all the
>> tuples. This results in 19 useless reducer and 1 overloaded.
>> Can someone explain me why this happens? The standard Pig join does
>> not parallelize the work by join key?
>> Alberto Cordioli