|
|
Alberto Cordioli 2012-10-12, 09:46
Hi all,
I have a simple question about join in Pig. I want to do a simple self join on a relation in Pig. So I load two instances of the same relation in this way:
I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
Then, I join by the second field: x = JOIN I1 BY b, I2 BY b PARALLEL 20; I would expect that Pig assigns to a reducer tuples with the same join key. For example if my relation I1=I2 is:
x 1 10 x 2 15 y 1 4 y 4 7
I expect one reducer join first and third tuples and another the other two. What happens instead is that a single reducer do the join for all the tuples. This results in 19 useless reducer and 1 overloaded. Can someone explain me why this happens? The standard Pig join does not parallelize the work by join key? Thanks, Alberto
-- Alberto Cordioli
-
Re: Parallel Join with Pig
Dmitriy Ryaboy 2012-10-12, 20:17
The default partitioning algorithm is basically this:
reducer_id = key.hashCode() % num_reducers
If you are joining on values that all map to the same reducer_id using this function, they will go to the same reducer. But if you have a reasonable hash code distribution and a decent volume of unique keys, they should be evenly distributed.
Dmitriy
On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli <[EMAIL PROTECTED]> wrote: > Hi all, > > I have a simple question about join in Pig. > I want to do a simple self join on a relation in Pig. So I load two > instances of the same relation in this way: > > I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); > I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); > > Then, I join by the second field: > x = JOIN I1 BY b, I2 BY b PARALLEL 20; > > > I would expect that Pig assigns to a reducer tuples with the same join key. > For example if my relation I1=I2 is: > > x 1 10 > x 2 15 > y 1 4 > y 4 7 > > I expect one reducer join first and third tuples and another the other two. > What happens instead is that a single reducer do the join for all the > tuples. This results in 19 useless reducer and 1 overloaded. > > > Can someone explain me why this happens? The standard Pig join does > not parallelize the work by join key? > > > Thanks, > Alberto > > -- > Alberto Cordioli
-
Re: Parallel Join with Pig
Alberto Cordioli 2012-10-13, 11:32
With 'key' you mean the join key, right? If so, something doesn't work. I have 24 distinct values for my join key, but with a parallel degree of 20, all the values go to the same reducer. Alberto. On 12 October 2012 22:17, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote: > The default partitioning algorithm is basically this: > > reducer_id = key.hashCode() % num_reducers > > If you are joining on values that all map to the same reducer_id using > this function, they will go to the same reducer. But if you have a > reasonable hash code distribution and a decent volume of unique keys, > they should be evenly distributed. > > Dmitriy > > On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli > <[EMAIL PROTECTED]> wrote: >> Hi all, >> >> I have a simple question about join in Pig. >> I want to do a simple self join on a relation in Pig. So I load two >> instances of the same relation in this way: >> >> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); >> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); >> >> Then, I join by the second field: >> x = JOIN I1 BY b, I2 BY b PARALLEL 20; >> >> >> I would expect that Pig assigns to a reducer tuples with the same join key. >> For example if my relation I1=I2 is: >> >> x 1 10 >> x 2 15 >> y 1 4 >> y 4 7 >> >> I expect one reducer join first and third tuples and another the other two. >> What happens instead is that a single reducer do the join for all the >> tuples. This results in 19 useless reducer and 1 overloaded. >> >> >> Can someone explain me why this happens? The standard Pig join does >> not parallelize the work by join key? >> >> >> Thanks, >> Alberto >> >> -- >> Alberto Cordioli
-- Alberto Cordioli
-
Re: Parallel Join with Pig
Alberto Cordioli 2012-10-15, 08:20
Ok, I've found....
I was using values that return all the same value % number of reducers. For an unfortunate case I tested always multiple values......Ohhh, my fault. Cheers, Alberto On 13 October 2012 13:32, Alberto Cordioli <[EMAIL PROTECTED]> wrote: > With 'key' you mean the join key, right? > If so, something doesn't work. I have 24 distinct values for my join > key, but with a parallel degree of 20, all the values go to the same > reducer. > > > Alberto. > > > On 12 October 2012 22:17, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote: >> The default partitioning algorithm is basically this: >> >> reducer_id = key.hashCode() % num_reducers >> >> If you are joining on values that all map to the same reducer_id using >> this function, they will go to the same reducer. But if you have a >> reasonable hash code distribution and a decent volume of unique keys, >> they should be evenly distributed. >> >> Dmitriy >> >> On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli >> <[EMAIL PROTECTED]> wrote: >>> Hi all, >>> >>> I have a simple question about join in Pig. >>> I want to do a simple self join on a relation in Pig. So I load two >>> instances of the same relation in this way: >>> >>> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); >>> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); >>> >>> Then, I join by the second field: >>> x = JOIN I1 BY b, I2 BY b PARALLEL 20; >>> >>> >>> I would expect that Pig assigns to a reducer tuples with the same join key. >>> For example if my relation I1=I2 is: >>> >>> x 1 10 >>> x 2 15 >>> y 1 4 >>> y 4 7 >>> >>> I expect one reducer join first and third tuples and another the other two. >>> What happens instead is that a single reducer do the join for all the >>> tuples. This results in 19 useless reducer and 1 overloaded. >>> >>> >>> Can someone explain me why this happens? The standard Pig join does >>> not parallelize the work by join key? >>> >>> >>> Thanks, >>> Alberto >>> >>> -- >>> Alberto Cordioli > > > > -- > Alberto Cordioli
-- Alberto Cordioli
-
Re: Parallel Join with Pig
Jonathan Coveney 2012-10-15, 17:57
M/R is a useful but sometimes leaky abstraction. :)
2012/10/15 Alberto Cordioli <[EMAIL PROTECTED]>
> Ok, I've found.... > > I was using values that return all the same value % number of reducers. > For an unfortunate case I tested always multiple values......Ohhh, my > fault. > > > Cheers, > Alberto > > > On 13 October 2012 13:32, Alberto Cordioli <[EMAIL PROTECTED]> > wrote: > > With 'key' you mean the join key, right? > > If so, something doesn't work. I have 24 distinct values for my join > > key, but with a parallel degree of 20, all the values go to the same > > reducer. > > > > > > Alberto. > > > > > > On 12 October 2012 22:17, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote: > >> The default partitioning algorithm is basically this: > >> > >> reducer_id = key.hashCode() % num_reducers > >> > >> If you are joining on values that all map to the same reducer_id using > >> this function, they will go to the same reducer. But if you have a > >> reasonable hash code distribution and a decent volume of unique keys, > >> they should be evenly distributed. > >> > >> Dmitriy > >> > >> On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli > >> <[EMAIL PROTECTED]> wrote: > >>> Hi all, > >>> > >>> I have a simple question about join in Pig. > >>> I want to do a simple self join on a relation in Pig. So I load two > >>> instances of the same relation in this way: > >>> > >>> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); > >>> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int); > >>> > >>> Then, I join by the second field: > >>> x = JOIN I1 BY b, I2 BY b PARALLEL 20; > >>> > >>> > >>> I would expect that Pig assigns to a reducer tuples with the same join > key. > >>> For example if my relation I1=I2 is: > >>> > >>> x 1 10 > >>> x 2 15 > >>> y 1 4 > >>> y 4 7 > >>> > >>> I expect one reducer join first and third tuples and another the other > two. > >>> What happens instead is that a single reducer do the join for all the > >>> tuples. This results in 19 useless reducer and 1 overloaded. > >>> > >>> > >>> Can someone explain me why this happens? The standard Pig join does > >>> not parallelize the work by join key? > >>> > >>> > >>> Thanks, > >>> Alberto > >>> > >>> -- > >>> Alberto Cordioli > > > > > > > > -- > > Alberto Cordioli > > > > -- > Alberto Cordioli >
|
|