
Pig user mailing list - Parallel Join with Pig


Re: Parallel Join with Pig
Dmitriy Ryaboy 2012-10-12, 20:17
The default partitioning algorithm is basically this:

reducer_id = key.hashCode() % num_reducers

If you are joining on values that all map to the same reducer_id using
this function, they will go to the same reducer. But if you have a
reasonable hash code distribution and a decent volume of unique keys,
they should be evenly distributed.
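
For illustration, here is a minimal, self-contained sketch of that formula. It mirrors what Hadoop's default HashPartitioner does during the shuffle; the class name and the sample keys are just made up for the example, using the values of field b from the relation below.

public class PartitionSketch {

    // Same formula as above; the bitmask keeps the result non-negative
    // even when hashCode() is negative.
    static int reducerId(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 20;          // PARALLEL 20
        int[] joinKeys = {1, 2, 1, 4}; // values of field b in the example
        for (int k : joinKeys) {
            System.out.println("key " + k + " -> reducer " + reducerId(k, numReducers));
        }
        // Integer.hashCode() is the value itself, so keys 1, 2 and 4 land on
        // reducers 1, 2 and 4; only equal keys share a reducer.
    }
}

All rows with the same key value necessarily go to one reducer, so a skewed key distribution (many rows sharing one value of b) shows up as a single overloaded reducer even with PARALLEL set high.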

Dmitriy

On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli
<[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I have a simple question about joins in Pig.
> I want to do a simple self-join on a relation, so I load two
> instances of the same relation in this way:
>
> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
>
> Then, I join by the second field:
> x = JOIN I1 BY b, I2 BY b PARALLEL 20;
>
>
> I would expect Pig to assign tuples with the same join key to the same reducer.
> For example, if my relation I1 = I2 is:
>
> x 1 10
> x 2 15
> y 1 4
> y 4 7
>
> I would expect one reducer to join the first and third tuples and another
> to join the other two. What happens instead is that a single reducer does
> the join for all the tuples. This results in 19 idle reducers and 1 overloaded one.
>
>
> Can someone explain why this happens? Does the standard Pig join not
> parallelize the work by join key?
>
>
> Thanks,
> Alberto
>
> --
> Alberto Cordioli