Parallel Join with Pig
Hi all,

I have a simple question about joins in Pig.
I want to do a self join on a relation, so I load the same file twice:

I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);

Then I join on the second field:
x = JOIN I1 BY b, I2 BY b PARALLEL 20;
I would expect Pig to send all tuples with the same join key to the same
reducer. For example, if my relation I1 = I2 is:

x 1 10
x 2 15
y 1 4
y 4 7

I expect one reducer to join the first and third tuples, and another to
handle the other two. Instead, a single reducer does the join for all the
tuples, which leaves 19 idle reducers and 1 overloaded one.
Can someone explain why this happens? Doesn't the standard Pig join
parallelize the work by join key?
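
To make the expectation concrete, here is a minimal Python sketch of how a hash-style partitioner could spread the sample tuples over 20 reducers by join key. It is only a simplified model: the `partition` and `assign` helpers are hypothetical names, and the hash used here is Python's, not whatever Pig or Hadoop actually applies to the key.

```python
from collections import defaultdict

# Sample relation from the message, as (a, b, c) tuples.
rows = [("x", 1, 10), ("x", 2, 15), ("y", 1, 4), ("y", 4, 7)]

def partition(key, num_reducers):
    # Hypothetical stand-in for a hash partitioner:
    # non-negative hash of the key, modulo the reducer count.
    # Assumption: this is NOT Pig's actual partitioning code.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

def assign(rows, num_reducers=20):
    # Group tuples by the reducer their join key (field b) maps to.
    buckets = defaultdict(list)
    for row in rows:
        buckets[partition(row[1], num_reducers)].append(row)
    return dict(buckets)

buckets = assign(rows)
for reducer, tuples in sorted(buckets.items()):
    print(reducer, tuples)
```

In this model the two tuples with b = 1 end up in the same partition, while b = 2 and b = 4 each get their own, so the sample data would occupy at most three of the 20 reducers. More generally, the number of busy reducers cannot exceed the number of distinct join keys.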
Thanks,
Alberto

--
Alberto Cordioli