Parallel Join with Pig
Hi all,

I have a simple question about joins in Pig.
I want to do a simple self-join on a relation, so I load two
instances of the same relation like this:

I1 = LOAD '/myPath/myFile' AS (a:chararray, b:int, c:int);
I2 = LOAD '/myPath/myFile' AS (a:chararray, b:int, c:int);

Then, I join by the second field:
x = JOIN I1 BY b, I2 BY b PARALLEL 20;
I would expect Pig to assign tuples with the same join key to the
same reducer. For example, if my relation I1 = I2 is:

x 1 10
x 2 15
y 1 4
y 4 7

I would expect one reducer to join the first and third tuples, and
another reducer to join the other two.
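
Concretely, with the default hash partitioning I would expect the
keys to spread roughly like this (hash(b) % 20 is just shorthand for
whatever partitioning Pig actually applies):

key b=1  ->  reducer hash(1) % 20   (tuples (x,1,10) and (y,1,4))
key b=2  ->  reducer hash(2) % 20   (tuple (x,2,15))
key b=4  ->  reducer hash(4) % 20   (tuple (y,4,7))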
What happens instead is that a single reducer does the join for all
the tuples, which leaves 19 reducers idle and 1 overloaded.
Can someone explain to me why this happens? Doesn't the standard Pig
join parallelize the work by join key?
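
In case it is useful, here is a minimal sketch of how the key
distribution could be inspected (it reuses the I1 alias from above;
grouped and keyCounts are just illustrative names):

grouped   = GROUP I1 BY b;                                        -- one group per distinct join key
keyCounts = FOREACH grouped GENERATE group AS b, COUNT(I1) AS n;  -- tuples per key
DUMP keyCounts;

With the four example tuples above this would show three distinct
keys (1, 2 and 4), so the data itself is not skewed toward a single
key.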
Thanks,
Alberto

--
Alberto Cordioli