Re: Parallel Join with Pig
The default partitioning algorithm is basically this:

reducer_id = key.hashCode() % num_reducers

If you are joining on values that all map to the same reducer_id using
this function, they will go to the same reducer. But if you have a
reasonable hash code distribution and a decent volume of unique keys,
they should be evenly distributed.
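
For reference, a minimal sketch of what Hadoop's stock HashPartitioner does
(the extra AND with Integer.MAX_VALUE just keeps the index non-negative):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Sketch of the default HashPartitioner: each (key, value) pair goes to
    // reducer (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
    public class HashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so a negative hashCode() still yields a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }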

Dmitriy

On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli
<[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I have a question about joins in Pig.
> I want to do a self join on a relation, so I load two
> instances of the same relation in this way:
>
> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
>
> Then, I join by the second field:
> x = JOIN I1 BY b, I2 BY b PARALLEL 20;
>
>
> I would expect Pig to assign tuples with the same join key to the same reducer.
> For example, if my relation (I1 = I2) is:
>
> x 1 10
> x 2 15
> y 1 4
> y 4 7
>
> I expect one reducer to join the first and third tuples, and another to join
> the other two. What happens instead is that a single reducer does the join for
> all the tuples. This results in 19 idle reducers and 1 overloaded one.
>
>
> Can someone explain to me why this happens? Doesn't the standard Pig join
> parallelize the work by join key?
>
>
> Thanks,
> Alberto
>
> --
> Alberto Cordioli
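
As a quick check against the sample keys above (a sketch only; it assumes the
int join key is boxed as java.lang.Integer, whose hashCode() is the value
itself, and uses Hadoop's stock HashPartitioner):

    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    // Sketch: where the sample join keys 1, 2 and 4 would land with PARALLEL 20.
    public class KeyPlacement {
      public static void main(String[] args) {
        HashPartitioner<Integer, Object> partitioner = new HashPartitioner<>();
        int numReducers = 20;
        for (int key : new int[] {1, 2, 4}) {
          // With Integer keys this prints reducer ids 1, 2 and 4 respectively.
          System.out.println("key " + key + " -> reducer "
              + partitioner.getPartition(key, null, numReducers));
        }
      }
    }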