Pig user mailing list: Parallel Join with Pig


Alberto Cordioli 2012-10-12, 09:46
Dmitriy Ryaboy 2012-10-12, 20:17
Re: Parallel Join with Pig
By 'key' you mean the join key, right?
If so, something doesn't work. I have 24 distinct values for my join
key, but with a parallel degree of 20, all the values go to the same
reducer.
Alberto.
On 12 October 2012 22:17, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> The default partitioning algorithm is basically this:
>
> reducer_id = key.hashCode() % num_reducers
>
> If you are joining on values that all map to the same reducer_id using
> this function, they will go to the same reducer. But if you have a
> reasonable hash code distribution and a decent volume of unique keys,
> they should be evenly distributed.
>
> Dmitriy
>
> On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli
> <[EMAIL PROTECTED]> wrote:
>> Hi all,
>>
>> I have a simple question about joins in Pig.
>> I want to do a simple self join on a relation, so I load two
>> instances of the same relation like this:
>>
>> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
>> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
>>
>> Then, I join by the second field:
>> x = JOIN I1 BY b, I2 BY b PARALLEL 20;
>>
>>
>> I would expect Pig to assign tuples with the same join key to the same reducer.
>> For example, if my relation I1 = I2 is:
>>
>> x 1 10
>> x 2 15
>> y 1 4
>> y 4 7
>>
>> I expect one reducer to join the first and third tuples and another to
>> handle the other two.
>> What happens instead is that a single reducer does the join for all the
>> tuples. This results in 19 idle reducers and 1 overloaded one.
>>
>>
>> Can someone explain to me why this happens? Doesn't the standard Pig
>> join parallelize the work by join key?
>>
>>
>> Thanks,
>> Alberto
>>
>> --
>> Alberto Cordioli

--
Alberto Cordioli
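
A minimal, self-contained Java sketch of the hash-partitioning rule Dmitriy describes above (reducer_id = key.hashCode() % num_reducers), assuming integer join keys and the PARALLEL 20 setting from the quoted script. The class name, the sample key values 1..24, and the sign-bit mask are illustrative assumptions, not part of the thread:

import java.util.HashMap;
import java.util.Map;

public class PartitionSketch {

    // The rule Dmitriy describes, with a sign-bit mask added so a negative
    // hashCode() cannot yield a negative reducer index (an assumption here,
    // mirroring what Hadoop's default hash partitioner does).
    static int reducerFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 20;                        // PARALLEL 20, as in the quoted script
        Map<Integer, Integer> perReducer = new HashMap<>();

        // 24 distinct join keys; the concrete values 1..24 are made up for illustration.
        for (int joinKey = 1; joinKey <= 24; joinKey++) {
            int r = reducerFor(joinKey, numReducers);
            perReducer.merge(r, 1, Integer::sum);
            System.out.println("key " + joinKey + " -> reducer " + r);
        }
        System.out.println("keys per reducer: " + perReducer);
    }
}

With small consecutive integer keys, Integer.hashCode() is just the value itself, so 24 distinct keys would spread across 20 reducers under this rule; a pile-up on a single reducer would suggest the actual keys, or their hash codes, collide modulo the reducer count.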
Alberto Cordioli 2012-10-15, 08:20
Jonathan Coveney 2012-10-15, 17:57