Re: Parallel Join with Pig
Ok, I've found the problem.

I was joining on values that all return the same result for key % num_reducers.
By an unfortunate coincidence, every value I tested with behaved that way... my fault.

Cheers,
Alberto
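
For reference, a minimal sketch of the effect, assuming plain int join keys and
the default rule Dmitriy describes below (reducer_id = key.hashCode() %
num_reducers). This is plain Java with made-up key values, not the actual
Pig/Hadoop code:

public class PartitionSkewDemo {
    // Mimics Hadoop's default hash partitioning:
    // reducer = (key.hashCode() & Integer.MAX_VALUE) % numReducers.
    // For an int key, hashCode() is simply the value itself.
    static int reducerFor(int key, int numReducers) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 20;

        // Hypothetical join keys that are all multiples of 20: every one of
        // them maps to reducer 0, so a single reducer does all the work.
        int[] skewedKeys = {20, 40, 60, 80, 100, 120};
        for (int k : skewedKeys) {
            System.out.println("key " + k + " -> reducer " + reducerFor(k, numReducers));
        }
    }
}

Every key prints reducer 0 here; swap in values that are not all congruent
modulo 20 and they spread across the reducers.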
On 13 October 2012 13:32, Alberto Cordioli <[EMAIL PROTECTED]> wrote:
> By 'key' you mean the join key, right?
> If so, something doesn't work. I have 24 distinct values for my join
> key, but with a parallel degree of 20, all the values go to the same
> reducer.
>
>
> Alberto.
>
>
> On 12 October 2012 22:17, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>> The default partitioning algorithm is basically this:
>>
>> reducer_id = key.hashCode() % num_reducers
>>
>> If you are joining on values that all map to the same reducer_id using
>> this function, they will go to the same reducer. But if you have a
>> reasonable hash code distribution and a decent volume of unique keys,
>> they should be evenly distributed.
>>
>> Dmitriy
>>
>> On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli
>> <[EMAIL PROTECTED]> wrote:
>>> Hi all,
>>>
>>> I have a simple question about joins in Pig.
>>> I want to do a simple self-join on a relation, so I load two
>>> instances of the same relation in this way:
>>>
>>> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
>>> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
>>>
>>> Then, I join by the second field:
>>> x = JOIN I1 BY b, I2 BY b PARALLEL 20;
>>>
>>>
>>> I would expect Pig to assign tuples with the same join key to the
>>> same reducer. For example, if my relation I1 = I2 is:
>>>
>>> x 1 10
>>> x 2 15
>>> y 1 4
>>> y 4 7
>>>
>>> I would expect one reducer to join the first and third tuples, and
>>> another to join the other two. What happens instead is that a single
>>> reducer does the join for all the tuples, leaving 19 reducers idle
>>> and 1 overloaded.
>>>
>>>
>>> Can someone explain why this happens? Doesn't the standard Pig join
>>> parallelize the work by join key?
>>>
>>>
>>> Thanks,
>>> Alberto
>>>
>>> --
>>> Alberto Cordioli
>
>
>
> --
> Alberto Cordioli

--
Alberto Cordioli
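
For the flip side Dmitriy mentions (a reasonable hash code distribution plus a
decent volume of unique keys giving an even spread), here is a similar sketch,
again plain Java with made-up keys rather than anything from the Pig source,
that counts how many distinct keys each of 20 reducers would receive under the
same default rule:

import java.util.Random;

public class PartitionSpreadDemo {
    // Same default rule: reducer = (key.hashCode() & Integer.MAX_VALUE) % numReducers.
    static int reducerFor(int key, int numReducers) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 20;
        int[] keysPerReducer = new int[numReducers];

        // 10,000 pseudo-random keys as a stand-in for a decent volume of unique join keys.
        Random rnd = new Random(42);
        for (int i = 0; i < 10_000; i++) {
            keysPerReducer[reducerFor(rnd.nextInt(1_000_000), numReducers)]++;
        }

        // Each of the 20 reducers ends up with roughly 500 keys.
        for (int r = 0; r < numReducers; r++) {
            System.out.println("reducer " + r + ": " + keysPerReducer[r] + " keys");
        }
    }
}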