Parallel Join with Pig (Pig user mailing list)


Alberto Cordioli 2012-10-12, 09:46
Dmitriy Ryaboy 2012-10-12, 20:17
Alberto Cordioli 2012-10-13, 11:32
Re: Parallel Join with Pig
OK, I've found the problem.

I was using join keys that all hash to the same value modulo the number of
reducers: by unfortunate coincidence, every value I tested with was a
multiple of the reducer count. Ohhh, my fault.
Cheers,
Alberto
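
For context, this is exactly the failure mode Dmitriy describes below: in
Java (and therefore in Hadoop), Integer.hashCode() is just the int value
itself, so join keys that are all multiples of the reducer count collide on
reducer 0. A minimal sketch, not from the thread; the class name and key
values are hypothetical:

public class PartitionCollision {
    public static void main(String[] args) {
        int numReducers = 20;
        // Hypothetical join keys: every one a multiple of the reducer count.
        int[] joinKeys = {20, 40, 60, 80, 100, 120};
        for (int key : joinKeys) {
            // Integer.hashCode() returns the value itself, so key % 20 == 0 here.
            int reducerId = Integer.valueOf(key).hashCode() % numReducers;
            System.out.println("key " + key + " -> reducer " + reducerId);
        }
    }
}

Every key maps to reducer 0, reproducing the skew described above.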
On 13 October 2012 13:32, Alberto Cordioli <[EMAIL PROTECTED]> wrote:
> By 'key' you mean the join key, right?
> If so, something doesn't work. I have 24 distinct values for my join
> key, but with a parallel degree of 20, all the values go to the same
> reducer.
>
>
> Alberto.
>
>
> On 12 October 2012 22:17, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
>> The default partitioning algorithm is basically this:
>>
>> reducer_id = key.hashCode() % num_reducers
>>
>> If you are joining on values that all map to the same reducer_id using
>> this function, they will go to the same reducer. But if you have a
>> reasonable hash code distribution and a decent volume of unique keys,
>> they should be evenly distributed.
>>
>> Dmitriy
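
For reference, Hadoop's default partitioner is essentially this one-liner; a
minimal sketch modeled on org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
(the mask with Integer.MAX_VALUE clears the sign bit so a negative hashCode()
cannot yield a negative partition number):

import org.apache.hadoop.mapreduce.Partitioner;

public class DefaultHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // reducer_id = key.hashCode() % num_reducers, made non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}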
>>
>> On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli
>> <[EMAIL PROTECTED]> wrote:
>>> Hi all,
>>>
>>> I have a simple question about join in Pig.
>>> I want to do a simple self join on a relation in Pig. So I load two
>>> instances of the same relation in this way:
>>>
>>> I1 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
>>> I2 = LOAD '/myPath/myFile' as (a:chararray, b:int, c:int);
>>>
>>> Then, I join by the second field:
>>> x = JOIN I1 BY b, I2 BY b PARALLEL 20;
>>>
>>>
>>> I would expect Pig to assign tuples with the same join key to the same reducer.
>>> For example if my relation I1=I2 is:
>>>
>>> x 1 10
>>> x 2 15
>>> y 1 4
>>> y 4 7
>>>
>>> I would expect one reducer to join the first and third tuples, and another
>>> to join the other two. What happens instead is that a single reducer does
>>> the join for all the tuples, leaving 19 reducers idle and 1 overloaded.
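
With the sample keys above and the default hash partitioning, the tuples
would indeed spread out; a quick hypothetical check (assuming int keys,
whose hashCode() is the value itself):

public class ExpectedDistribution {
    public static void main(String[] args) {
        int numReducers = 20;
        // The join keys (field b) from the sample relation above.
        int[] bValues = {1, 2, 1, 4};
        for (int b : bValues) {
            // 1 -> reducer 1, 2 -> reducer 2, 1 -> reducer 1, 4 -> reducer 4.
            System.out.println("b=" + b + " -> reducer " + (b % numReducers));
        }
    }
}

Both tuples with b=1 co-locate on reducer 1 while the others go elsewhere;
everything lands on a single reducer only when the real keys collide modulo
the reducer count.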
>>>
>>>
>>> Can someone explain why this happens? Doesn't the standard Pig join
>>> parallelize the work by join key?
>>>
>>>
>>> Thanks,
>>> Alberto
>>>
>>> --
>>> Alberto Cordioli
>
>
>
> --
> Alberto Cordioli

--
Alberto Cordioli
Jonathan Coveney 2012-10-15, 17:57