Re: Parallel Join with Pig
Jonathan Coveney 2012-10-15, 17:57
M/R is a useful but sometimes leaky abstraction. :)

2012/10/15 Alberto Cordioli <[EMAIL PROTECTED]>

> Ok, I've found the problem.
>
> I was using join keys that all give the same value modulo the number of
> reducers. By an unfortunate coincidence, every value I tested was a
> multiple of the reducer count. Ohhh, my fault.
>
>
> Cheers,
> Alberto
>
>
> On 13 October 2012 13:32, Alberto Cordioli <[EMAIL PROTECTED]>
> wrote:
> > By 'key' you mean the join key, right?
> > If so, something isn't working. I have 24 distinct values for my join
> > key, but with a parallelism of 20, all the values go to the same
> > reducer.
> >
> >
> > Alberto.
> >
> >
> > On 12 October 2012 22:17, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> >> The default partitioning algorithm is basically this:
> >>
> >> reducer_id = key.hashCode() % num_reducers
> >>
> >> If you are joining on values that all map to the same reducer_id using
> >> this function, they will go to the same reducer. But if you have a
> >> reasonable hash code distribution and a decent volume of unique keys,
> >> they should be evenly distributed.
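> >> (A runnable sketch of this partitioning behavior is appended at the
> >> end of the thread.)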
> >>
> >> Dmitriy
> >>
> >> On Fri, Oct 12, 2012 at 2:46 AM, Alberto Cordioli
> >> <[EMAIL PROTECTED]> wrote:
> >>> Hi all,
> >>>
> >>> I have a simple question about joins in Pig.
> >>> I want to do a self-join on a relation, so I load two instances of
> >>> the same relation in this way:
> >>>
> >>> I1 = LOAD '/myPath/myFile' AS (a:chararray, b:int, c:int);
> >>> I2 = LOAD '/myPath/myFile' AS (a:chararray, b:int, c:int);
> >>>
> >>> Then, I join by the second field:
> >>> x = JOIN I1 BY b, I2 BY b PARALLEL 20;
> >>>
> >>>
> >>> I would expect Pig to assign tuples with the same join key to the
> >>> same reducer.
> >>> For example, if my relation I1 = I2 is:
> >>>
> >>> x 1 10
> >>> x 2 15
> >>> y 1 4
> >>> y 4 7
> >>>
> >>> I expect one reducer to join the first and third tuples, and another
> >>> to handle the other two.
> >>> What happens instead is that a single reducer does the join for all
> >>> the tuples. This results in 19 idle reducers and 1 overloaded one.
> >>>
> >>>
> >>> Can someone explain to me why this happens? Doesn't the standard Pig
> >>> join parallelize the work by join key?
> >>>
> >>>
> >>> Thanks,
> >>> Alberto
> >>>
> >>> --
> >>> Alberto Cordioli
> >
> >
> >
> > --
> > Alberto Cordioli
>
>
>
> --
> Alberto Cordioli
>
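
To make the partitioning behavior Dmitriy describes concrete, here is a
minimal, self-contained Java sketch. It uses the same formula as Hadoop's
default HashPartitioner, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks,
where the mask simply strips the sign bit; the PartitionDemo class and the
key values below are illustrative assumptions, and the sketch glosses over
how Pig wraps join keys before they reach the partitioner. Because Java's
Integer.hashCode() is the integer's own value, 24 distinct keys that are
all multiples of 20 collapse onto reducer 0 when 20 reducers are used.

    // Sketch of Hadoop-style default hash partitioning. Illustrative only:
    // Pig wraps join keys in its own writable types, which is glossed over here.
    public class PartitionDemo {

        // Same formula as Hadoop's default HashPartitioner: strip the sign
        // bit, then take the remainder modulo the number of reducers.
        static int reducerFor(Object key, int numReducers) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
        }

        public static void main(String[] args) {
            int numReducers = 20;

            // 24 distinct keys that are all multiples of 20: every one of
            // them lands on reducer 0, so one reducer does all the work.
            for (int k = 20; k <= 480; k += 20) {
                System.out.println("key " + k + " -> reducer " + reducerFor(k, numReducers));
            }

            // Keys with varied remainders modulo 20 spread across reducers.
            for (int k = 1; k <= 24; k++) {
                System.out.println("key " + k + " -> reducer " + reducerFor(k, numReducers));
            }
        }
    }

The first loop prints reducer 0 for every key, matching the skew described
in the original question; the second shows keys with varied remainders
spreading nearly uniformly across the 20 reducers.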