Re: Controlling the sortation of DataBags input to a UDF?
I just checked the query plan with 0.7; it also has this optimization.
-Thejas
On 8/17/10 1:32 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

Thejas, is that part of the new secondary sort optimization work that's in
trunk, or was this in 0.7?

-D

On Tue, Aug 17, 2010 at 1:17 PM, Thejas M Nair <[EMAIL PROTECTED]> wrote:

> Pig will use the sort column of the bag as a secondary sort key for the MR
> job. In the case of cogroup, though, it only does that for the first bag. If
> you have a large bag and a small one, you can order them in the Pig query
> so that the secondary sort is applied to the large one.
>
>
> This is what I tried (Pig svn trunk version):
> grunt> l1 = load 'x' as (a,b);
> grunt> l2 = load 'y' as (a,b);
> grunt> cg = cogroup l1 by a, l2 by a;
> grunt> f = foreach cg { o1 = order l1 by (b); o2 = order l2 by (b); generate group, o1, o2; }
>
> -- In the following explain output, note that there is no POSort for o1,
> -- and that it says "Secondary sort: true".
>
> grunt> explain f
> ..
> ..
> #--------------------------------------------------
> # Map Reduce Plan
> #--------------------------------------------------
> MapReduce node 1-1018
> Map Plan
> Union[tuple] - 1-1019
> |
> |---cg: Local Rearrange[tuple]{tuple}(false) - 1-1001
> |   |   |
> |   |   Project[bytearray][0] - 1-1002
> |   |
> |   |---l1: Load(file:///Users/tejas/pig_intbagcnt/trunk/x:org.apache.pig.builtin.PigStorage) - 1-997
> |
> |---cg: Local Rearrange[tuple]{bytearray}(false) - 1-1003
>    |   |
>    |   Project[bytearray][0] - 1-1004
>    |
>    |---l2: Load(file:///Users/tejas/pig_intbagcnt/trunk/y:org.apache.pig.builtin.PigStorage) - 1-998--------
> Reduce Plan
> f: Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-1015
> |
> |---f: New For Each(false,false,false)[bag] - 1-1014
>    |   |
>    |   Project[bytearray][0] - 1-1005
>    |   |
>    |   RelationToExpressionProject[bag][*] - 1-1009
>    |   |
>    |   |---Project[tuple][1] - 1-1006
>    |   |
>    |   RelationToExpressionProject[bag][*] - 1-1013
>    |   |
>    |   |---o2: POSort[bag]() - 1-1012
>    |       |   |
>    |       |   Project[bytearray][1] - 1-1011
>    |       |
>    |       |---Project[tuple][2] - 1-1010
>    |
>    |---cg: Package[tuple]{bytearray} - 1-1000--------
> Global sort: false
> Secondary sort: true
> ----------------
>
>
> On 8/17/10 11:59 AM, "Anthony Urso" <[EMAIL PROTECTED]> wrote:
>
> > I need to sort the DataBags that are input to my UDF after a COGROUP.
> > I am currently sorting them in memory but it is not going to scale in
> > the long term.
> >
> > Is there a way to control the way that Pig sorts them (e.g. as you can
> > with a WritableComparable in raw map/reduce) prior to passing them in
> > so that I don't have to respill them to disk?
> >
> > Thanks for any info,
> > Anthony
> >
>
>
>
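
For readers unfamiliar with the term, the secondary sort mentioned in the thread can be illustrated with a small stand-alone model. This is only a sketch of the general MapReduce idea (sort on a composite of group key and sort column, but group on the key alone); it is not Pig's implementation, and the function names and sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_secondary_sort(records):
    """Toy model of a shuffle that sorts on (group key, sort column).

    records: iterable of (key, value) pairs, like rows of (a, b).
    Yields (key, values) where values is a lazy iterator that is
    already ordered, so the reduce side never buffers and re-sorts.
    """
    # The framework sorts on the composite key (a, b) ...
    ordered = sorted(records, key=itemgetter(0, 1))
    # ... but groups on the primary key a only.
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, (value for _, value in group)


def reduce_side(records):
    """Collect each group's values in arrival order (assumed helper)."""
    return {key: list(values)
            for key, values in shuffle_with_secondary_sort(records)}


# Hypothetical sample rows of (a, b)
rows = [("k1", 3), ("k2", 9), ("k1", 1), ("k2", 4), ("k1", 2)]
print(reduce_side(rows))
```

In this model the values for each key arrive in sorted order without any per-group sort, which corresponds to the missing POSort for o1 in the explain output above. The second bag (o2) still shows a POSort because, as noted in the thread, only the first cogroup input gets this treatment.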