Pig >> mail # user >> Controlling the sortation of DataBags input to a UDF?


Re: Controlling the sortation of DataBags input to a UDF?
I just checked the query plan with 0.7; it also has this optimization.
-Thejas
On 8/17/10 1:32 PM, "Dmitriy Ryaboy" <[EMAIL PROTECTED]> wrote:

Thejas, is that part of the new secondary sort optimization work that's in
trunk, or was this in 0.7?

-D

On Tue, Aug 17, 2010 at 1:17 PM, Thejas M Nair <[EMAIL PROTECTED]> wrote:

> Pig will use the sort column of the bag as a secondary sort key for the MR
> job. In the case of cogroup, though, it only does that for the first bag.
> If you have a large bag and a small one, you can position them in the Pig
> query so that secondary sort is used on the large one.
>
>
> This is what I tried (Pig svn trunk version):
> grunt> l1 = load 'x' as (a,b);
> grunt> l2 = load 'y' as (a,b);
> grunt> cg = cogroup l1 by a, l2 by a;
> grunt> f = foreach cg { o1 = order l1 by (b); o2 = order l2 by (b); generate group, o1, o2;}
>
> -- In the following explain output, note that there is no POSort for o1,
> -- and it says "Secondary sort: true".
>
> grunt> explain f
> ..
> ..
> #--------------------------------------------------
> # Map Reduce Plan
> #--------------------------------------------------
> MapReduce node 1-1018
> Map Plan
> Union[tuple] - 1-1019
> |
> |---cg: Local Rearrange[tuple]{tuple}(false) - 1-1001
> |   |   |
> |   |   Project[bytearray][0] - 1-1002
> |   |
> |   |---l1: Load(file:///Users/tejas/pig_intbagcnt/trunk/x:org.apache.pig.builtin.PigStorage) - 1-997
> |
> |---cg: Local Rearrange[tuple]{bytearray}(false) - 1-1003
>    |   |
>    |   Project[bytearray][0] - 1-1004
>    |
>    |---l2: Load(file:///Users/tejas/pig_intbagcnt/trunk/y:org.apache.pig.builtin.PigStorage) - 1-998--------
> Reduce Plan
> f: Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-1015
> |
> |---f: New For Each(false,false,false)[bag] - 1-1014
>    |   |
>    |   Project[bytearray][0] - 1-1005
>    |   |
>    |   RelationToExpressionProject[bag][*] - 1-1009
>    |   |
>    |   |---Project[tuple][1] - 1-1006
>    |   |
>    |   RelationToExpressionProject[bag][*] - 1-1013
>    |   |
>    |   |---o2: POSort[bag]() - 1-1012
>    |       |   |
>    |       |   Project[bytearray][1] - 1-1011
>    |       |
>    |       |---Project[tuple][2] - 1-1010
>    |
>    |---cg: Package[tuple]{bytearray} - 1-1000--------
> Global sort: false
> Secondary sort: true
> ----------------
>
>
>
>
>
> On 8/17/10 11:59 AM, "Anthony Urso" <[EMAIL PROTECTED]> wrote:
>
> > I need to sort the DataBags that are input to my UDF after a COGROUP.
> > I am currently sorting them in memory but it is not going to scale in
> > the long term.
> >
> > Is there a way to control the way that Pig sorts them (e.g. as you can
> > with a WritableComparable in raw map/reduce) prior to passing them in
> > so that I don't have to respill them to disk?
> >
> > Thanks for any info,
> > Anthony
> >
>
>
>
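For anyone finding this thread later, here is a minimal sketch of the reordering suggested above, applied to the original question of feeding sorted bags to a UDF. The relation names, input paths, `myudfs.jar` and `myudfs.ProcessSorted` are all hypothetical placeholders, not anything from this thread. The only point is that the large relation is listed first in the COGROUP, since the explain output above shows the nested ORDER being dropped only for the first bag. Whether the optimization still applies once a UDF consumes the bags is an assumption worth checking with EXPLAIN on your own Pig version.

```pig
-- Hypothetical jar and UDF; substitute your own.
REGISTER myudfs.jar;

-- Hypothetical inputs: 'big' is assumed to produce the large bags.
big   = LOAD 'big_input'   AS (a, b);
small = LOAD 'small_input' AS (a, b);

-- List the large relation first so that its nested ORDER is the one
-- that can be handled by the secondary sort key.
cg = COGROUP big BY a, small BY a;

result = FOREACH cg {
    sorted_big   = ORDER big BY b;    -- candidate for secondary sort
    sorted_small = ORDER small BY b;  -- expected to remain a POSort in the reduce plan
    GENERATE group, myudfs.ProcessSorted(sorted_big, sorted_small);
};

-- Look for "Secondary sort: true" and for which ORDER still shows up as POSort.
EXPLAIN result;
```

If the plan shows a POSort only for the small bag, the UDF receives the large bag already ordered by the shuffle and does not need to sort it in memory.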