
Pig >> mail # user >> Controlling the sortation of DataBags input to a UDF?


Re: Controlling the sortation of DataBags input to a UDF?
Pig uses the sort column of the bag as a secondary sort key for the MR
job, so no separate sort of that bag is needed in the reduce phase.
In the case of cogroup, though, it does that only for the first bag. If
you have a large bag and a small one, you can order them in the Pig query
so that the secondary sort is applied to the large one.
This is what I tried (Pig svn trunk version):
grunt> l1 = load 'x' as (a,b);
grunt> l2 = load 'y' as (a,b);
grunt> cg = cogroup l1 by a, l2 by a;
grunt> f = foreach cg { o1 = order l1 by (b); o2 = order l2 by (b); generate group, o1, o2;}

-- In the following explain output, note that there is no POSort for o1
-- (only for o2), and that it says "Secondary sort: true".

grunt> explain f;
..
..
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-1018
Map Plan
Union[tuple] - 1-1019
|
|---cg: Local Rearrange[tuple]{tuple}(false) - 1-1001
|   |   |
|   |   Project[bytearray][0] - 1-1002
|   |
|   |---l1: Load(file:///Users/tejas/pig_intbagcnt/trunk/x:org.apache.pig.builtin.PigStorage) - 1-997
|
|---cg: Local Rearrange[tuple]{bytearray}(false) - 1-1003
    |   |
    |   Project[bytearray][0] - 1-1004
    |
    |---l2: Load(file:///Users/tejas/pig_intbagcnt/trunk/y:org.apache.pig.builtin.PigStorage) - 1-998
--------
Reduce Plan
f: Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-1015
|
|---f: New For Each(false,false,false)[bag] - 1-1014
    |   |
    |   Project[bytearray][0] - 1-1005
    |   |
    |   RelationToExpressionProject[bag][*] - 1-1009
    |   |
    |   |---Project[tuple][1] - 1-1006
    |   |
    |   RelationToExpressionProject[bag][*] - 1-1013
    |   |
    |   |---o2: POSort[bag]() - 1-1012
    |       |   |
    |       |   Project[bytearray][1] - 1-1011
    |       |
    |       |---Project[tuple][2] - 1-1010
    |
    |---cg: Package[tuple]{bytearray} - 1-1000
--------
Global sort: false
Secondary sort: true
----------------
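
To apply this to the large-bag/small-bag case, the same query can be written with the large relation listed first in the cogroup. The following is only a sketch: the relation names (big, small) and input paths ('big_input', 'small_input') are made up for illustration, and whether the secondary sort actually kicks in should be confirmed with explain on your own Pig version.

```pig
-- Hypothetical inputs: 'big_input' holds the large relation, 'small_input' the small one.
big   = load 'big_input'   as (a, b);
small = load 'small_input' as (a, b);

-- List the large relation first, since only the first bag of a cogroup
-- gets the secondary sort.
cg = cogroup big by a, small by a;

f = foreach cg {
    -- Expected to be served by the secondary sort (no POSort in the plan).
    sorted_big   = order big by b;
    -- Expected to still show up as a POSort in the reduce plan.
    sorted_small = order small by b;
    generate group, sorted_big, sorted_small;
};

-- Check the plan for "Secondary sort: true" and for which bag still has a POSort.
explain f;
```

The sorted bags can then be passed to a UDF in the generate clause in place of the unsorted ones.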

On 8/17/10 11:59 AM, "Anthony Urso" <[EMAIL PROTECTED]> wrote:

> I need to sort the DataBags that are input to my UDF after a COGROUP.
> I am currently sorting them in memory but it is not going to scale in
> the long term.
>
> Is there a way to control the way that Pig sorts them (e.g. as you can
> with a WritableComparable in raw map/reduce) prior to passing them in
> so that I don't have to respill them to disk?
>
> Thanks for any info,
> Anthony
>