Pig >> mail # user >> Controlling the sortation of DataBags input to a UDF?


Re: Controlling the sortation of DataBags input to a UDF?
Pig will use the sort column of the bag as a secondary sort key for the MR
job.
In the case of cogroup, though, it only does that for the first bag. If
you have a large bag and a small one, you can order them in the Pig query
so that the secondary sort is applied to the large one.
This is what I tried (Pig svn trunk version):
grunt> l1 = load 'x' as (a,b);
grunt> l2 = load 'y' as (a,b);
grunt> cg = cogroup l1 by a, l2 by a;
grunt> f = foreach cg { o1 = order l1 by (b); o2 = order l2 by (b); generate group, o1, o2; }

-- In the following explain output, note that there is no POSort for o1
-- (only o2 has one), and it says "Secondary sort: true".

grunt> explain f
..
..
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-1018
Map Plan
Union[tuple] - 1-1019
|
|---cg: Local Rearrange[tuple]{tuple}(false) - 1-1001
|   |   |
|   |   Project[bytearray][0] - 1-1002
|   |
|   |---l1: Load(file:///Users/tejas/pig_intbagcnt/trunk/x:org.apache.pig.builtin.PigStorage) - 1-997
|
|---cg: Local Rearrange[tuple]{bytearray}(false) - 1-1003
    |   |
    |   Project[bytearray][0] - 1-1004
    |
    |---l2: Load(file:///Users/tejas/pig_intbagcnt/trunk/y:org.apache.pig.builtin.PigStorage) - 1-998--------
Reduce Plan
f: Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-1015
|
|---f: New For Each(false,false,false)[bag] - 1-1014
    |   |
    |   Project[bytearray][0] - 1-1005
    |   |
    |   RelationToExpressionProject[bag][*] - 1-1009
    |   |
    |   |---Project[tuple][1] - 1-1006
    |   |
    |   RelationToExpressionProject[bag][*] - 1-1013
    |   |
    |   |---o2: POSort[bag]() - 1-1012
    |       |   |
    |       |   Project[bytearray][1] - 1-1011
    |       |
    |       |---Project[tuple][2] - 1-1010
    |
    |---cg: Package[tuple]{bytearray} - 1-1000--------
Global sort: false
Secondary sort: true
----------------
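
To illustrate the reordering mentioned above, here is a sketch (not run here) that assumes l2 is the larger input. It lists l2 first in the cogroup, so that, going by the behaviour described above, l2 should be the bag whose order is handled by the secondary sort:

```pig
grunt> l1 = load 'x' as (a,b);
grunt> l2 = load 'y' as (a,b);
-- assumption: l2 is the large input, so it goes first in the cogroup
grunt> cg = cogroup l2 by a, l1 by a;
grunt> f = foreach cg { o2 = order l2 by (b); o1 = order l1 by (b); generate group, o1, o2; }
```

If this behaves like the run above, explain on f should show a POSort for o1 only, still with "Secondary sort: true". It is worth checking the explain output on your own version before relying on it.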

On 8/17/10 11:59 AM, "Anthony Urso" <[EMAIL PROTECTED]> wrote:

> I need to sort the DataBags that are input to my UDF after a COGROUP.
> I am currently sorting them in memory but it is not going to scale in
> the long term.
>
> Is there a way to control the way that Pig sorts them (e.g. as you can
> with a WritableComparable in raw map/reduce) prior to passing them in
> so that I don't have to respill them to disk?
>
> Thanks for any info,
> Anthony
>