-
Controlling the sortation of DataBags input to a UDF?
Anthony Urso 2010-08-17, 18:59
I need to sort the DataBags that are input to my UDF after a COGROUP. I am currently sorting them in memory but it is not going to scale in the long term.
Is there a way to control the way that Pig sorts them (e.g. as you can with a WritableComparable in raw map/reduce) prior to passing them in so that I don't have to respill them to disk?
Thanks for any info, Anthony
-
Re: Controlling the sortation of DataBags input to a UDF?
Thejas M Nair 2010-08-17, 20:17
Pig will use the sort column of the bag as a secondary sort key for the MR job. In the case of cogroup, though, it only does that for the first bag. If you have a large bag and a small one, you can position them in the Pig query so that the secondary sort is used on the large one.

This is what I tried (Pig svn trunk version):

grunt> l1 = load 'x' as (a,b);
grunt> l2 = load 'y' as (a,b);
grunt> cg = cogroup l1 by a, l2 by a;
grunt> f = foreach cg { o1 = order l1 by (b); o2 = order l2 by (b); generate group, o1, o2;}

-- In the following explain output, note that there is no POSort for o1, and it says "Secondary sort: true".
grunt> explain f
..
..
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-1018
Map Plan
Union[tuple] - 1-1019
|
|---cg: Local Rearrange[tuple]{tuple}(false) - 1-1001
|   |   |
|   |   Project[bytearray][0] - 1-1002
|   |
|   |---l1: Load(file:///Users/tejas/pig_intbagcnt/trunk/x:org.apache.pig.builtin.PigStorage) - 1-997
|
|---cg: Local Rearrange[tuple]{bytearray}(false) - 1-1003
    |   |
    |   Project[bytearray][0] - 1-1004
    |
    |---l2: Load(file:///Users/tejas/pig_intbagcnt/trunk/y:org.apache.pig.builtin.PigStorage) - 1-998--------
Reduce Plan
f: Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-1015
|
|---f: New For Each(false,false,false)[bag] - 1-1014
    |   |
    |   Project[bytearray][0] - 1-1005
    |   |
    |   RelationToExpressionProject[bag][*] - 1-1009
    |   |
    |   |---Project[tuple][1] - 1-1006
    |   |
    |   RelationToExpressionProject[bag][*] - 1-1013
    |   |
    |   |---o2: POSort[bag]() - 1-1012
    |       |   |
    |       |   Project[bytearray][1] - 1-1011
    |       |
    |       |---Project[tuple][2] - 1-1010
    |
    |---cg: Package[tuple]{bytearray} - 1-1000--------
Global sort: false
Secondary sort: true
----------------
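As a rough sketch of the "position the large bag first" suggestion, the query might be arranged as below. The input paths, the jar name, and the UDF `my.udfs.MergeSorted` are hypothetical placeholders, not anything from the original question; the only point is that the relation expected to produce the large bags is listed first in the cogroup, so its nested order is the one that can be folded into the secondary sort.

```pig
-- Hypothetical jar containing the (hypothetical) UDF my.udfs.MergeSorted.
register 'myudfs.jar';

-- Hypothetical inputs: 'big_input' produces large bags per key, 'small_input' small ones.
big   = load 'big_input'   as (a, b);
small = load 'small_input' as (a, b);

-- List the large relation first: in the explain output above, only the
-- first cogroup input had its nested ORDER replaced by the secondary sort.
cg = cogroup big by a, small by a;

f = foreach cg {
    sorted_big   = order big by b;    -- candidate for the secondary sort key
    sorted_small = order small by b;  -- likely still a POSort in the reduce plan
    generate group, my.udfs.MergeSorted(sorted_big, sorted_small);
};

-- Inspect the plan rather than assuming the optimization applied.
explain f;
```

Whether the optimization still applies once the sorted bags are passed to a UDF is worth confirming in the explain output: look for "Secondary sort: true" and the absence of a POSort for sorted_big.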
-
Re: Controlling the sortation of DataBags input to a UDF?
Dmitriy Ryaboy 2010-08-17, 20:32
Thejas, is that part of the new secondary sort optimization work that's in trunk, or was this in 0.7?
-D
-
Re: Controlling the sortation of DataBags input to a UDF?
Thejas M Nair 2010-08-17, 23:18
I just checked the query plan with 0.7; it also has this optimization.
-Thejas