-
removing last item in a bag
Chan, Tim 2013-03-12, 23:33
How do I remove the last item in a bag?
For example:
(group_1,{(2012-12-15,a),(2012-12-17,a),(2012-12-23,c)})
I would like to remove the last item so that the following is the result:
(group_1,{(2012-12-15,a),(2012-12-17,a)})
-
Re: removing last item in a bag
Johnny Zhang 2013-03-12, 23:50
Hi, Chan:
I guess you might generate the bag like this:
A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
B = group A by f1;
C = foreach B generate *;
describe C;
C: {group: chararray,{(f1: chararray)},{(f2: chararray)},{(f3: chararray)}}
If this is the case, you can do:
A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
B = group A by f1;
C = foreach B generate group, A.f1, A.f2;
describe C;
C: {group: chararray,{(f1: chararray)},{(f2: chararray)}}
Does this make sense? Otherwise, can you share the script that generates the bag?
Johnny Zhang
-
Re: removing last item in a bag
Chan, Tim 2013-03-13, 00:28
Hi Johnny,
I forgot to mention that the bags will be of varying sizes, so I cannot use the method you described.
-
Re: removing last item in a bag
Johnny Zhang 2013-03-13, 00:40
Hi, Chan:
That's fine. How did you generate the bags with different sizes at runtime? That information would make it easier for me to come up with a solution. Thanks.
Johnny
-
Re: removing last item in a bag
Ruslan Al-Fakikh 2013-03-13, 03:06
Hi Chan,
Your task seems to be nontrivial in Pig. Basically, bags are not ordered, so you have to either sort first or decide exactly which tuple you want to remove. Some ways to solve the problem:
1) You can use the TOP builtin UDF, which basically does the opposite. I am not sure whether it will suit you from a performance point of view.
2) You can try something like this:
inputData = LOAD 'input' AS (key: chararray, date: chararray, letter: chararray);
grouped = GROUP inputData BY key;
DESCRIBE grouped;
DUMP grouped;
withCounts = FOREACH grouped GENERATE *, COUNT(inputData) AS Count;
DESCRIBE withCounts;
DUMP withCounts;
trimmed = FOREACH withCounts {
    ordered = ORDER inputData BY key;
    limited = LIMIT ordered (withCounts.Count - 1);
    GENERATE group, limited;
}
DESCRIBE trimmed;
DUMP trimmed;
This is what I got when I ran it on Pig 0.10:
grouped: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)}}
(group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)})
withCounts: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)},Count: long}
(group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)},3)
trimmed: {group: chararray,limited: {(key: chararray,date: chararray,letter: chararray)}}
(group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a)})
I am not sure whether it will perform well. Let me know if it helps.
Best Regards,
Ruslan Al-Fakikh
-
Re: removing last item in a bag
Ruslan Al-Fakikh 2013-03-13, 03:09
Chan,
Sorry, I meant
ordered = ORDER inputData BY date;
not
ordered = ORDER inputData BY key;
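Why the sort column matters can be shown with a small model outside Pig. The Python sketch below is only an illustration: a list of tuples stands in for one group's bag, and drop_last is a hypothetical helper, not a Pig function. Within a single group every tuple has the same key, so ordering by key does not decide which tuple ends up last, while ordering by date does.

```python
from operator import itemgetter

KEY, DATE = 0, 1  # field positions in (key, date, letter)

def drop_last(bag, sort_field):
    """Sort the bag on one field and return everything except the final tuple."""
    ordered = sorted(bag, key=itemgetter(sort_field))
    return ordered[:-1]

# Two arrival orders of the same bag; a bag itself carries no ordering guarantee.
bag_a = [("group_1", "2012-12-15", "a"),
         ("group_1", "2012-12-17", "a"),
         ("group_1", "2012-12-23", "c")]
bag_b = [("group_1", "2012-12-23", "c"),
         ("group_1", "2012-12-15", "a"),
         ("group_1", "2012-12-17", "a")]

# Ordering by key: all keys are equal, so the result follows the arrival order.
by_key_a = drop_last(bag_a, KEY)
by_key_b = drop_last(bag_b, KEY)

# Ordering by date: the tuple with the latest date is always the one dropped.
by_date_a = drop_last(bag_a, DATE)
by_date_b = drop_last(bag_b, DATE)
```

In this model the two by-date results agree, while the two by-key results differ because Python's stable sort simply preserves whatever order the tuples arrived in.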
-
Re: removing last item in a bag
Tim Chan 2013-03-13, 20:58
Hi Ruslan,
I'm using the trunk version of Pig.
For the following script:
test = LOAD '$test' USING PigStorage('\t') AS (
    visitor:chararray,
    submodelid:long,
    record_datetime:chararray );

test_grp = group test by visitor;

-- add counts of each bag
test_grp_cnt = foreach test_grp generate *, COUNT(test) as submodel_count;

smp = filter test_grp_cnt by submodel_count < 2;
dump smp;

-- remove last item in bag after sorting
test_last_removed = FOREACH test_grp_cnt {
    ordered = ORDER test BY record_datetime ASC;
    last_removed = LIMIT ordered (test_grp_cnt.submodel_count - 1);
    --last_removed = LIMIT ordered 3;
    GENERATE group as visitor, last_removed;
}

I get the following error:
ERROR 1066: Unable to open iterator for alias test_last_removed_smp. Backend error : Scalar has more than one row in the output. 1st : (uc3:3,{(uc3:3,200410586,2013-02-06 09:18:22),(uc3:3,200437662,2013-02-06 08:58:25),(uc3:3,200414442,2013-02-06 09:04:24)},3), 2nd :(S:382290531917004,{(S:382290531917004,200442423,2013-02-01 21:15:58),(S:382290531917004,200409672,2013-02-01 21:29:45),(S:382290531917004,200443484,2013-02-01 21:24:19)},3)
The error does not occur when I comment out the "last_removed..." line and uncomment the one below it.
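The backend message is consistent with how Pig treats test_grp_cnt.submodel_count inside an expression: a relation-dot-field reference is a scalar projection, which only works when the relation has a single row. The Python sketch below is a toy model of that rule, not Pig's implementation; scalar and ScalarError are hypothetical names, and the row values are taken from the error text above.

```python
class ScalarError(Exception):
    """Raised when a relation used as a scalar has more than one row."""

def scalar(relation, field):
    """Model of a relation.field reference: valid only for a one-row relation."""
    if len(relation) > 1:
        raise ScalarError("Scalar has more than one row in the output")
    # An empty relation is returned as None in this model.
    return relation[0][field] if relation else None

one_group = [{"group": "group_1", "submodel_count": 3}]
two_groups = [
    {"group": "uc3:3", "submodel_count": 3},
    {"group": "S:382290531917004", "submodel_count": 3},
]

# With a single group the lookup is unambiguous.
limit_for_one_group = scalar(one_group, "submodel_count") - 1

# With two groups there is no single value to return.
try:
    scalar(two_groups, "submodel_count")
    failed = False
except ScalarError:
    failed = True
```

This would explain why a script of this shape can succeed on a one-group test input and fail as soon as the input contains a second group.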
-
Re: removing last item in a bag
Ruslan Al-Fakikh 2013-03-15, 19:46
Oh, sorry. It seems that my script works only when there is a single group. With withCounts.Count I wanted to access the Count field of the row being processed, which should be a single value per row. Instead, withCounts.Count appears to refer to the outer context and sees all the rows in withCounts. Maybe someone else has an idea?
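One possible way around the scalar lookup, not proposed in the thread, is to do the trimming inside a UDF that receives each group's bag, so no count from another relation is needed. The sketch below is a Python function of the kind Pig can load as a Jython UDF. The file name drop_last.py, the function name drop_last_by_date, and the (key, date, letter) layout with the date at index 1 are assumptions for illustration. The outputSchema fallback is only there so the file also runs under plain Python; it assumes Pig's Jython engine supplies the real decorator.

```python
# drop_last.py (hypothetical file name)
try:
    outputSchema  # assumed to be supplied by Pig's Jython engine
except NameError:
    def outputSchema(schema):
        """Stand-in decorator so the file also runs under plain Python."""
        def wrap(func):
            return func
        return wrap

DATE_INDEX = 1  # assumed position of the date field in (key, date, letter)

@outputSchema("trimmed:bag{t:tuple(key:chararray,date:chararray,letter:chararray)}")
def drop_last_by_date(bag):
    """Return the bag's tuples sorted by date, without the latest one."""
    if not bag:
        return []
    ordered = sorted(bag, key=lambda t: t[DATE_INDEX])
    return ordered[:-1]

# Plain-Python check with two groups of different sizes.
group_1 = [("group_1", "2012-12-23", "c"),
           ("group_1", "2012-12-15", "a"),
           ("group_1", "2012-12-17", "a")]
group_2 = [("group_2", "2013-01-02", "b")]

trimmed_1 = drop_last_by_date(group_1)
trimmed_2 = drop_last_by_date(group_2)
```

In a Pig script this might be registered with something like REGISTER 'drop_last.py' USING jython AS udfs; and called as udfs.drop_last_by_date(inputData) in a FOREACH over the grouped relation. Whether it performs acceptably on large bags is untested here, since the whole bag is sorted in memory.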