Pig, mail # user - removing last item in a bag


Re: removing last item in a bag
Tim Chan 2013-03-13, 20:58
Hi Ruslan,

I'm using the trunk version of Pig.

For the following script:

test = LOAD '$test' USING PigStorage('\t') AS
    ( visitor:chararray,
      submodelid:long,
      record_datetime:chararray );

test_grp = group test by visitor;

-- add counts of each bag
test_grp_cnt = foreach test_grp
    generate
        *,
        COUNT(test) as submodel_count;
smp = filter test_grp_cnt by submodel_count < 2;
dump smp;
-- remove the last item in the bag after sorting
test_last_removed = FOREACH test_grp_cnt {
    ordered = ORDER test BY record_datetime ASC;
    last_removed = LIMIT ordered (test_grp_cnt.submodel_count - 1);
    --last_removed = LIMIT ordered 3;

    GENERATE
        group as visitor,
        last_removed;
}
I get the following error:

ERROR 1066: Unable to open iterator for alias test_last_removed_smp.
Backend error : Scalar has more than one row in the output. 1st :
(uc3:3,{(uc3:3,200410586,2013-02-06 09:18:22),(uc3:3,200437662,2013-02-06
08:58:25),(uc3:3,200414442,2013-02-06 09:04:24)},3), 2nd
:(S:382290531917004,{(S:382290531917004,200442423,2013-02-01
21:15:58),(S:382290531917004,200409672,2013-02-01
21:29:45),(S:382290531917004,200443484,2013-02-01 21:24:19)},3)

The error goes away when I comment out the "last_removed..." line and
uncomment the one below it.
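
A note on the error above: in Pig, an expression of the form relation.field (here test_grp_cnt.submodel_count) is a scalar projection, which requires the relation to contain exactly one row. test_grp_cnt has one row per visitor, hence "Scalar has more than one row in the output"; the constant form (LIMIT ordered 3) involves no scalar, so it runs. The Python sketch below is not Pig. It only models the per-group result the script appears to be aiming for (sort each visitor's bag by record_datetime, then keep count - 1 tuples), using the rows from the error message. The function name drop_last_per_visitor is hypothetical.

```python
from collections import defaultdict


def drop_last_per_visitor(rows):
    """Model of the intended result, not of Pig's execution.

    rows: iterable of (visitor, submodelid, record_datetime) tuples.
    Returns {visitor: [tuples sorted by record_datetime, last one removed]}.
    """
    bags = defaultdict(list)
    for row in rows:
        bags[row[0]].append(row)          # GROUP test BY visitor

    trimmed = {}
    for visitor, bag in bags.items():
        # ORDER test BY record_datetime ASC; 'YYYY-MM-DD HH:MM:SS'
        # strings sort chronologically when compared as text.
        ordered = sorted(bag, key=lambda r: r[2])
        # Keep (count - 1) tuples, where count is this group's own size
        # rather than a single value taken from the whole relation.
        trimmed[visitor] = ordered[:len(ordered) - 1]
    return trimmed


# Rows taken from the error message in this post.
rows = [
    ("uc3:3", 200410586, "2013-02-06 09:18:22"),
    ("uc3:3", 200437662, "2013-02-06 08:58:25"),
    ("uc3:3", 200414442, "2013-02-06 09:04:24"),
    ("S:382290531917004", 200442423, "2013-02-01 21:15:58"),
    ("S:382290531917004", 200409672, "2013-02-01 21:29:45"),
    ("S:382290531917004", 200443484, "2013-02-01 21:24:19"),
]
trimmed = drop_last_per_visitor(rows)
```

For these rows, each visitor keeps the two earliest records: submodel IDs 200437662 and 200414442 for uc3:3, and 200442423 and 200443484 for S:382290531917004. A bag with a single tuple comes back empty.
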
On Tue, Mar 12, 2013 at 8:06 PM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:

> Hi Chan,
>
> Your task seems to be nontrivial in Pig. Basically, bags are not ordered,
> so you have to either sort first or decide exactly which tuple you want to
> remove. Some ways to solve the problem:
> 1) You can use the TOP builtin UDF, which basically does the opposite, and I
> am not sure whether it will suit you from a performance point of view
> 2) You can try something like this:
> inputData = LOAD 'input' AS (key: chararray, date: chararray, letter:
> chararray);
> grouped = GROUP inputData BY key;
> DESCRIBE grouped;
> DUMP grouped;
> withCounts = FOREACH grouped GENERATE *, COUNT(inputData) AS Count;
> DESCRIBE withCounts;
> DUMP withCounts;
> trimmed = FOREACH withCounts {
>         ordered = ORDER inputData BY key;
>         limited = LIMIT ordered (withCounts.Count - 1);
>         GENERATE
>                 group,
>                 limited;
> }
> DESCRIBE trimmed;
> DUMP trimmed;
>
> This is what I got when I ran it on Pig 0.10:
>
> grouped: {group: chararray,inputData: {(key: chararray,date:
> chararray,letter: chararray)}}
>
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)})
>
> withCounts: {group: chararray,inputData: {(key: chararray,date:
> chararray,letter: chararray)},Count: long}
>
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)},3)
>
> trimmed: {group: chararray,limited: {(key: chararray,date:
> chararray,letter: chararray)}}
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a)})
>
> I am not sure whether it will perform well. Let me know if it helps.
>
> Best Regards,
> Ruslan Al-Fakikh
>
>
> On Wed, Mar 13, 2013 at 4:40 AM, Johnny Zhang <[EMAIL PROTECTED]>
> wrote:
>
> > Hi, Chan:
> > That's fine. How do you generate bags of different sizes at runtime?
> > It will be easier for me to come up with a solution with this information.
> > Thanks.
> >
> > Johnny
> >
> >
> > On Tue, Mar 12, 2013 at 5:28 PM, Chan, Tim <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Johnny,
> > >
> > > I forgot to mention that the bags will be of varying sizes, so I cannot
> > > use the method you described.
> > >
> > >
> > >
> > >
> > > On Tue, Mar 12, 2013 at 4:50 PM, Johnny Zhang <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Hi, Chan:
> > > > I guess you might generate the bag like this
> > > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
> > > > B = group A by f1;
> > > > C = foreach B generate *;
> > > > describe C;
> > > > C: {group: chararray,{(f1: chararray)},{(f2: chararray)},{(f3: chararray)}}
> > > >
> > > > if this is the case, you can do:
> > > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);