Pig >> mail # user >> removing last item in a bag


Re: removing last item in a bag
Hi Ruslan,

I'm using the trunk version of Pig.

For the following script:

test = LOAD '$test' USING PigStorage('\t') AS
    ( visitor:chararray,
      submodelid:long,
      record_datetime:chararray );

test_grp = group test by visitor;

-- add counts of each bag
test_grp_cnt = foreach test_grp
    generate
        *,
        COUNT(test) as submodel_count;
smp = filter test_grp_cnt by submodel_count < 2;
dump smp;
-- remove last item in bag after sorting
test_last_removed = FOREACH test_grp_cnt {
    ordered = ORDER test BY record_datetime ASC;
    last_removed = LIMIT ordered (test_grp_cnt.submodel_count - 1);
    --last_removed = LIMIT ordered 3;

    GENERATE
        group as visitor,
        last_removed;
}
I get the following error:

ERROR 1066: Unable to open iterator for alias test_last_removed_smp.
Backend error : Scalar has more than one row in the output. 1st :
(uc3:3,{(uc3:3,200410586,2013-02-06 09:18:22),(uc3:3,200437662,2013-02-06
08:58:25),(uc3:3,200414442,2013-02-06 09:04:24)},3), 2nd
:(S:382290531917004,{(S:382290531917004,200442423,2013-02-01
21:15:58),(S:382290531917004,200409672,2013-02-01
21:29:45),(S:382290531917004,200443484,2013-02-01 21:24:19)},3)

The error does not occur when I comment out the "last_removed..." line and
uncomment the one below it.
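The "Scalar has more than one row in the output" message suggests that `test_grp_cnt.submodel_count` inside the nested FOREACH is being read as a scalar projection of the entire `test_grp_cnt` relation, which only succeeds when that relation has exactly one row (one group). A possible fix, offered as an untested sketch and assuming a Pig version that accepts expressions in LIMIT (0.10 or later), is to reference the field of the current record directly:

```pig
-- Untested sketch: use the current record's submodel_count field,
-- not the relation-qualified scalar test_grp_cnt.submodel_count.
test_last_removed = FOREACH test_grp_cnt {
    ordered = ORDER test BY record_datetime ASC;
    -- submodel_count is a field of the record being processed,
    -- so no scalar cast of the whole relation is needed
    last_removed = LIMIT ordered (submodel_count - 1);
    GENERATE
        group as visitor,
        last_removed;
}
```

This would explain why the script below works on a single group but fails here: with one group the scalar projection has exactly one row, so no error is raised.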
On Tue, Mar 12, 2013 at 8:06 PM, Ruslan Al-Fakikh <[EMAIL PROTECTED]>wrote:

> Hi Chan,
>
> Your task seems to be non-trivial in Pig. Basically, bags are not ordered,
> so you have to either sort beforehand or decide exactly which tuple you
> want to remove. Some ways to solve the problem:
> 1) You can use the TOP builtin UDF which basically does the opposite and I
> am not sure whether it will suit you from the performance point of view
> 2) You can try something like this:
> inputData = LOAD 'input' AS (key: chararray, date: chararray, letter:
> chararray);
> grouped = GROUP inputData BY key;
> DESCRIBE grouped;
> DUMP grouped;
> withCounts = FOREACH grouped GENERATE *, COUNT(inputData) AS Count;
> DESCRIBE withCounts;
> DUMP withCounts;
> trimmed = FOREACH withCounts {
>         ordered = ORDER inputData BY key;
>         limited = LIMIT ordered (withCounts.Count - 1);
>         GENERATE
>                 group,
>                 limited;
> }
> DESCRIBE trimmed;
> DUMP trimmed;
>
> This is what I got when run on Pig 0.10:
>
> grouped: {group: chararray,inputData: {(key: chararray,date:
> chararray,letter: chararray)}}
>
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)})
>
> withCounts: {group: chararray,inputData: {(key: chararray,date:
> chararray,letter: chararray)},Count: long}
>
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)},3)
>
> trimmed: {group: chararray,limited: {(key: chararray,date:
> chararray,letter: chararray)}}
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a)})
>
> I am not sure whether it will perform well. Let me know if it helps.
>
> Best Regards,
> Ruslan Al-Fakikh
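
As a side note on the TOP suggestion in (1) above: TOP(n, column, relation) keeps the n tuples with the *largest* values in the given 0-indexed column, so applied directly it would drop the oldest record per group rather than the newest, which is presumably what "does the opposite" means. A hypothetical, untested sketch of that usage:

```pig
-- Untested sketch of the TOP builtin. TOP keeps the n tuples whose
-- value in the 0-indexed column is largest; with column 1 (date),
-- this drops the OLDEST record per group, not the newest.
inputData = LOAD 'input' AS (key: chararray, date: chararray, letter: chararray);
grouped = GROUP inputData BY key;
topped = FOREACH grouped GENERATE
    group,
    -- COUNT returns a long, so cast the first argument to int
    TOP((int)(COUNT(inputData) - 1), 1, inputData) AS kept;
```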
>
>
> On Wed, Mar 13, 2013 at 4:40 AM, Johnny Zhang <[EMAIL PROTECTED]>
> wrote:
>
> > Hi, Chan:
> > That's fine. How did you generate the bags with different sizes at
> > runtime? It will be easier for me to come up with a solution with this
> > information. Thanks.
> >
> > Johnny
> >
> >
> > On Tue, Mar 12, 2013 at 5:28 PM, Chan, Tim <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Johnny,
> > >
> > > I forgot to mention the bags will be of varying sizes, so I cannot use
> > > the method you described.
> > >
> > >
> > >
> > >
> > > On Tue, Mar 12, 2013 at 4:50 PM, Johnny Zhang <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Hi, Chan:
> > > > I guess you might generate the bag like this
> > > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
> > > > B = group A by f1;
> > > > C = foreach B generate *;
> > > > describe C;
> > > > C: {group: chararray,{(f1: chararray)},{(f2: chararray)},{(f3:
> > > chararray)}}
> > > >
> > > > if this is the case, you can do:
> > > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);