

Re: removing last item in a bag
Hi Chan,

This task is not trivial in Pig. Bags are not ordered, so "the last item" is
only defined once you either sort the bag or decide exactly which tuple you
want to remove. Some ways to solve the problem:
1) You can use the TOP builtin UDF, although it basically does the opposite
(it keeps the top N tuples), and I am not sure whether it will suit you from
a performance point of view.
2) You can try something like this:
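inputData = LOAD 'input' AS (key: chararray, date: chararray, letter: chararray);
grouped = GROUP inputData BY key;
DESCRIBE grouped;
DUMP grouped;
-- Keep the bag size next to each group so it can drive the LIMIT below.
withCounts = FOREACH grouped GENERATE *, COUNT(inputData) AS Count;
DESCRIBE withCounts;
DUMP withCounts;
trimmed = FOREACH withCounts {
        -- Sort by date so that "last" is well defined; every tuple in the
        -- bag has the same key, so ordering by key would not fix the order.
        ordered = ORDER inputData BY date;
        -- Keep all tuples but the last one (the sample run below has a
        -- single group).
        limited = LIMIT ordered (withCounts.Count - 1);
        GENERATE
                group,
                limited;
}
DESCRIBE trimmed;
DUMP trimmed;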

This is what I got when I ran it on Pig 0.10:

grouped: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)}}
(group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)})

withCounts: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)},Count: long}
(group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)},3)

trimmed: {group: chararray,limited: {(key: chararray,date: chararray,letter: chararray)}}
(group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a)})

I am not sure whether it will perform well. Let me know if it helps.
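
For reference, the logic of the script above can be modelled outside Pig. The
following is a minimal Python sketch, not Pig code, assuming rows shaped like
the (key, date, letter) sample in this thread and assuming that "last" means
the latest date. The function name drop_last_per_group and the sample rows are
made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows mirroring the (key, date, letter) sample in this thread,
# deliberately out of order because a bag carries no ordering guarantee.
rows = [
    ("group_1", "2012-12-23", "c"),
    ("group_1", "2012-12-15", "a"),
    ("group_1", "2012-12-17", "a"),
]


def drop_last_per_group(rows):
    """Group rows by key, sort each group by date, and drop the latest row."""
    groups = defaultdict(list)
    for key, date, letter in rows:
        groups[key].append((key, date, letter))

    trimmed = {}
    for key, bag in groups.items():
        # ISO-style dates (YYYY-MM-DD) sort correctly as plain strings.
        ordered = sorted(bag, key=lambda t: t[1])
        # Equivalent in spirit to LIMIT ordered (count - 1).
        trimmed[key] = ordered[:-1]
    return trimmed


result = drop_last_per_group(rows)
print(result)
```

The sort step is what gives "last" a meaning: without it, which tuple gets
dropped would depend on whatever order the rows happened to arrive in.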

Best Regards,
Ruslan Al-Fakikh
On Wed, Mar 13, 2013 at 4:40 AM, Johnny Zhang <[EMAIL PROTECTED]> wrote:

> Hi, Chan:
> That's fine. How do you generate the bags with different sizes at runtime?
> It will be easier for me to come up with a solution with this information.
> Thanks.
>
> Johnny
>
>
> On Tue, Mar 12, 2013 at 5:28 PM, Chan, Tim <[EMAIL PROTECTED]> wrote:
>
> > Hi Johnny,
> >
> > I forgot to mention the bags will be of varying sizes, so I cannot use the
> > method you described.
> >
> >
> >
> >
> > On Tue, Mar 12, 2013 at 4:50 PM, Johnny Zhang <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, Chan:
> > > I guess you might generate the bag like this
> > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
> > > B = group A by f1;
> > > C = foreach B generate group, A.f1, A.f2, A.f3;
> > > describe C;
> > > C: {group: chararray,{(f1: chararray)},{(f2: chararray)},{(f3: chararray)}}
> > >
> > > if this is the case, you can do:
> > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
> > > B = group A by f1;
> > > C = foreach B generate group, A.f1, A.f2;
> > > describe C;
> > > C: {group: chararray,{(f1: chararray)},{(f2: chararray)}}
> > >
> > > does this make sense? otherwise can you share your script which generates
> > > the bag?
> > >
> > > Johnny Zhang
> > >
> > >
> > > On Tue, Mar 12, 2013 at 4:33 PM, Chan, Tim <[EMAIL PROTECTED]> wrote:
> > >
> > > > How do I remove the last item in a bag?
> > > >
> > > > For example:
> > > >
> > > > (group_1,{(2012-12-15,a),(2012-12-17,a),(2012-12-23,c)})
> > > >
> > > >
> > > > I would like to remove the last item so that the following is the result:
> > > >
> > > >
> > > > (group_1,{(2012-12-15,a),(2012-12-17,a)})
> > > >
> > >
> >
>