Pig >> mail # user >> removing last item in a bag


Chan, Tim 2013-03-12, 23:33
Johnny Zhang 2013-03-12, 23:50
Chan, Tim 2013-03-13, 00:28
Johnny Zhang 2013-03-13, 00:40
Ruslan Al-Fakikh 2013-03-13, 03:06
Re: removing last item in a bag
Chan,

Sorry, I meant
ordered = ORDER inputData BY date;
not
ordered = ORDER inputData BY key;
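
Why the sort field matters: inside one group every tuple has the same key, so ordering by key leaves the order within the bag undefined, while ordering by date makes "last" mean "latest". The sketch below models that intended per-group behaviour in plain Python. It is not Pig, and the function name drop_latest_per_group is hypothetical. It assumes dates are ISO-formatted strings (as in the sample data), which sort correctly as plain text.

```python
from collections import defaultdict


def drop_latest_per_group(rows):
    """Group (key, date, letter) rows by key, sort each group by date,
    and drop the last (latest) row of every group."""
    groups = defaultdict(list)
    for key, date, letter in rows:
        groups[key].append((key, date, letter))

    trimmed = {}
    for key, bag in groups.items():
        # Sort on the date field; sorting on key would be a no-op here,
        # because all rows in one group share the same key.
        ordered = sorted(bag, key=lambda row: row[1])
        # Equivalent of LIMIT ordered (count - 1): keep all but the last row.
        trimmed[key] = ordered[:-1]
    return trimmed


# Sample rows from the thread, deliberately given out of date order.
rows = [
    ("group_1", "2012-12-23", "c"),
    ("group_1", "2012-12-15", "a"),
    ("group_1", "2012-12-17", "a"),
]
result = drop_latest_per_group(rows)
print(result)
```

A group that holds a single tuple ends up with an empty bag under this rule.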
On Wed, Mar 13, 2013 at 7:06 AM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:

> Hi Chan,
>
> Your task does not seem to be trivial in Pig. Basically, bags are not ordered,
> so you have to either sort first or decide exactly which tuple you want to
> remove. Some ways to solve the problem:
> 1) You can use the TOP builtin UDF, which basically does the opposite; I
> am not sure whether it will suit you from a performance point of view.
> 2) You can try something like this:
> inputData = LOAD 'input' AS (key: chararray, date: chararray, letter: chararray);
> grouped = GROUP inputData BY key;
> DESCRIBE grouped;
> DUMP grouped;
> withCounts = FOREACH grouped GENERATE *, COUNT(inputData) AS Count;
> DESCRIBE withCounts;
> DUMP withCounts;
> trimmed = FOREACH withCounts {
>         ordered = ORDER inputData BY key;
>         limited = LIMIT ordered (withCounts.Count - 1);
>         GENERATE
>                 group,
>                 limited;
> }
> DESCRIBE trimmed;
> DUMP trimmed;
>
> This is what I got when I ran it on Pig 0.10:
>
> grouped: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)}}
>
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)})
>
> withCounts: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)},Count: long}
>
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)},3)
>
> trimmed: {group: chararray,limited: {(key: chararray,date: chararray,letter: chararray)}}
> (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a)})
>
> I am not sure whether it will perform well. Let me know if it helps.
>
> Best Regards,
> Ruslan Al-Fakikh
>
>
> On Wed, Mar 13, 2013 at 4:40 AM, Johnny Zhang <[EMAIL PROTECTED]> wrote:
>
>> Hi, Chan:
>> That's fine. How did you generate bags of different sizes at runtime?
>> It will be easier for me to come up with a solution with this information.
>> Thanks.
>>
>> Johnny
>>
>>
>> On Tue, Mar 12, 2013 at 5:28 PM, Chan, Tim <[EMAIL PROTECTED]> wrote:
>>
>> > Hi Johnny,
>> >
>> > I forgot to mention that the bags will be of varying sizes, so I cannot use
>> > the method you described.
>> >
>> >
>> >
>> >
>> > On Tue, Mar 12, 2013 at 4:50 PM, Johnny Zhang <[EMAIL PROTECTED]>
>> > wrote:
>> >
>> > > Hi, Chan:
>> > > I guess you might generate the bag like this:
>> > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
>> > > B = group A by f1;
>> > > C = foreach B generate *;
>> > > describe C;
>> > > C: {group: chararray,{(f1: chararray)},{(f2: chararray)},{(f3: chararray)}}
>> > >
>> > > if this is the case, you can do:
>> > > A = load 'test.txt' as (f1:chararray,f2:chararray,f3:chararray);
>> > > B = group A by f1;
>> > > C = foreach B generate group, A.f1, A.f2;
>> > > describe C;
>> > > C: {group: chararray,{(f1: chararray)},{(f2: chararray)}}
>> > >
>> > > Does this make sense? Otherwise, can you share the script which
>> > > generates the bag?
>> > >
>> > > Johnny Zhang
>> > >
>> > >
>> > > On Tue, Mar 12, 2013 at 4:33 PM, Chan, Tim <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > How do I remove the last item in a bag?
>> > > >
>> > > > For example:
>> > > >
>> > > > (group_1,{(2012-12-15,a),(2012-12-17,a),(2012-12-23,c)})
>> > > >
>> > > >
>> > > > I would like to remove the last item so that the following is the
>> > > > result:
>> > > >
>> > > >
>> > > > (group_1,{(2012-12-15,a),(2012-12-17,a)})
>> > > >
>> > >
>> >
>>
>
>
Tim Chan 2013-03-13, 20:58
Ruslan Al-Fakikh 2013-03-15, 19:46