Pig, mail # user - removing last item in a bag


Chan, Tim 2013-03-12, 23:33
Johnny Zhang 2013-03-12, 23:50
Chan, Tim 2013-03-13, 00:28
Johnny Zhang 2013-03-13, 00:40
Ruslan Al-Fakikh 2013-03-13, 03:06
Ruslan Al-Fakikh 2013-03-13, 03:09
Tim Chan 2013-03-13, 20:58
Re: removing last item in a bag
Ruslan Al-Fakikh 2013-03-15, 19:46
Oh, sorry. It seems that my script worked only for the case where there is
only one group. Basically, with
withCounts.Count
I wanted to access the Count field of the row being processed, which should
be a single value per row. Instead, withCounts.Count seems to reference the
outer relation, so it sees every row in withCounts and fails as soon as
there is more than one group.
Maybe someone else has an idea?
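
The difference between the two contexts can be sketched in plain Python. This is only a rough model, not Pig code and not a fix for the script: the helper names `scalar` and `trim_per_row` and the `group_2` rows are hypothetical, the error text is copied from the message quoted below, and sorting by the date field is an assumption about the intended order.

```python
# Rough model in plain Python, NOT Pig code.
# Each row is a (group, bag, count) tuple, like a row of withCounts.
# Helper names and the group_2 rows are hypothetical.

def scalar(relation, field_index):
    """Model a relation-level reference such as withCounts.Count.

    It is only meaningful when the whole relation has at most one row.
    """
    if len(relation) > 1:
        raise ValueError("Scalar has more than one row in the output")
    return relation[0][field_index] if relation else None


def trim_per_row(relation, date_index=1):
    """Intended behaviour: drop the last tuple of each bag,
    using the count that belongs to that row."""
    trimmed = []
    for group, bag, count in relation:
        ordered = sorted(bag, key=lambda t: t[date_index])
        trimmed.append((group, ordered[:max(count - 1, 0)]))
    return trimmed


group_1 = ("group_1",
           [("group_1", "2012-12-15", "a"),
            ("group_1", "2012-12-17", "a"),
            ("group_1", "2012-12-23", "c")],
           3)
# Hypothetical second group, added only to show the multi-group case.
group_2 = ("group_2",
           [("group_2", "2013-01-02", "x"),
            ("group_2", "2013-01-01", "y")],
           2)

one_group = [group_1]
two_groups = [group_1, group_2]

print(scalar(one_group, 2))      # fine: the relation has a single row
print(trim_per_row(two_groups))  # each bag loses its latest tuple
try:
    scalar(two_groups, 2)        # fails once there is a second group
except ValueError as err:
    print("error:", err)
```

With a single group the relation-level lookup happens to return the right count, which is why the one-group test looked correct. With two groups the lookup has no single value to return, while the per-row version still works.
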
On Thu, Mar 14, 2013 at 12:58 AM, Tim Chan <[EMAIL PROTECTED]> wrote:

> Hi Ruslan,
>
> I'm using the trunk version of Pig.
>
> For the following script:
>
> test = LOAD '$test' USING PigStorage('\t') AS
>     ( visitor:chararray,
>       submodelid:long,
>       record_datetime:chararray );
>
> test_grp = group test by visitor;
>
> -- add counts of each bag
> test_grp_cnt = foreach test_grp
>     generate
>         *,
>         COUNT(test) as submodel_count;
>
>
> smp = filter test_grp_cnt by submodel_count < 2;
> dump smp;
>
>
> -- remove last item in bag after sorting
> test_last_removed = FOREACH test_grp_cnt {
>     ordered = ORDER test BY record_datetime ASC;
>     last_removed = LIMIT ordered (test_grp_cnt.submodel_count - 1);
>     --last_removed = LIMIT ordered 3;
>
>     GENERATE
>         group as visitor,
>         last_removed;
> }
>
>
> I get the following error:
>
> ERROR 1066: Unable to open iterator for alias test_last_removed_smp.
> Backend error : Scalar has more than one row in the output. 1st :
> (uc3:3,{(uc3:3,200410586,2013-02-06 09:18:22),(uc3:3,200437662,2013-02-06
> 08:58:25),(uc3:3,200414442,2013-02-06 09:04:24)},3), 2nd
> :(S:382290531917004,{(S:382290531917004,200442423,2013-02-01
> 21:15:58),(S:382290531917004,200409672,2013-02-01
> 21:29:45),(S:382290531917004,200443484,2013-02-01 21:24:19)},3)
>
> The error does not occur when I comment out the "last_removed..." line and
> uncomment the one below it.
>
>
>
>
> On Tue, Mar 12, 2013 at 8:06 PM, Ruslan Al-Fakikh <[EMAIL PROTECTED]> wrote:
>
> > Hi Chan,
> >
> > Your task seems to be non-trivial in Pig. Basically, bags are not ordered,
> > so you have to either sort first or decide exactly which tuple you want
> > to remove. Some ways to solve the problem:
> > 1) You can use the TOP builtin UDF, which basically does the opposite
> > (it keeps the top N tuples), though I am not sure whether it will suit
> > you from a performance point of view.
> > 2) You can try something like this:
> > inputData = LOAD 'input' AS (key: chararray, date: chararray, letter: chararray);
> > grouped = GROUP inputData BY key;
> > DESCRIBE grouped;
> > DUMP grouped;
> > withCounts = FOREACH grouped GENERATE *, COUNT(inputData) AS Count;
> > DESCRIBE withCounts;
> > DUMP withCounts;
> > trimmed = FOREACH withCounts {
> >         ordered = ORDER inputData BY key;
> >         limited = LIMIT ordered (withCounts.Count - 1);
> >         GENERATE
> >                 group,
> >                 limited;
> > }
> > DESCRIBE trimmed;
> > DUMP trimmed;
> >
> > This is what I got when run on Pig 0.10:
> >
> > grouped: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)}}
> >
> > (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)})
> >
> > withCounts: {group: chararray,inputData: {(key: chararray,date: chararray,letter: chararray)},Count: long}
> >
> > (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a),(group_1,2012-12-23,c)},3)
> >
> > trimmed: {group: chararray,limited: {(key: chararray,date: chararray,letter: chararray)}}
> > (group_1,{(group_1,2012-12-15,a),(group_1,2012-12-17,a)})
> >
> > I am not sure whether it will perform well. Let me know if it helps.
> >
> > Best Regards,
> > Ruslan Al-Fakikh
> >
> >
> > On Wed, Mar 13, 2013 at 4:40 AM, Johnny Zhang <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, Chan:
> > > That's fine. How did you generate bags of different sizes at
> > > runtime? It will be easier for me to come up with a solution with
> > > this information.
> > > Thanks.
> > >
> > > Johnny
> > >
> > >
> > > On Tue, Mar 12, 2013 at 5:28 PM, Chan, Tim <[EMAIL PROTECTED]> wrote: