Re: why the foreach nested form can't work?
On Tue, 2011-07-19 at 16:05 +0200, 勇胡 wrote:
> How can I understand that 'A.score' is a bag? I mean that if I issue a
> 'describe B' command, I can get B: {group:int, A: {name:chararray,
> no:int,score:int}}.
Looking at the output of describe shows that A is a bag (note the '{' and
'}' characters), yes? So 'A.score' is simply the bag of all the scores
in the group. You can go further and get a bag of both the numbers and
the scores by projecting 'A.(no, score)'. I admit that it _is_ confusing
at first.

> From here, I can't get any information that 'A.score' is
> a bag, but I can see that A.score is an element of bag.
Not true. 'score' is the name of the field. 'A.score' is a bag of just
the scores. Using the dot '.' is a way of pulling out specific fields
from every tuple within a bag to result in another bag. Consider:

A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
B = GROUP A BY no;
DUMP B;

(1,{(henrietta,1,25),(sally,1,82)})
(3,{(fred,3,120)})
(4,{(elsie,4,45)})

C = FOREACH B GENERATE A.score;
DUMP C;

({(25),(82)})
({(120)})
({(45)})

Got it?
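
If it helps, the projection can be modelled outside Pig. The following is only a rough Python sketch, not Pig itself: it assumes a bag can be treated as a list of tuples, and the helper name `project` and the `FIELDS` constant are hypothetical, introduced just for illustration.

```python
# Hypothetical model: a Pig bag is represented as a Python list of tuples.
FIELDS = ("name", "no", "score")  # schema from the LOAD statement above


def project(bag, *names):
    """Model of 'bag.(f1, f2, ...)': keep the named fields of every tuple.

    The result is again a bag (list of tuples), never a single value.
    """
    idx = [FIELDS.index(n) for n in names]
    return [tuple(t[i] for i in idx) for t in bag]


# The bag for group 1, taken from the DUMP B output above.
group_1 = [("henrietta", 1, 25), ("sally", 1, 82)]

a_score = project(group_1, "score")           # models A.score
a_no_score = project(group_1, "no", "score")  # models A.(no, score)

print(a_score)     # [(25,), (82,)]
print(a_no_score)  # [(1, 25), (1, 82)]
```

The point of the model is that projecting one field still returns a bag of one-field tuples, which is why 'A.score' cannot be compared directly with 80.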

> And why does it work if I delete the qualifier 'A.'?
>
> I just changed my pig code as
>
> A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score:int);
> B = GROUP A BY no;
> C =  FOREACH B {
>     D = FILTER A BY score > 80;
>     GENERATE D.name, D.score;}
> DUMP C;
>
> I got an empty bag!
'D.name' and 'D.score' are bags of tuples. You will need to FLATTEN them
at the end, as in the example from my earlier message (quoted below).

>
> The input is as:
> henrietta       1       25
> sally   1       82
> fred    3       120
> elsie   4       45
>
> The output is as:
> ({(sally)},{(82)})
> ({(fred)},{(120)})
> ({},{})
>
> As you see, I got an empty tuple? why?
There are three tuples, one for each group (1, 3, and 4). The filter
condition left the bags from group 4 empty since the only tuple,
(elsie,4,45), did not have a score > 80. If you FLATTEN the bags, the
empty ones produce no output rows, so group 4 drops out of the result.
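
To make the difference concrete, here is a rough Python model of the two GENERATE variants. It is a sketch under the same assumption as before (a bag is a list of tuples), and the function names `group_by_no`, `without_flatten`, and `with_flatten` are hypothetical. It does not claim to reproduce how Pig executes the script.

```python
# The four input rows from the thread: (name, no, score).
rows = [
    ("henrietta", 1, 25),
    ("sally", 1, 82),
    ("fred", 3, 120),
    ("elsie", 4, 45),
]


def group_by_no(rows):
    """Model of 'GROUP A BY no': map each 'no' value to its bag of tuples."""
    groups = {}
    for row in rows:
        groups.setdefault(row[1], []).append(row)
    return groups


def without_flatten(groups):
    """Model of 'GENERATE D.name, D.score': one output row per group."""
    out = []
    for no in sorted(groups):
        d = [t for t in groups[no] if t[2] > 80]  # FILTER A BY score > 80
        out.append(([(t[0],) for t in d], [(t[2],) for t in d]))
    return out


def with_flatten(groups):
    """Model of 'GENERATE FLATTEN(D.(name, score))': one row per surviving tuple."""
    out = []
    for no in sorted(groups):
        d = [t for t in groups[no] if t[2] > 80]
        out.extend((t[0], t[2]) for t in d)  # an empty bag contributes nothing
    return out


groups = group_by_no(rows)
print(without_flatten(groups))
# [([('sally',)], [(82,)]), ([('fred',)], [(120,)]), ([], [])]
print(with_flatten(groups))
# [('sally', 82), ('fred', 120)]
```

Without FLATTEN every group yields a row, even when the filtered bag is empty, which matches the '({},{})' line in the output quoted above.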

--jacob
@thedatachef

>
> Yong
>
> 2011/7/19 Jacob Perkins <[EMAIL PROTECTED]>
>
> > I think it's because 'A.score' is a bag but Pig needs a reference to a
> > field in the tuples. This worked for me:
> >
> > A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
> > B = GROUP A BY no;
> > C = FOREACH B {
> >       D = FILTER A BY score > 80;
> >      GENERATE FLATTEN(D.(name, score));
> >    };
> > DUMP C;
> >
> > on the following data:
> >
> > $: cat foo.tsv
> > henrietta       1       25
> > sally   1       82
> > fred    3       120
> > elsie   4       45
> >
> > yields:
> >
> >
> > Does that work for you?
> >
> > --jacob
> > @thedatachef
> >
> > On Tue, 2011-07-19 at 15:00 +0200, 勇胡 wrote:
> > > A = LOAD '/home/test/student.txt' AS (name:chararray, no:int, score:int);
> > > B = GROUP A BY no;
> > > C =  FOREACH B {
> > >     D = FILTER A BY A.score > 80;
> > >     GENERATE D.name, D.score;}
> > > DUMP C;
> >
> >