Pig >> mail # user >> why the foreach nested form can't work?


Re: why the foreach nested form can't work?
On Tue, 2011-07-19 at 16:05 +0200, 勇胡 wrote:
> How can I understand that 'A.score' is a bag? I mean that if I issue a
> 'describe B' command, I can get B: {group:int, A: {name:chararray,
> no:int,score:int}}.
Looking at the output of describe shows that A is a bag (note the '{' and
'}' characters), yes? So 'A.score' is simply the bag of all the scores
in the group. You can go further and get a bag of both the numbers and
the scores by looking at 'A.(no, score)'. I admit that it _is_ confusing
at first.

> From here, I can't get any information that 'A.score' is
> a bag, but I can see that A.score is an element of a bag.
Not true. 'score' is the name of the field. 'A.score' is a bag of just
the scores. The dot '.' pulls the named fields out of every tuple
within a bag, producing another bag. Consider:

A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
B = GROUP A BY no;
DUMP B;

(1,{(henrietta,1,25),(sally,1,82)})
(3,{(fred,3,120)})
(4,{(elsie,4,45)})

C = FOREACH B GENERATE A.score;
DUMP C;

({(25),(82)})
({(120)})
({(45)})

Got it?
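
The same projection works with more than one field at a time. A minimal
sketch, reusing 'foo.tsv' and the relations A and B from the example above;
the alias C2 is a name made up here for illustration:

```pig
A = LOAD 'foo.tsv' AS (name:chararray, no:int, score:int);
B = GROUP A BY no;

-- Project two fields out of every tuple in the bag A.
-- The result is still one bag per group, now holding (no, score) tuples.
C2 = FOREACH B GENERATE A.(no, score);

-- DESCRIBE prints the schema, so the '{' and '}' of the bag are visible.
DESCRIBE C2;
```

Each row of C2 should hold a single bag, for instance the pairs (1,25) and
(1,82) for group 1. The order of tuples inside a bag is not guaranteed.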

> And why does it work if I delete the qualifier 'A.'?
>
> I just changed my pig code as
>
> A = LOAD '/home/huyong/test/student.txt' AS (name:chararray, no:int, score:
> int);
> B = GROUP A BY no;
> C =  FOREACH B {
>     D = FILTER A BY score > 80;
>     GENERATE D.name, D.score;}
> DUMP C;
>
> I got an empty bag!
'D.name' and 'D.score' are bags of tuples. You will need to FLATTEN them
at the end, as in the example from my earlier message (quoted below).

>
> The input is as:
> henrietta       1       25
> sally   1       82
> fred    3       120
> elsie   4       45
>
> The output is as:
> ({(sally)},{(82)})
> ({(fred)},{(120)})
> ({},{})
>
> As you see, I got an empty tuple. Why?
There are three tuples, one for each group (1, 3, and 4). The filter
condition left the bags from group 4 empty since the only tuple,
(elsie,4,45), did not have a score > 80. If you FLATTEN the bags, the
empty ones are discarded.
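
To make the effect of FLATTEN concrete, here is a minimal sketch on the same
four-row input, assuming it is stored tab-separated in 'foo.tsv'. Emitting
'group' next to the flattened bag is an addition for illustration, so that
each output row shows which group it came from:

```pig
A = LOAD 'foo.tsv' AS (name:chararray, no:int, score:int);
B = GROUP A BY no;
C = FOREACH B {
    -- Inside the nested block, refer to the field as 'score', not 'A.score'.
    D = FILTER A BY score > 80;
    -- FLATTEN turns each (name, score) tuple in D into its own output row.
    -- A group whose D is empty contributes no rows at all.
    GENERATE group, FLATTEN(D.(name, score));
};
DUMP C;
```

With the data above this should print one row for sally (group 1) and one
for fred (group 3), and nothing for group 4, because its filtered bag is
empty.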

--jacob
@thedatachef

>
> Yong
>
> 2011/7/19 Jacob Perkins <[EMAIL PROTECTED]>
>
> > I think it's because 'A.score' is a bag but Pig needs a reference to a
> > field in the tuples. This worked for me:
> >
> > A = LOAD 'foo.tsv' AS (name:chararray, no:int, score: int);
> > B = GROUP A BY no;
> > C = FOREACH B {
> >       D = FILTER A BY score > 80;
> >      GENERATE FLATTEN(D.(name, score));
> >    };
> > DUMP C;
> >
> > on the following data:
> >
> > $: cat foo.tsv
> > henrietta       1       25
> > sally   1       82
> > fred    3       120
> > elsie   4       45
> >
> > yields:
> >
> >
> > Does that work for you?
> >
> > --jacob
> > @thedatachef
> >
> > On Tue, 2011-07-19 at 15:00 +0200, 勇胡 wrote:
> > > A = LOAD '/home/test/student.txt' AS (name:chararray, no:int, score:
> > > int);
> > > B = GROUP A BY no;
> > > C =  FOREACH B {
> > >     D = FILTER A BY A.score > 80;
> > >     GENERATE D.name, D.score;}
> > > DUMP C;
> >
> >