NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
Re: Bag of tuples
Each element in A is not a Bag. A relation is a collection of tuples (just
like a bag). So each element in A is a tuple whose first element is a Bag.

If you want to order the tuples by id, you have to extract them from the
bag first.

A = LOAD 'data' ...;
B = FOREACH A GENERATE FLATTEN($0);
C = ORDER B BY $0;
DUMP C;
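
For the original question (the top 5 ids across all users), the same FLATTEN step can feed a GROUP and a count. This is only a sketch, not a tested script. It assumes "top" means "most frequent", and the schema and the aliases flat, by_id, counts, ranked and top5 are illustrative names that do not come from the thread.

```pig
-- Assumption: each input record holds one bag of (id, f1, f2, f3) tuples.
A = LOAD 'data' AS (info: bag{t: (id, f1, f2, f3)});

-- Unnest every bag so each inner tuple becomes its own row.
flat = FOREACH A GENERATE FLATTEN(info);

-- Count how often each id occurs across all users.
by_id  = GROUP flat BY $0;
counts = FOREACH by_id GENERATE group AS id, COUNT(flat) AS n;

-- Keep the 5 most frequent ids.
ranked = ORDER counts BY n DESC;
top5   = LIMIT ranked 5;
DUMP top5;
```

If "top" instead means the largest ids, ORDER flat BY $0 DESC followed by LIMIT would replace the GROUP and COUNT steps.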

The error about “expression is not a project expression” is because you
opened a nested FOREACH block that does not end with a GENERATE.
If you want to find the top n tuples in a Bag you can use the TOP UDF.

A = LOAD 'data' AS (info: bag{t: (id, f1, f2, f3)});
-- TOP(n, column, bag): column is a 0-based integer index, so 1 selects f1.
B = FOREACH A GENERATE TOP(5, 1, info);
DUMP B;

TOP takes the ordering field as a 0-based column index rather than a field
name, and inside the FOREACH the bag is referenced by its field name (info),
not as A.info. This strategy might only work if all the tuples in your bag
have the same schema.

I also strongly recommend that you buy the Programming Pig book from
O’Reilly, written by Alan Gates, and read it cover to cover (it’s a pretty
small book at about 200 pages). It explains the basics of Pig, advanced
techniques, and optimization strategies. Not to mention it’s a fun read.

On Wed, Nov 6, 2013 at 2:38 PM, Sameer Tilak <[EMAIL PROTECTED]> wrote:

> Hi Alan,
> Thanks for your reply.
>
>
> I am trying to understand how Pig processes these relations. As I
> mentioned, my UDF returns the result in the following format:
>
>  {(id1,x,y,z), (id2, a, b, c), (id3,x,a)}  /* User 1 info */
>  {(id10,x,y,z), (id9, a, b, c), (id1,x,a)} /* User 2 info */
>  {(id8,x,y,z), (id4, a, b, c), (id2,x,a)} /* User 3 info */
>  {(id6,x,y,z), (id6, a, b, c), (id9,x,a)} /* User 4 info */
>
> B = foreach A { /* Each element in A is a bag, so the statements below
> are applied to each bag within A. Is this correct? */
> B1 = order A by $0; -- order on the id /* What does this A refer to? Does
> it refer to each bag of relation A? I get the following error:
> expression is not a project expression */
> /* rest of the code */
> }
>
> Thanks for your help.
>
>
> > Subject: Re: Bag of tuples
> > From: [EMAIL PROTECTED]
> > Date: Wed, 6 Nov 2013 09:36:04 -0800
> > To: [EMAIL PROTECTED]
> >
> > Do you mean you want to find the top 5 per input record?  Also, what is
> > your ordering criteria?  Just sort by id?  Something like this should order
> > all tuples in each bag by id and then produce the top 5.  My syntax may be
> > a little off as I'm working offline and don't have the manual in front of
> > me, but this should be the general idea.
> >
> > A = load 'yourinput' as (b:bag);
> > B = foreach A {
> >       B1 = order A by $0; -- order on the id
> >       B2 = limit B1 5;
> >       generate flatten(B2);
> > }
> >
> > Alan.
> >
> > On Nov 5, 2013, at 9:52 AM, Sameer Tilak wrote:
> >
> > > Hi Pig experts,
> > > Sorry to post so many questions, I have one more question on doing
> > > some analytics on bag of tuples.
> > >
> > > My input has the following format:
> > >
> > > {(id1,x,y,z), (id2, a, b, c), (id3,x,a)}  /* User 1 info */
> > > {(id10,x,y,z), (id9, a, b, c), (id1,x,a)} /* User 2 info */
> > > {(id8,x,y,z), (id4, a, b, c), (id2,x,a)} /* User 3 info */
> > > {(id6,x,y,z), (id6, a, b, c), (id9,x,a)} /* User 4 info */
> > >
> > > I can change my UDF to give simpler output. However, I want to
> > > find out if something like this can be done easily:
> > > I would like to find the top 5 ids (field 1 in a tuple) among all the
> > > users. Note that each user has a bag and the first field of each tuple in
> > > that bag is id.
> > >
> > > How difficult will it be to filter based on fields of tuples and do
> > > analytics across the entire user base?
> > >