Pig >> mail # user >> Bag of tuples


Re: Bag of tuples
Do you mean you want to find the top 5 per input record? Also, what are your ordering criteria? Just sort by id? Something like this should order the tuples in each bag by id and then produce the top 5 for each record. My syntax may be a little off as I'm working offline and don't have the manual in front of me, but this should be the general idea.

A = load 'yourinput' as (b:bag{});
B = foreach A {
    B1 = order b by $0; -- order the tuples in this record's bag on the id
    B2 = limit B1 5;    -- keep the first 5 tuples after sorting
    generate flatten(B2);
};
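
If instead you want the top 5 ids across all users, a rough sketch would flatten the ids out of every bag and count them. This assumes "top" means "occurs most often", which the question does not actually say, and that each record loads as a single bag with no declared schema. The relation names (ids, grouped, counts, sorted, top5) are just illustrative, and I have not run this.

```pig
-- Assumption: each input record is one bag whose tuples start with the id.
A = load 'yourinput' as (b:bag{});

-- Project the first field of every tuple and flatten to one id per row.
ids = foreach A generate flatten(b.$0) as id;

-- Assumption: "top" means most frequent, so count occurrences of each id.
grouped = group ids by id;
counts = foreach grouped generate group as id, COUNT(ids) as n;

-- Sort by frequency, highest first, and keep the first 5.
sorted = order counts by n desc;
top5 = limit sorted 5;
dump top5;
```

If "top" means something else (say, the largest ids), the group and COUNT steps go away and you would order the flattened ids directly before the limit.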

Alan.

On Nov 5, 2013, at 9:52 AM, Sameer Tilak wrote:

> Hi Pig experts,
> Sorry to post so many questions, I have one more question on doing some analytics on bag of tuples.
>
> My input has the following format:
>
> {(id1,x,y,z), (id2, a, b, c), (id3,x,a)}  /* User 1 info */
> {(id10,x,y,z), (id9, a, b, c), (id1,x,a)} /* User 2 info */
> {(id8,x,y,z), (id4, a, b, c), (id2,x,a)} /* User 3 info */
> {(id6,x,y,z), (id6, a, b, c), (id9,x,a)} /* User 4 info */
>
> I can change my UDF to give simpler output. However, I want to find out whether something like this can be done easily:
> I would like to find the top 5 ids (field 1 in a tuple) among all the users. Note that each user has a bag, and the first field of each tuple in that bag is the id.
>
> How difficult will it be to filter based on fields of tuples and do analytics across the entire user base?
>    