Pig >> mail # user >> How to use TOP?


Re: How to use TOP?
Doing it in the Pig script is not feasible because Pig doesn't have any
notion of sequentiality. To maintain it, you'd need access to state that's
shared globally by all the mappers and reducers. One way I can think of
doing this is to have a UDF that maintains state: perhaps it can maintain
a file that's NFS-mounted or in HDFS so that it's available on all the task
nodes; then any call to the UDF can update that file (atomically) and
return a 'row number' that you could associate with your current tuple.
Something like:
B = FOREACH A GENERATE $0, $1, $2, $3, MyUDFs.GETROWNUM() as rownum;
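
To show why the counter has to live in shared state, here is a rough sketch
in plain Python (not Pig). LocalCounterUdf, the two "tasks" and the record
names are all made up for illustration; the only point is that a counter
kept in each task's memory restarts in every task.

```python
import itertools


class LocalCounterUdf:
    """Hypothetical stand-in for a UDF instance that counts in local memory."""

    def __init__(self):
        self._counter = itertools.count()

    def get_row_num(self):
        # Each instance hands out 0, 1, 2, ... independently of the others.
        return next(self._counter)


# Assume the input is split across two tasks, each with its own UDF instance.
task_a, task_b = LocalCounterUdf(), LocalCounterUdf()
split_a = ["r0", "r1", "r2"]
split_b = ["r3", "r4"]

numbered = [(task_a.get_row_num(), rec) for rec in split_a]
numbered += [(task_b.get_row_num(), rec) for rec in split_b]

rownums = [num for num, _ in numbered]
print(rownums)
```

The row numbers repeat across the two tasks, which is why the counter would
need to sit somewhere every task can update atomically.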

However, as far as I can tell, you'd be better off doing it in HBase: perhaps
at the time of record insert, you could also add a 'row number' to the record?
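
Once every record carries a row number, wherever it comes from, the
"groups of 5" part is just integer division on that number. A minimal
sketch in plain Python (not Pig), using the second-column values from the
sample data quoted below; group_by_chunk is a hypothetical helper, and the
shuffle only mimics the fact that a bag has no order:

```python
import random


def group_by_chunk(numbered_rows, size=5):
    """Group (rownum, row) pairs on rownum // size; arrival order is irrelevant."""
    groups = {}
    for rownum, row in numbered_rows:
        groups.setdefault(rownum // size, []).append((rownum, row))
    # Restore the original order inside each group by sorting on the row number.
    return {key: [row for _, row in sorted(pairs)] for key, pairs in groups.items()}


values = [18.98, 52.49, 115.24, 55.55, 156.14, 138.77, 71.89, 59.84, 103.18]
numbered = list(enumerate(values))  # row numbers 0..8, assumed to exist already
random.shuffle(numbered)            # pretend the records arrive in any order

groups = group_by_chunk(numbered)
for key in sorted(groups):
    print(key, groups[key])
```

In Pig the same idea would roughly be a GROUP on the row number divided by 5,
assuming the row number is an integer type so that the division truncates.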

On 22 May 2012 12:43, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> Hi Abhinav,
>
>   Thanks a lot for the valuable response. Actually, I was thinking of
> doing the same thing, but being new to Pig I thought of asking on
> the mailing list first. As far as the data is concerned, the second column
> will always be in ascending order, but I don't think that will be of any
> help. I think what you have suggested here would be the
> appropriate solution. I would like to ask one thing, though: is
> it feasible to add that first column holding the count in my Pig script, or
> do I have to change the data in my HBase table itself? If it is feasible,
> how can I achieve it in my script? Many thanks.
>
> Regards,
>     Mohammad Tariq
>
>
> On Tue, May 22, 2012 at 1:16 AM, Abhinav Neelam <[EMAIL PROTECTED]>
> wrote:
> > Hey Mohammad,
> >
> > You need to have sorting requirements when you say 'top 5' records.
> > Because relations/bags in Pig are unordered, it's natural to ask: 'top 5
> > by what parameter?' I'm unfamiliar with HBase, but if your data in HBase
> > has an implicit ordering with, say, an auto-increment primary key, or an
> > explicit one, you could include that field in your input to Pig and then
> > apply TOP on that field.
> >
> > Having said that, if I understand your problem correctly, you don't need
> > TOP at all - you just want to process your input in groups of 5 tuples
> > at a time. Again, I can't think of a way of doing this without modifying
> > your input. For example, if your input included an extra field like this:
> > 1  18.98   2000    1.21   193.46  2.64  58.17
> > 1  52.49   2000.5  4.32   947.11  2.74  64.45
> > 1  115.24  2001    16.8   878.58  2.66  94.49
> > 1  55.55   2001.5  33.03  656.56  2.82  60.76
> > 1  156.14  2002    35.52  83.75   2.6   59.57
> > 2  138.77  2002.5  21.51  105.76  2.62  85.89
> > 2  71.89   2003    27.79  709.01  2.63  85.44
> > 2  59.84   2003.5  32.1   444.82  2.72  70.8
> > 2  103.18  2004    4.09   413.15  2.8   54.37
> >
> > you could do a group on that field and proceed. Even if you had a field
> > like 'line number' or 'record number' in your input, you could still
> > manipulate that field (say through integer division by 5) to use it for
> > grouping. In any case, you need something to let Pig bring together your
> > 5 tuple groups.
> >
> > B = group A by $0;
> > C = FOREACH B { <do some processing on your 5 tuple bag A> ...
> >
> > Thanks,
> > Abhinav
> >
> > On 21 May 2012 23:03, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Ruslan,
> >>
> >>    Thanks for the response. I think I have made a mistake. Actually, I
> >> just want the top 5 records each time. I don't have any sorting
> >> requirements.
> >>
> >> Regards,
> >>     Mohammad Tariq
> >>
> >>
> >> On Mon, May 21, 2012 at 9:31 PM, Ruslan Al-fakikh
> >> <[EMAIL PROTECTED]> wrote:
> >> > Hey Mohammad,
> >> >
> >> > Here
> >> > c = TOP(5,3,a);
> >> > you say: take 5 records out of a that have the biggest values in the
> >> > third column. Do you really need that sorting by the third column?
> >> >
> >> > -----Original Message-----
> >> > From