Pig >> mail # user >> How to use TOP?


Mohammad Tariq 2012-05-21, 11:54
Ruslan Al-fakikh 2012-05-21, 16:01
Mohammad Tariq 2012-05-21, 17:33
Abhinav Neelam 2012-05-21, 19:46
Mohammad Tariq 2012-05-22, 07:13
Re: How to use TOP?
Doing it in the Pig script is not feasible because Pig doesn't have any
notion of sequentiality. To maintain it, you'd need access to state that's
shared globally by all the mappers and reducers. One way I can think of
doing this is to have a UDF that maintains state: perhaps it keeps a file
that's NFS-mounted or in HDFS, so that it's available on all the task
nodes; then any call to the UDF can update that file (atomically) and
return a 'row number' that you could associate with your current tuple.
Something like:
B = FOREACH A GENERATE $0, $1, $2, $3, MyUDFs.GETROWNUM() as rownum;

However, AFAIK, you'd be better off doing it in HBase - perhaps at the time
of record insert, you could also add a 'row number' into the record?
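
If each record did carry such a row number, the grouping into batches of 5
described in the quoted message below could look roughly like the following
Pig Latin. This is only a sketch: the path 'records', the PigStorage loader,
the field names f1..f3, and a 0-based integer rownum are all assumptions made
for illustration, not details from this thread.

```pig
-- Sketch only: 'records' and the field names are hypothetical, and rownum is
-- assumed to be a 0-based integer stored with each record.
A = LOAD 'records' USING PigStorage('\t')
    AS (rownum:int, f1:double, f2:double, f3:double);

-- int / int is integer division, so rows 0-4 map to bucket 0,
-- rows 5-9 to bucket 1, and so on.
B = FOREACH A GENERATE rownum / 5 AS bucket, rownum, f1, f2, f3;

-- Each group now holds one bag of at most 5 tuples.
C = GROUP B BY bucket;

-- Example per-group processing: count the tuples and average one field.
D = FOREACH C GENERATE group AS bucket, COUNT(B) AS n, AVG(B.f1) AS avg_f1;

DUMP D;
```

If rownum starts at 1 instead of 0, grouping on (rownum - 1) / 5 keeps the
first five rows together.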

On 22 May 2012 12:43, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> Hi Abhinav,
>
>   Thanks a lot for the valuable response. Actually, I was thinking of
> doing the same thing, but being new to Pig I thought of asking on
> the mailing list first. As far as the data is concerned, the second column
> will always be in ascending order, but I don't think it will be of any
> help. I think what you have suggested here would be the
> appropriate solution. I would like to ask you one thing, though: is
> it feasible to add that first column holding the count in my Pig script, or
> do I have to change the data in my HBase table itself? If it is feasible,
> how can I achieve it in my script? Many thanks.
>
> Regards,
>     Mohammad Tariq
>
>
> On Tue, May 22, 2012 at 1:16 AM, Abhinav Neelam <[EMAIL PROTECTED]>
> wrote:
> > Hey Mohammad,
> >
> > You need to have sorting requirements when you say 'top 5' records.
> > Because relations/bags in Pig are unordered, it's natural to ask: 'top 5
> > by what parameter?' I'm unfamiliar with HBase, but if your data in HBase
> > has an implicit ordering with say an auto-increment primary key, or an
> > explicit one, you could include that field in your input to Pig and then
> > apply TOP on that field.
> >
> > Having said that, if I understand your problem correctly, you don't need
> > TOP at all - you just want to process your input in groups of 5 tuples at
> > a time. Again, I can't think of a way of doing this without modifying your
> > input. For example, if your input included an extra field like this:
> > 1  18.98   2000    1.21   193.46  2.64  58.17
> > 1  52.49   2000.5  4.32   947.11  2.74  64.45
> > 1  115.24  2001    16.8   878.58  2.66  94.49
> > 1  55.55   2001.5  33.03  656.56  2.82  60.76
> > 1  156.14  2002    35.52  83.75   2.6   59.57
> > 2  138.77  2002.5  21.51  105.76  2.62  85.89
> > 2  71.89   2003    27.79  709.01  2.63  85.44
> > 2  59.84   2003.5  32.1   444.82  2.72  70.8
> > 2  103.18  2004    4.09   413.15  2.8   54.37
> >
> > you could do a group on that field and proceed. Even if you had a field
> > like 'line number' or 'record number' in your input, you could still
> > manipulate that field (say through integer division by 5) to use it for
> > grouping. In any case, you need something to let Pig bring together your
> > 5-tuple groups.
> >
> > B = group A by $0;
> > C = FOREACH B { <do some processing on your 5 tuple bag A> ...
> >
> > Thanks,
> > Abhinav
> >
> > On 21 May 2012 23:03, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Ruslan,
> >>
> >>    Thanks for the response. I think I have made a mistake. Actually, I
> >> just want the top 5 records each time. I don't have any sorting
> >> requirements.
> >>
> >> Regards,
> >>     Mohammad Tariq
> >>
> >>
> >> On Mon, May 21, 2012 at 9:31 PM, Ruslan Al-fakikh
> >> <[EMAIL PROTECTED]> wrote:
> >> > Hey Mohammad,
> >> >
> >> > Here
> >> > c = TOP(5,3,a);
> >> > you say: take the 5 records of a that have the biggest values in the
> >> > fourth column (TOP's column index is 0-based). Do you really need
> >> > that sorting by column?
> >> >
> >> > -----Original Message-----
> >> > From
Mohammad Tariq 2012-05-22, 09:49