Pig, mail # user - How to use TOP?


Re: How to use TOP?
Mohammad Tariq 2012-05-22, 07:13
Hi Abhinav,

   Thanks a lot for the valuable response. Actually, I was thinking of
doing the same thing, but being new to Pig I thought of asking on the
mailing list first. As far as the data is concerned, the second column
will always be in ascending order, but I don't think that will be of
much help. I think what you have suggested here is the appropriate
solution. One thing I would still like to ask: is it feasible to add
that first column holding the count within my Pig script itself, or do
I have to change the data in my HBase table? If it is feasible, how
can I achieve it in my script? Many thanks.

Regards,
    Mohammad Tariq
On Tue, May 22, 2012 at 1:16 AM, Abhinav Neelam <[EMAIL PROTECTED]> wrote:
> Hey Mohammad,
>
> When you say 'top 5 records', you implicitly need an ordering. Because
> relations/bags in Pig are unordered, it's natural to ask: 'top 5 by what
> parameter?' I'm unfamiliar with HBase, but if your data in HBase has an
> implicit ordering with say an auto-increment primary key, or an explicit
> one, you could include that field in your input to Pig and then apply TOP
> on that field.
>
> Having said that, if I understand your problem correctly, you don't need
> TOP at all - you just want to process your input in groups of 5 tuples at a
> time. Again, I can't think of a way of doing this without modifying your
> input. For example, if your input included an extra field like this:
> 1  18.98   2000    1.21   193.46  2.64  58.17
> 1  52.49   2000.5  4.32   947.11  2.74  64.45
> 1  115.24  2001    16.8   878.58  2.66  94.49
> 1  55.55   2001.5  33.03  656.56  2.82  60.76
> 1  156.14  2002    35.52  83.75   2.6   59.57
> 2  138.77  2002.5  21.51  105.76  2.62  85.89
> 2  71.89   2003    27.79  709.01  2.63  85.44
> 2  59.84   2003.5  32.1   444.82  2.72  70.8
> 2  103.18  2004    4.09   413.15  2.8   54.37
>
> you could do a group on that field and proceed. Even if you had a field
> like 'line number' or 'record number' in your input, you could still
> manipulate that field (say through integer division by 5) to use it for
> grouping. In any case, you need something to let Pig bring together your 5
> tuple groups.
>
> B = GROUP A BY $0;
> C = FOREACH B {
>     -- do some processing on your 5-tuple bag A here, e.g.:
>     GENERATE group, COUNT(A);
> }
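>
> A more concrete sketch, assuming a Pig version new enough to have the
> RANK operator (which prepends a running 1-based row number, so you
> would not have to change the HBase table itself) and six double-typed
> columns:
>
> A = LOAD 'input' AS (f1:double, f2:double, f3:double,
>                      f4:double, f5:double, f6:double);
> R = RANK A;                                 -- adds rank_A, a 1-based row number
> K = FOREACH R GENERATE (int)((rank_A - 1) / 5) AS grp, f1 .. f6;
> B = GROUP K BY grp;                         -- each bag now holds 5 consecutive tuples
> C = FOREACH B GENERATE group, AVG(K.f2);    -- e.g. average the second column per group
>
> The field names and the AVG step are just placeholders for whatever
> processing you actually need.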
>
> Thanks,
> Abhinav
>
> On 21 May 2012 23:03, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>
>> Hi Ruslan,
>>
>>    Thanks for the response. I think I have made a mistake. Actually, I
>> just want the top 5 records each time; I don't have any sorting
>> requirements.
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Mon, May 21, 2012 at 9:31 PM, Ruslan Al-fakikh
>> <[EMAIL PROTECTED]> wrote:
>> > Hey Mohammad,
>> >
>> > Here
>> > c = TOP(5,3,a);
>> > you say: take 5 records out of a that have the biggest values in the
>> third
>> > column. Do you really need that sorting by the third column?
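>> >
>> > For reference, the built-in TOP is usually called inside a FOREACH on
>> > a grouped relation, and its column argument is 0-indexed - a minimal
>> > sketch, assuming six double-typed columns:
>> >
>> > a = LOAD 'data' AS (c1:double, c2:double, c3:double,
>> >                     c4:double, c5:double, c6:double);
>> > b = GROUP a ALL;
>> > c = FOREACH b GENERATE FLATTEN(TOP(5, 2, a));  -- 5 tuples with the largest c3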
>> >
>> > -----Original Message-----
>> > From: Mohammad Tariq [mailto:[EMAIL PROTECTED]]
>> > Sent: Monday, May 21, 2012 3:54 PM
>> > To: [EMAIL PROTECTED]
>> > Subject: How to use TOP?
>> >
>> > Hello list,
>> >
>> >  I have an HDFS file with 6 columns containing some data stored in
>> > an HBase table. The data looks like this -
>> >
>> > 18.98   2000    1.21   193.46  2.64  58.17
>> > 52.49   2000.5  4.32   947.11  2.74  64.45
>> > 115.24  2001    16.8   878.58  2.66  94.49
>> > 55.55   2001.5  33.03  656.56  2.82  60.76
>> > 156.14  2002    35.52  83.75   2.6   59.57
>> > 138.77  2002.5  21.51  105.76  2.62  85.89
>> > 71.89   2003    27.79  709.01  2.63  85.44
>> > 59.84   2003.5  32.1   444.82  2.72  70.8
>> > 103.18  2004    4.09   413.15  2.8   54.37
>> >
>> > Now I have to take each record along with its next 4 records and do some
>> > processing (for example, in the first shot I have to take records 1-5, in