Re: How to use TOP?
Hi Abhinav,

   Thanks a lot for the valuable response. Actually, I was thinking of
doing the same thing, but being new to Pig I thought I would ask on the
mailing list first. As far as the data is concerned, the second column
will always be in ascending order, but I don't think that will be of
any help. I think what you have suggested here is the appropriate
solution. I would like to ask you one thing, though: is it feasible to
add that first column holding the count in my Pig script, or do I have
to change the data in my HBase table itself? If it is feasible, how can
I achieve it in my script? Many thanks.
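
Just so I am sure I follow, is it something along these lines that you
mean? This is only a rough sketch on my side; I am assuming the RANK
operator (which I believe newer Pig releases provide) could give me that
running count, and 'mydata' is just a placeholder path:

raw     = LOAD 'mydata' AS (c1:double, c2:double, c3:double,
                            c4:double, c5:double, c6:double);
ranked  = RANK raw;  -- prepends a 1-based counter field named rank_raw
withkey = FOREACH ranked GENERATE ((rank_raw - 1) / 5) AS grp, c1 .. c6;
grouped = GROUP withkey BY grp;  -- each group now holds 5 consecutive records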

Regards,
    Mohammad Tariq
On Tue, May 22, 2012 at 1:16 AM, Abhinav Neelam <[EMAIL PROTECTED]> wrote:
> Hey Mohammad,
>
> You need to have sorting requirements when you say 'top 5' records. Because
> relations/bags in Pig are unordered, it's natural to ask: 'top 5 by what
> parameter?' I'm unfamiliar with HBase, but if your data in HBase has an
> implicit ordering with say an auto-increment primary key, or an explicit
> one, you could include that field in your input to Pig and then apply TOP
> on that field.
>
> Having said that, if I understand your problem correctly, you don't need
> TOP at all - you just want to process your input in groups of 5 tuples at a
> time. Again, I can't think of a way of doing this without modifying your
> input. For example, if your input included an extra field like this:
> 1  18.98   2000     1.21   193.46  2.64  58.17
> 1  52.49   2000.5   4.32   947.11  2.74  64.45
> 1  115.24  2001     16.8   878.58  2.66  94.49
> 1  55.55   2001.5   33.03  656.56  2.82  60.76
> 1  156.14  2002     35.52  83.75   2.6   59.57
> 2  138.77  2002.5   21.51  105.76  2.62  85.89
> 2  71.89   2003     27.79  709.01  2.63  85.44
> 2  59.84   2003.5   32.1   444.82  2.72  70.8
> 2  103.18  2004     4.09   413.15  2.8   54.37
>
> you could do a group on that field and proceed. Even if you had a field
> like 'line number' or 'record number' in your input, you could still
> manipulate that field (say through integer division by 5) to use it for
> grouping. In any case, you need something that lets Pig bring together
> your groups of 5 tuples.
>
> B = group A by $0;
> C = FOREACH B { <do some processing on your 5 tuple bag A> ...
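>
> Spelled out, the record-number variant might look something like the
> following. This is only a sketch: I'm assuming a 0-based record number
> in the first field, tab-separated input at a made-up path, and some
> stand-in processing in the last step.
>
> A  = LOAD 'data.tsv' USING PigStorage('\t')
>        AS (recno:int, c1:double, c2:double, c3:double,
>            c4:double, c5:double, c6:double);
> -- records 0-4 map to group 0, records 5-9 to group 1, and so on
> A2 = FOREACH A GENERATE (recno / 5) AS grp, c1 .. c6;
> B  = GROUP A2 BY grp;
> -- replace COUNT/AVG with whatever processing you need per 5-tuple bag
> C  = FOREACH B GENERATE group, COUNT(A2), AVG(A2.c1);
> DUMP C;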
>
> Thanks,
> Abhinav
>
> On 21 May 2012 23:03, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
>
>> Hi Ruslan,
>>
>>    Thanks for the response. I think I have made a mistake. Actually, I
>> just want the top 5 records each time; I don't have any sorting
>> requirements.
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Mon, May 21, 2012 at 9:31 PM, Ruslan Al-fakikh
>> <[EMAIL PROTECTED]> wrote:
>> > Hey Mohammad,
>> >
>> > Here
>> > c = TOP(5,3,a);
>> > you say: take 5 records out of a that have the biggest values in the
>> > third column. Do you really need that sorting by the third column?
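>> >
>> > For reference, TOP is normally called inside a nested FOREACH, and its
>> > column argument is a zero-based index. A minimal sketch, assuming a
>> > relation a whose fourth field is the one being compared:
>> >
>> > grouped = GROUP a ALL;
>> > c = FOREACH grouped {
>> >         top5 = TOP(5, 3, a);  -- 5 tuples with the largest values in column index 3
>> >         GENERATE FLATTEN(top5);
>> > };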
>> >
>> > -----Original Message-----
>> > From: Mohammad Tariq [mailto:[EMAIL PROTECTED]]
>> > Sent: Monday, May 21, 2012 3:54 PM
>> > To: [EMAIL PROTECTED]
>> > Subject: How to use TOP?
>> >
>> > Hello list,
>> >
>> >  I have an HDFS file with 6 columns that contains some data stored in
>> > an HBase table. The data looks like this -
>> >
>> > 18.98   2000     1.21   193.46  2.64  58.17
>> > 52.49   2000.5   4.32   947.11  2.74  64.45
>> > 115.24  2001     16.8   878.58  2.66  94.49
>> > 55.55   2001.5   33.03  656.56  2.82  60.76
>> > 156.14  2002     35.52  83.75   2.6   59.57
>> > 138.77  2002.5   21.51  105.76  2.62  85.89
>> > 71.89   2003     27.79  709.01  2.63  85.44
>> > 59.84   2003.5   32.1   444.82  2.72  70.8
>> > 103.18  2004     4.09   413.15  2.8   54.37
>> >
>> > Now I have to take each record along with its next 4 records and do some
>> > processing (for example, in the first shot I have to take records 1-5, in