Mohammad Tariq 2012-05-21, 11:54
Hello list,
I have an HDFS file that has 6 columns and contains some data stored in an HBase table. The data looks like this -
18.98 2000 1.21 193.46 2.64 58.17
52.49 2000.5 4.32 947.11 2.74 64.45
115.24 2001 16.8 878.58 2.66 94.49
55.55 2001.5 33.03 656.56 2.82 60.76
156.14 2002 35.52 83.75 2.6 59.57
138.77 2002.5 21.51 105.76 2.62 85.89
71.89 2003 27.79 709.01 2.63 85.44
59.84 2003.5 32.1 444.82 2.72 70.8
103.18 2004 4.09 413.15 2.8 54.37
Now I have to take each record along with its next 4 records and do some processing (for example, in the first shot I have to take records 1-5, in the next shot records 2-6, and so on). I am trying to use TOP for this, but I am getting the following error -
2012-05-21 17:04:30,328 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 6, column 37> Invalid scalar projection: parameters : A column needs to be projected from a relation for it to be used as a scalar
Details at logfile: /home/mohammad/pig-0.9.2/logs/pig_1337599211281.log
I am using the following commands -
grunt> a = load 'hbase://logdata'
>> using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>> 'cf:DGR cf:HD cf:POR cf:RES cf:RHOB cf:SON', '-loadKey true')
>> as (id, DGR, HD, POR, RES, RHOB, SON);
grunt> b = foreach a { c = TOP(5,3,a);
>> generate flatten(c);
>> }
Could anyone tell me how to achieve that? Many thanks.
Regards, Mohammad Tariq
Ruslan Al-fakikh 2012-05-21, 16:01
Hey Mohammad,
Here
c = TOP(5,3,a);
you say: take the 5 records out of a that have the biggest values in the third column. Do you really need that sorting by the third column?
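To make the semantics Ruslan describes concrete, here is a minimal sketch in plain Python (not Pig) of what a call like TOP(5, 3, bag) selects. The function name `top` and the `ROWS` list are illustrative assumptions; the rows are the nine sample records from the question without the HBase row key, and the column index is treated as zero-based, which is how I recall Pig's TOP counting columns. In Pig itself, TOP expects a bag as its third argument, so it is normally applied to the output of a GROUP rather than to a relation directly.

```python
import heapq

# The nine sample rows from the question (row key omitted).
ROWS = [
    (18.98, 2000.0, 1.21, 193.46, 2.64, 58.17),
    (52.49, 2000.5, 4.32, 947.11, 2.74, 64.45),
    (115.24, 2001.0, 16.8, 878.58, 2.66, 94.49),
    (55.55, 2001.5, 33.03, 656.56, 2.82, 60.76),
    (156.14, 2002.0, 35.52, 83.75, 2.6, 59.57),
    (138.77, 2002.5, 21.51, 105.76, 2.62, 85.89),
    (71.89, 2003.0, 27.79, 709.01, 2.63, 85.44),
    (59.84, 2003.5, 32.1, 444.82, 2.72, 70.8),
    (103.18, 2004.0, 4.09, 413.15, 2.8, 54.37),
]


def top(n, column, bag):
    """Return the n tuples of `bag` with the largest value at index `column`.

    Hypothetical stand-in for the selection TOP performs; the column index
    is assumed to be zero-based.
    """
    return heapq.nlargest(n, bag, key=lambda row: row[column])


if __name__ == "__main__":
    for row in top(5, 3, ROWS):
        print(row)
```

The selection depends only on the values in the chosen column, not on the position of a record in the input, which is why it does not give "this record and the next 4". The descending order of the result here is a side effect of `heapq.nlargest`; a Pig bag carries no such ordering guarantee.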
-----Original Message-----
From: Mohammad Tariq [mailto:[EMAIL PROTECTED]]
Sent: Monday, May 21, 2012 3:54 PM
To: [EMAIL PROTECTED]
Subject: How to use TOP?
Mohammad Tariq 2012-05-21, 17:33
Hi Ruslan,
Thanks for the response. I think I have made a mistake. Actually, I just want the top 5 records each time. I don't have any sorting requirements.
Regards, Mohammad Tariq
Abhinav Neelam 2012-05-21, 19:46
Hey Mohammad,
You need to have sorting requirements when you say 'top 5' records. Because relations/bags in Pig are unordered, it's natural to ask: 'top 5 by what parameter?' I'm unfamiliar with HBase, but if your data in HBase has an implicit ordering with say an auto-increment primary key, or an explicit one, you could include that field in your input to Pig and then apply TOP on that field.
Having said that, if I understand your problem correctly, you don't need TOP at all - you just want to process your input in groups of 5 tuples at a time. Again, I can't think of a way of doing this without modifying your input. For example, if your input included an extra field like this:
1 18.98 2000 1.21 193.46 2.64 58.17
1 52.49 2000.5 4.32 947.11 2.74 64.45
1 115.24 2001 16.8 878.58 2.66 94.49
1 55.55 2001.5 33.03 656.56 2.82 60.76
1 156.14 2002 35.52 83.75 2.6 59.57
2 138.77 2002.5 21.51 105.76 2.62 85.89
2 71.89 2003 27.79 709.01 2.63 85.44
2 59.84 2003.5 32.1 444.82 2.72 70.8
2 103.18 2004 4.09 413.15 2.8 54.37
you could do a group on that field and proceed. Even if you had a field like 'line number' or 'record number' in your input, you could still manipulate that field (say through integer division by 5) to use it for grouping. In any case, you need something to let Pig bring together your 5 tuple groups.
B = group A by $0;
C = FOREACH B { <do some processing on your 5 tuple bag A> ...
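The integer-division idea can be sketched outside Pig as follows. This is plain Python for illustration only; `group_by_block` and the zero-based record numbers are assumptions, not something from the thread, and the input is assumed to already carry a record number for every row.

```python
from collections import defaultdict


def group_by_block(numbered_rows, block_size=5):
    """Group (record_number, row) pairs by record_number // block_size.

    Record numbers are assumed to start at 0, so records 0-4 share key 0,
    records 5-9 share key 1, and so on.
    """
    groups = defaultdict(list)
    for record_number, row in numbered_rows:
        groups[record_number // block_size].append(row)
    return dict(groups)


if __name__ == "__main__":
    # Nine placeholder rows, numbered 0..8.
    numbered = [(i, ("row%d" % i,)) for i in range(9)]
    for key, rows in sorted(group_by_block(numbered).items()):
        print(key, len(rows))
```

Note that this produces disjoint blocks (the first five records, then the remaining four), which matches the example input above. It does not by itself produce the overlapping windows (records 1-5, then 2-6) described in the original question.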
Thanks, Abhinav
Hacking is, and always has been, the Holy Grail of computer science.
Mohammad Tariq 2012-05-22, 07:13
Hi Abhinav,
Thanks a lot for the valuable response. Actually, I was thinking of doing the same thing, but being new to Pig I thought of asking on the mailing list first. As far as the data is concerned, the second column will always be in ascending order, but I don't think that will be of any help. I think what you have suggested here would be the appropriate solution. I would like to ask you one thing, though: is it feasible to add that first column with the count in my Pig script, or do I have to change the data in my HBase table itself? If it is feasible in the script, how can I achieve it? Many thanks.
Regards, Mohammad Tariq
Abhinav Neelam 2012-05-22, 09:06
Doing it in the Pig script is not feasible because Pig doesn't have any notion of sequentiality - to maintain it, you'd need to have access to state that's shared globally by all the mappers and reducers. One way I can think of doing this is to have a UDF that maintains state - perhaps it can maintain a file that's NFS mounted or in HDFS so that it's available on all the task nodes; then any call to the UDF can update that file (atomically) and return a 'row number' that you could associate with your current tuple. Something like:
B = FOREACH A GENERATE $0, $1, $2, $3, MyUDFs.GETROWNUM() as rownum;
However, AFAIK, you'd be better off doing it in HBase - perhaps at the time of record insert, you could also add a 'row number' into the record?
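As a rough illustration of why a stored row number helps, the sketch below (plain Python, single process) numbers rows as they are written and then rebuilds the overlapping windows from the question. The names `number_rows` and `sliding_windows` are hypothetical, and the sketch deliberately ignores the hard part discussed above, namely keeping a counter consistent across many mappers and reducers.

```python
from itertools import count


def number_rows(rows, start=1):
    """Attach a sequential row number to each row at 'insert' time."""
    counter = count(start)
    return [(next(counter), row) for row in rows]


def sliding_windows(numbered_rows, size=5):
    """Yield each row together with the following size-1 rows.

    Rows are first sorted by their stored row number, so the result does
    not depend on the order in which they are read back.
    """
    ordered = sorted(numbered_rows, key=lambda pair: pair[0])
    for i in range(len(ordered) - size + 1):
        yield ordered[i:i + size]


if __name__ == "__main__":
    stored = number_rows(["r%d" % i for i in range(1, 10)])
    for window in sliding_windows(reversed(stored)):
        print([number for number, _ in window])
```

With nine rows this yields five windows: rows 1-5, 2-6, 3-7, 4-8 and 5-9. Rows near the end that do not have four successors get no window of their own in this sketch; whether that is the desired behaviour is not stated in the thread.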
Mohammad Tariq 2012-05-22, 09:49
Yes, it would be better if I do it at the time of insertion. I just have to add one more column. Thanks again.
Regards, Mohammad Tariq