-
Row count without iterating over ResultScanner? Wojciech Langiewicz 2011-05-01, 13:44
Hi,
Is there a way to quickly count the number of rows in a scan result? Right now I'm iterating over the ResultScanner like this:

    int count = 0;
    for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
        ++count;
    }

But with the number of rows reaching millions, this takes a while. I tried to find something in the documentation, but I didn't find anything. I would like to use the HBase API, not an MR job (because this cluster only has HDFS and HBase installed).

Thanks for all help.
--
Wojciech Langiewicz
-
Re: Row count without iterating over ResultScanner? Michel Segel 2011-05-01, 14:44
Hi,
There's a row counter app in the HBase release that runs as an M/R job. You could also keep a dynamic counter.

Mike Segel
-
RE: Row count without iterating over ResultScanner? Doug Meil 2011-05-01, 17:55
What caching value are you using on the scan? If you aren't setting it, the scan is probably using the default, which is 1, and that is slow. http://hbase.apache.org/book.html#d379e3504

Re: "I would like to use HBase API, not MR job (because this cluster only has HDFS and HBase installed)."

For very large tables you will want to start using an MR job for this.
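The caching advice above can be checked with simple arithmetic. With caching set to 1, each scanner.next() call costs one round trip to the region server, so counting N rows takes roughly N round trips; with a caching value of C it takes about ceil(N / C). The sketch below is a back-of-the-envelope model, not HBase code: the class name, the 5 million row table, and the 1 ms round-trip latency are assumptions picked for illustration. In client code the value is set with Scan.setCaching(int) before the scanner is opened.

```java
/** Rough model of how scanner caching affects the number of client/server round trips. */
final class ScanRoundTrips {

    /**
     * Approximate number of scanner RPCs needed to read `rows` rows when each
     * RPC returns up to `caching` rows (ceiling division).
     */
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        return (rows + caching - 1) / caching;
    }

    /** Time spent on network latency alone, in milliseconds. */
    static double latencyMillis(long rows, int caching, double roundTripMs) {
        return roundTrips(rows, caching) * roundTripMs;
    }

    public static void main(String[] args) {
        long rows = 5_000_000L;   // assumed table size
        double roundTripMs = 1.0; // assumed cost of one RPC

        for (int caching : new int[] {1, 100, 1000, 5000}) {
            System.out.printf("caching=%d: %d round trips, ~%.0f ms of latency%n",
                    caching, roundTrips(rows, caching),
                    latencyMillis(rows, caching, roundTripMs));
        }
        // In a real client the equivalent setting would be: scan.setCaching(1000);
    }
}
```

The model ignores server-side scan time and payload size, so it only explains why the default of 1 is slow, not how large the value should be. Larger caching values also mean more rows held in client memory per fetch.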
-
Re: Row count without iterating over ResultScanner? Himanshu Vashishtha 2011-05-01, 18:03
If you are interested in the row count only (and do not want to fetch the table rows to your client side), you can also try out https://issues.apache.org/jira/browse/HBASE-1512.

PS: Which version are you on? The above patch is in the main trunk as of now, so to use it you would have to check out the code and build it.

Thanks,
Himanshu
-
Re: Row count without iterating over ResultScanner? Wojciech Langiewicz 2011-05-01, 18:11
Yes, I was using the default caching. Setting this value to a few thousand made a significant difference in performance; I'll experiment more with this option.

Right now I want to stay away from MR, mainly because of cluster warm-up time, and I want to get results in almost real time (a few seconds max).

Thanks for the tip on caching!

On 01.05.2011 19:55, Doug Meil wrote:
> What caching value are you using on the scan? If you aren't setting this, it's probably using the default - which is 1. Which is slow.
-
Re: Row count without iterating over ResultScanner? Wojciech Langiewicz 2011-05-01, 18:29
Hi,

On 01.05.2011 20:03, Himanshu Vashishtha wrote:
> If you are interested in the row count only (and do not want to fetch the table rows
> to your client side), you can also try out
> https://issues.apache.org/jira/browse/HBASE-1512.

Yes, I only want to count rows and apply filters or select columns. Are filters also supported by those aggregate functions?

> PS: Which version are you on? The above patch is in the main trunk as of now, so
> to use it you would have to check out the code and build it.

I'm using the version from CDH3, which is 0.90.1-cdh3u0, but I'm not bound to this version.

Coprocessors with aggregate functions seem to be the thing I need. Thanks!
--
Wojciech Langiewicz
-
Re: Row count without iterating over ResultScanner? Himanshu Vashishtha 2011-05-01, 18:42
Yes, you can define your Scan object on the client side and pass it to AggregationClient.rowCount. Refer to the AggregationClient javadoc and the associated TestAggregateProtocol test methods to get an idea.

Thanks,
Himanshu
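The idea behind the coprocessor approach is that each region counts its own matching rows on the server and returns a single number, and the client only sums those partial counts, so no row data crosses the network. The sketch below is a self-contained model of that pattern and does not use the HBase API: `Region`, `countRows`, and the string row keys are hypothetical stand-ins, and the `Predicate` plays the role that the client-defined Scan (with its filters) plays in the real call.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

/** Model of server-side counting: regions count locally, the client sums. */
final class RegionCountSketch {

    /** Hypothetical stand-in for one region holding a slice of the row keys. */
    static final class Region {
        private final List<String> rowKeys;

        Region(List<String> rowKeys) {
            this.rowKeys = rowKeys;
        }

        /** "Server side": counts matching rows and returns only the number. */
        long countRows(Predicate<String> filter) {
            return rowKeys.stream().filter(filter).count();
        }
    }

    /** "Client side": one partial count per region, summed into the total. */
    static long rowCount(List<Region> regions, Predicate<String> filter) {
        long total = 0;
        for (Region region : regions) {
            total += region.countRows(filter);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Region> regions = Arrays.asList(
                new Region(Arrays.asList("a1", "a2", "b1")),
                new Region(Arrays.asList("b2", "c1")),
                new Region(Collections.<String>emptyList()));

        System.out.println(rowCount(regions, key -> true));                // prints 5
        System.out.println(rowCount(regions, key -> key.startsWith("b"))); // prints 2
    }
}
```

The client receives one long per region regardless of how many rows match, which is why this scales better than pulling every Result through a ResultScanner.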
-
RE: Row count without iterating over ResultScanner? Doug Meil 2011-05-01, 18:44
Another thing: be careful about which CFs/attributes you include in the Scan. If you add a column family (scan.addFamily), it will pull *all* the attributes of that column family. If you only care about a row count, pick only one very small attribute from the row.
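The point about column selection can be made concrete with a rough payload estimate. The sketch below is a model, not HBase code: the map stands in for one row's column family, and the qualifier names and value sizes are made up for illustration. It counts value bytes only and ignores key and protocol overhead. In the client API, requesting a whole family corresponds to scan.addFamily(family) and requesting a single attribute to scan.addColumn(family, qualifier).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Rough estimate of value bytes transferred per row for two scan shapes. */
final class ProjectionSketch {

    /** Value bytes shipped per row when the whole family is requested. */
    static long bytesForFamily(Map<String, byte[]> family) {
        long total = 0;
        for (byte[] value : family.values()) {
            total += value.length;
        }
        return total;
    }

    /** Value bytes shipped per row when only one qualifier is requested. */
    static long bytesForColumn(Map<String, byte[]> family, String qualifier) {
        byte[] value = family.get(qualifier);
        return value == null ? 0 : value.length;
    }

    public static void main(String[] args) {
        // Hypothetical row layout: one tiny attribute and two larger ones.
        Map<String, byte[]> family = new LinkedHashMap<>();
        family.put("flag", new byte[1]);
        family.put("title", new byte[200]);
        family.put("body", new byte[50_000]);

        long rows = 1_000_000L; // assumed number of rows to count
        System.out.println("whole family: " + rows * bytesForFamily(family) + " bytes");
        System.out.println("one column:   " + rows * bytesForColumn(family, "flag") + " bytes");
    }
}
```

One caveat with this approach: a row that lacks the chosen attribute would not be returned by the scan, so the attribute should be one that every row carries.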
-
Re: Row count without iterating over ResultScanner? Wojciech Langiewicz 2011-05-01, 18:49
Thanks. Also, following the documentation at the link you posted (13.6.5), I have applied those filters.
-
Re: Row count without iterating over ResultScanner? Wojciech Langiewicz 2011-05-01, 18:51
Thanks, that's great. But first I have to update HBase and read some documentation, so I'll let you know in a while how that works for me.

On 01.05.2011 20:42, Himanshu Vashishtha wrote:
> Yes, you can define your Scan object on the client side and pass it to
> AggregationClient.rowCount.