HBase >> mail # user >> Speeding up the row count


Omkar Joshi 2013-04-17, 09:47
Jean-Marc Spaggiari 2013-04-17, 11:06
Vedad Kirlic 2013-04-17, 18:52
Omkar Joshi 2013-04-19, 07:33
Re: Speeding up the row count
Since there is only one region in your table, using the aggregation coprocessor has no advantage.
I think there may be some issue with your cluster - a row count over this much data should finish in well under 6 minutes.

Have you checked the server logs?

Thanks

On Apr 19, 2013, at 12:33 AM, Omkar Joshi <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have a 2-node (VM) Hadoop cluster on top of which HBase is running in distributed mode.
>
> I have a table named ORDERS with more than 100,000 rows.
>
> NOTE: Since my cluster is ultra-small, I didn't pre-split the table.
>
> ORDERS
> rowkey :                ORDER_ID
>
> column family : ORDER_DETAILS
>        columns : CUSTOMER_ID
>                        PRODUCT_ID
>                        REQUEST_DATE
>                        PRODUCT_QUANTITY
>                        PRICE
>                        PAYMENT_MODE
>
> The Java client code to check the record count is:
>
> // Requires (among others): org.apache.hadoop.hbase.client.coprocessor.AggregationClient,
> // org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter, org.apache.hadoop.hbase.util.Bytes
> public long getTableCount(String tableName, String columnFamilyName) {
>
>         AggregationClient aggregationClient = new AggregationClient(config);
>         Scan scan = new Scan();
>         scan.addFamily(Bytes.toBytes(columnFamilyName));
>         scan.setFilter(new FirstKeyOnlyFilter());
>
>         long rowCount = 0;
>
>         try {
>                 rowCount = aggregationClient.rowCount(Bytes.toBytes(tableName),
>                                 null, scan);
>                 System.out.println("No. of rows in " + tableName + " is "
>                                 + rowCount);
>         } catch (Throwable e) {
>                 // rowCount() declares Throwable, so it must be caught here
>                 e.printStackTrace();
>         }
>
>         return rowCount;
> }
>
> It has been running for more than 6 minutes now :(
>
> What can I do to speed up the execution to milliseconds (or at least a couple of seconds)?
>
> Regards,
> Omkar Joshi
>
>
> -----Original Message-----
> From: Vedad Kirlic [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, April 18, 2013 12:22 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Speeding up the row count
>
> Hi Omkar,
>
> If you are not interested in occurrences of a specific column (e.g. name,
> email ...) and just want the total number of rows regardless of their
> content (i.e. columns), you should avoid adding any columns to the Scan.
> In that case the coprocessor implementation behind AggregationClient adds
> a FirstKeyOnlyFilter to the Scan to avoid loading unnecessary columns,
> which should result in some speed-up.
>
> This is similar to what the hbase shell 'count' implementation does,
> although the reduction in overhead is bigger in that case, since data
> transfer from the region server to the client (shell) is minimized. With
> the coprocessor, the data never leaves the region server, so most of the
> improvement should come from avoiding loading of unnecessary files. Not
> sure how this applies to your particular case, given that the data set per
> row seems to be rather small. Also, AggregationClient will benefit you
> if/when your tables span multiple regions. Essentially, performance of
> this approach will 'degrade' as your table gets bigger, but only up to the
> point when it splits, after which it should be pretty constant. With this
> in mind, and given your type of data, you might consider pre-splitting
> your tables.
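[Editor's note: pre-splitting can be done at table creation time in the hbase shell. The fragment below is illustrative; the split points assume numeric ORDER_ID row keys, which may not match the actual key format.]

```shell
# hbase shell: create ORDERS pre-split into 4 regions
# (split points assume numeric ORDER_ID row keys; adjust to the real key space)
create 'ORDERS', 'ORDER_DETAILS', {SPLITS => ['25000', '50000', '75000']}

# the shell's own count, with a large scanner cache to cut client round trips
count 'ORDERS', CACHE => 10000
```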
>
> DISCLAIMER: this is mostly theoretical, since I'm not an expert in hbase
> internals :), so your best bet is to try it - I'm too lazy to verify the
> impact myself ;)
>
> Finally, if your case can tolerate eventual consistency between the counter
> and the actual number of rows, you can, as already suggested, have the
> RowCounter MapReduce job run every once in a while, write the counter(s)
> back to hbase, and read those when you need the number of rows.
>
> Regards,
> Vedad
>
>
>
> --
> View this message in context: http://apache-hbase.679495.n3.nabble.com/Speeding-up-the-row-count-tp4042378p4042415.html
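[Editor's note: Vedad's last suggestion, an eventually consistent counter refreshed by a periodic job, can be sketched in plain Java. The class and method names below are illustrative, not HBase API; in practice the refresh step would run the real RowCounter job (`hbase org.apache.hadoop.hbase.mapreduce.RowCounter ORDERS`) and write its result back to a counter cell.]

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of the eventually-consistent counter pattern: an expensive full
 * count (e.g. the RowCounter MapReduce job) runs periodically and publishes
 * its result; readers get the cached value instead of scanning the table.
 * The cached value may lag the true row count between refreshes.
 */
public class CachedRowCount {
    private final AtomicLong cached = new AtomicLong(-1); // -1 = not yet counted

    /** Cheap read path: returns the last published count (may be stale). */
    public long get() {
        return cached.get();
    }

    /** Expensive refresh path: in real life this would run RowCounter and
     *  store the result; here the fresh count is supplied by the caller. */
    public void publish(long freshCount) {
        cached.set(freshCount);
    }

    public static void main(String[] args) {
        CachedRowCount counter = new CachedRowCount();
        counter.publish(100_000L);          // periodic job publishes its result
        System.out.println(counter.get());  // fast read, no table scan
    }
}
```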
Omkar Joshi 2013-04-19, 09:55
Ted Yu 2013-04-19, 13:55
Omkar Joshi 2013-04-22, 06:39
lars hofhansl 2013-04-19, 18:10
James Taylor 2013-04-19, 15:59