
HBase user mailing list - Speeding up the row count

Re: Speeding up the row count
James Taylor 2013-04-19, 15:59
Phoenix will parallelize within a region:

SELECT count(1) FROM orders
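
(For what it's worth, that query can be run from Java through Phoenix's JDBC driver. A minimal sketch, assuming the Phoenix client jar is on the classpath with its JDBC driver registered, and a ZooKeeper quorum at "localhost" - both are placeholders for your setup:)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Counts the rows in ORDERS through Phoenix, which parallelizes the
    // scan within regions as well as across them.
    public class OrdersRowCount {
        public static void main(String[] args) throws Exception {
            // URL form is "jdbc:phoenix:<zookeeper quorum>"
            Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
            try {
                Statement stmt = conn.createStatement();
                ResultSet rs = stmt.executeQuery("SELECT count(1) FROM orders");
                if (rs.next()) {
                    System.out.println("row count = " + rs.getLong(1));
                }
            } finally {
                conn.close();
            }
        }
    }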

I agree with Ted, though: even serially, 100,000 rows shouldn't take anywhere near 6 minutes. You say > 100,000 rows. Can you tell us what it's < ?

Thanks,
James

On Apr 19, 2013, at 2:37 AM, "Ted Yu" <[EMAIL PROTECTED]> wrote:

> Since there is only one region in your table, using the aggregation coprocessor has no advantage.
> I think there may be some issue with your cluster - a row count should finish well within 6 minutes.
>
> Have you checked server logs ?
>
> Thanks
>
> On Apr 19, 2013, at 12:33 AM, Omkar Joshi <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I have a 2-node (VM-based) Hadoop cluster on top of which HBase is running in distributed mode.
>>
>> I have a table named ORDERS with >100,000 rows.
>>
>> NOTE: Since my cluster is ultra-small, I didn't pre-split the table.
>>
>> ORDERS
>> rowkey        : ORDER_ID
>>
>> column family : ORDER_DETAILS
>> columns       : CUSTOMER_ID
>>                 PRODUCT_ID
>>                 REQUEST_DATE
>>                 PRODUCT_QUANTITY
>>                 PRICE
>>                 PAYMENT_MODE
>>
>> The Java client code that simply checks the record count is:
>>
>> // assumes: import org.apache.hadoop.hbase.client.Scan;
>> //          import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
>> //          import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
>> //          import org.apache.hadoop.hbase.util.Bytes;
>> // 'config' is the HBase Configuration held by the enclosing class.
>> public long getTableCount(String tableName, String columnFamilyName) {
>>
>>     AggregationClient aggregationClient = new AggregationClient(config);
>>
>>     // Restrict the scan to the given family and to the first KeyValue
>>     // of each row, so only the minimum needed for counting is read.
>>     Scan scan = new Scan();
>>     scan.addFamily(Bytes.toBytes(columnFamilyName));
>>     scan.setFilter(new FirstKeyOnlyFilter());
>>
>>     long rowCount = 0;
>>
>>     try {
>>         rowCount = aggregationClient.rowCount(Bytes.toBytes(tableName),
>>                 null, scan);
>>         System.out.println("No. of rows in " + tableName + " is "
>>                 + rowCount);
>>     } catch (Throwable e) {
>>         // rowCount() declares Throwable; log it and fall through with 0
>>         e.printStackTrace();
>>     }
>>
>>     return rowCount;
>> }
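>>
>> (Side note: AggregationClient requires the AggregateImplementation
>> coprocessor to be loaded on the region servers; one way to do that,
>> shown as a sketch, is this hbase-site.xml entry:)
>>
>>     <property>
>>         <name>hbase.coprocessor.region.classes</name>
>>         <value>org.apache.hadoop.hbase.coprocessor.AggregateImplementation</value>
>>     </property>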
>>
>> It has been running for more than 6 minutes now :(
>>
>> What can I do to speed up the execution to milliseconds (or at least a couple of seconds)?
>>
>> Regards,
>> Omkar Joshi
>>
>>
>> -----Original Message-----
>> From: Vedad Kirlic [mailto:[EMAIL PROTECTED]]
>> Sent: Thursday, April 18, 2013 12:22 AM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Speeding up the row count
>>
>> Hi Omkar,
>>
>> If you are not interested in occurrences of a specific column (e.g. name,
>> email ...) and just want the total number of rows regardless of their
>> content (i.e. columns), you should avoid adding any columns to the Scan.
>> In that case the coprocessor implementation behind AggregationClient will
>> add a FirstKeyOnlyFilter to the Scan itself, avoiding the loading of
>> unnecessary columns, so this should result in some speed-up.
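>>
>> (In code, per the above, that means a bare Scan with no families, columns,
>> or filters added; a minimal sketch reusing the client from your snippet:)
>>
>>     // no addFamily()/addColumn()/setFilter() calls: the coprocessor is
>>     // then free to add the FirstKeyOnlyFilter on the server side itself
>>     Scan scan = new Scan();
>>     long rowCount = aggregationClient.rowCount(Bytes.toBytes("ORDERS"),
>>             null, scan);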
>>
>> This is a similar approach to what the hbase shell 'count' implementation
>> does, although the reduction in overhead is bigger in that case, since the
>> data transferred from the region server to the client (shell) is minimized,
>> whereas with the coprocessor the data never leaves the region server, so
>> most of the improvement comes from avoiding the loading of unnecessary
>> files. Not sure how this applies to your particular case, given that the
>> data set per row seems rather small. Also, AggregationClient will benefit
>> you if/when your tables span multiple regions, since the count then runs
>> in parallel across regions. Essentially, the performance of this approach
>> will 'degrade' as your table grows, but only up to the point where it
>> splits; from then on it should be pretty constant. With that in mind, and
>> given your type of data, you might consider pre-splitting your tables
>> (sketch below).
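>>
>> (A sketch of pre-splitting at table-creation time with the 0.94-era admin
>> API; the split points below are made-up ORDER_ID boundaries, pick ones
>> that match your real key distribution:)
>>
>>     // assumes: import org.apache.hadoop.hbase.HColumnDescriptor;
>>     //          import org.apache.hadoop.hbase.HTableDescriptor;
>>     //          import org.apache.hadoop.hbase.client.HBaseAdmin;
>>     HBaseAdmin admin = new HBaseAdmin(config);
>>     HTableDescriptor desc = new HTableDescriptor("ORDERS");
>>     desc.addFamily(new HColumnDescriptor("ORDER_DETAILS"));
>>     byte[][] splitKeys = new byte[][] {
>>             Bytes.toBytes("025000"),   // hypothetical key boundaries
>>             Bytes.toBytes("050000"),
>>             Bytes.toBytes("075000") };
>>     admin.createTable(desc, splitKeys);  // 4 regions instead of 1
>>     admin.close();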
>>
>> DISCLAIMER: this is mostly theoretical, since I'm not an expert in hbase
>> internals :), so your best bet is to try it - I'm too lazy to verify the
>> impact myself ;)