HBase user mailing list >> Optimizing Multi Gets in hbase

Varun Sharma 2013-02-18, 09:57
Anoop Sam John 2013-02-18, 10:49
Viral Bajaria 2013-02-18, 10:49
Nicolas Liochon 2013-02-18, 10:56
ramkrishna vasudevan 2013-02-18, 11:07
Michael Segel 2013-02-18, 12:52

Re: Optimizing Multi Gets in hbase
As it happens, we did some tests on this last week.
It turns out that doing Gets in batches instead of a scan still gives you about a third of the scan's performance, i.e. it is only roughly 3x slower.

For example, when you have a table with, say, 10m rows and scanning it takes N seconds, then fetching all 10m rows as Gets in batches of 1000 takes ~3N seconds, which is pretty impressive.
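For concreteness, here is a minimal sketch of the two access patterns being compared, using the plain HTable client of that era. The table name, the source of the row keys, and the batch size of 1000 are assumptions for illustration only.

// imports: java.util.{List, ArrayList} and
// org.apache.hadoop.hbase.client.{HTable, Scan, Get, Result, ResultScanner}
HTable table = new HTable(conf, "mytable");   // conf: an existing HBase Configuration (assumed)

// Variant 1: one full scan over the table, with server-side batching.
Scan scan = new Scan();
scan.setCaching(1000);                        // rows fetched per scanner RPC
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
  // process r
}
scanner.close();

// Variant 2: the same rows fetched as point Gets, issued in batches of 1000.
List<Get> batch = new ArrayList<Get>(1000);
for (byte[] row : rows) {                     // rows: the keys to fetch (assumed)
  batch.add(new Get(row));
  if (batch.size() == 1000) {
    Result[] results = table.get(batch);      // one round of multi-get RPCs
    // process results
    batch.clear();
  }
}
if (!batch.isEmpty()) {
  Result[] results = table.get(batch);        // trailing partial batch
}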

Now, this is with all data in the cache!
When the data is not in the cache and the Gets are random, it is many orders of magnitude slower, as the Gets are sprayed all over the disk. In that case sorting the Gets and issuing scans would indeed be much more efficient.
The Gets in a batch are already sorted on the client, but as N. says it is hard to determine automatically when to turn many Gets into a Scan with filters. Without statistics/histograms I'd even wager that it would be impossible to do.
Imagine you issue 10000 random Gets but your table has 10bn rows; in that case it is almost certain that the Gets are faster than a scan.
Now imagine the Gets only cover a small key range. With statistics we could tell whether it would be beneficial to turn this into a scan.
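A toy sketch of the kind of decision such statistics would enable; the 1% threshold and the estimatedRowsInRange value are made-up placeholders, since HBase keeps no such histograms today:

// Hypothetical rule of thumb: if the requested keys are dense within their
// key range, a bounded Scan is likely cheaper; if they are sparse, point
// Gets win. estimatedRowsInRange would come from the proposed histograms.
boolean preferScan(int requestedKeys, long estimatedRowsInRange) {
  // e.g. switch to a scan once we'd touch more than ~1% of the rows in the range
  return requestedKeys > estimatedRowsInRange / 100;
}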

It's not that hard to add statistics to HBase. We would do it as part of the compactions and record the histograms in some table.
You can always do that yourself: if you suspect you are touching most rows in a table/region, just issue a scan with an appropriate filter (you may have to implement your own filter, though). Maybe we could add a version of RowFilter that matches against multiple keys.
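A minimal sketch of that approach with the stock filters, assuming table is an open HTable and keys is the sorted List<byte[]> of wanted row keys; for a large key set a purpose-built filter would be cheaper than this FilterList of exact-match RowFilters:

// imports: org.apache.hadoop.hbase.util.Bytes and
// org.apache.hadoop.hbase.filter.{FilterList, RowFilter, CompareFilter, BinaryComparator}
// Scan only the range covered by the sorted keys, keeping just the wanted rows.
byte[] start = keys.get(0);
byte[] stop = Bytes.add(keys.get(keys.size() - 1), new byte[] { 0 });  // stop row is exclusive
Scan scan = new Scan(start, stop);
FilterList filter = new FilterList(FilterList.Operator.MUST_PASS_ONE);
for (byte[] key : keys) {
  filter.addFilter(new RowFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(key)));
}
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);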
-- Lars

________________________________
 From: Varun Sharma <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, February 18, 2013 1:57 AM
Subject: Optimizing Multi Gets in hbase
 
Hi,

I am trying to do batched Gets on a cluster. Here is the code:

List<Get> gets = ...
// Prepare my Gets with the rows I need
Result[] results = myHTable.get(gets);

I have two questions about the above scenario:
i) Is this the optimal way to do this?
ii) I have a feeling that if there are multiple Gets in this case on the same region, then each one of them will instantiate a separate scan over the region even though a single scan would be sufficient. Am I mistaken here?

Thanks
Varun
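
One way to look at question ii) empirically, as a sketch: group the row keys by the region that hosts them using the client's location lookup (rows and myHTable as in the question above). In the 0.94 client, Gets destined for the same region server are already sent together in one multi-get call.

// imports: java.util.{ArrayList, HashMap, List, Map}, org.apache.hadoop.hbase.HRegionLocation
Map<String, List<byte[]>> rowsByRegion = new HashMap<String, List<byte[]>>();
for (byte[] row : rows) {
  HRegionLocation loc = myHTable.getRegionLocation(row);      // client-side META lookup
  String region = loc.getRegionInfo().getRegionNameAsString();
  List<byte[]> group = rowsByRegion.get(region);
  if (group == null) {
    group = new ArrayList<byte[]>();
    rowsByRegion.put(region, group);
  }
  group.add(row);
}
// rowsByRegion now shows how the batched Gets distribute across regions.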
Varun Sharma 2013-02-19, 06:45
lars hofhansl 2013-02-19, 08:02
Nicolas Liochon 2013-02-19, 08:37
Varun Sharma 2013-02-19, 15:52
Nicolas Liochon 2013-02-19, 17:28
Varun Sharma 2013-02-19, 18:19
lars hofhansl 2013-02-19, 18:27
Nicolas Liochon 2013-02-19, 18:42
Nicolas Liochon 2013-02-19, 18:46