HBase >> mail # user >> Optimizing Multi Gets in hbase


Re: Optimizing Multi Gets in hbase
Looking at the code, it seems possible to do this server side within the
multi invocation: we could group the gets by region, and do a single scan.
We could also add some heuristics if necessary...
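
To make the grouping idea concrete, here is a minimal sketch in plain Java. It is not the HBase server code: RegionGrouper, addRegion and group are hypothetical names, and regions are modelled only by their start keys. Row keys are sorted as unsigned bytes and each one is assigned to the region with the greatest start key not above it, so every region ends up with one sorted run of keys that a single scan could cover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of HBase: groups row keys by the region that holds them.
final class RegionGrouper {
    // Compare row keys as unsigned bytes, lexicographically.
    private static final Comparator<byte[]> UNSIGNED_BYTES = Arrays::compareUnsigned;

    // Region start key -> region name (assumption: a region is identified by its start key).
    private final TreeMap<byte[], String> regionsByStartKey = new TreeMap<>(UNSIGNED_BYTES);

    void addRegion(byte[] startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // Returns region name -> sorted row keys that fall into that region.
    Map<String, List<byte[]>> group(List<byte[]> rowKeys) {
        List<byte[]> sorted = new ArrayList<>(rowKeys);
        sorted.sort(UNSIGNED_BYTES);
        Map<String, List<byte[]>> grouped = new LinkedHashMap<>();
        for (byte[] row : sorted) {
            // The owning region is the one with the greatest start key <= row.
            Map.Entry<byte[], String> region = regionsByStartKey.floorEntry(row);
            if (region == null) {
                throw new IllegalArgumentException("no region covers this row key");
            }
            grouped.computeIfAbsent(region.getValue(), k -> new ArrayList<>()).add(row);
        }
        return grouped;
    }
}
```

The first and last key of each group would give the bounds of the per-region scan; whether that scan beats individual gets is exactly where the heuristics would come in.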

On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> I should qualify that statement, actually.
>
> I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> returned.
>
> As James Taylor pointed out to me privately: A fairer comparison would
> have been to run a scan with a filter that lets x% of the rows pass (i.e.
> the selectivity of the scan would be x%) and compare that to a multi Get of
> the same x% of the rows.
>
> There we found that a Scan+Filter is more efficient than issuing multi
> Gets if x is >= 1-2%.
>
>
> Or in other words, translating many Gets into a Scan+Filter is beneficial
> if the Scan would return at least 1-2% of the rows to the client.
> For example:
> if you are looking for less than 10-20k rows in 1m rows, using multi Gets
> is likely more efficient.
> if you are looking for more than 10-20k rows in 1m rows, using a
> Scan+Filter is likely more efficient.
>
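The rule of thumb above can be written down as a small check. This is only a sketch of the arithmetic: GetOrScanHeuristic and preferScan are hypothetical names, and the threshold is a parameter because the 1-2% figure comes from one test with all data in the cache.

```java
// Hypothetical helper: decides between multi Gets and Scan+Filter by selectivity.
final class GetOrScanHeuristic {
    // Lower end of the 1-2% range quoted above; an assumption, tune for your data.
    static final double DEFAULT_THRESHOLD = 0.01;

    // True if the wanted rows are at least `threshold` of all rows in the scanned range.
    static boolean preferScan(long wantedRows, long totalRows, double threshold) {
        if (totalRows <= 0 || wantedRows < 0) {
            throw new IllegalArgumentException("row counts must be positive");
        }
        double selectivity = (double) wantedRows / totalRows;
        return selectivity >= threshold;
    }
}
```

With 1m rows, preferScan(50_000, 1_000_000, 0.02) favours the scan (5% selectivity), while preferScan(5_000, 1_000_000, 0.01) favours multi Gets (0.5%). The totalRows input is the part that needs statistics.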
>
> Of course this is predicated on whether you have an efficient way to
> represent the rows you are looking for in a filter, so that would probably
> shift this slightly more towards Gets (just imagine a Filter that has to encode
> 100k random row keys to be matched; since Filters are instantiated per store
> there is another natural limit there).
>
>
> As I said below, the crux of the matter is having some histograms of your
> data, so that such a decision could be made automatically.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: lars hofhansl <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Monday, February 18, 2013 5:48 PM
> Subject: Re: Optimizing Multi Gets in hbase
>
> As it happens we did some tests around this last week.
> Turns out doing Gets in batches instead of a scan still gives you 1/3 of
> the performance.
>
> I.e. when you have a table with, say, 10m rows and scanning takes N
> seconds, then calling 10m Gets in batches of 1000 takes ~3N, which is pretty
> impressive.
>
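For reference, splitting a long list of Gets into fixed-size batches needs nothing HBase-specific. A minimal sketch, with GetBatcher and batches as hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list into consecutive batches of at most batchSize items.
final class GetBatcher {
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            result.add(new ArrayList<>(items.subList(from, to)));
        }
        return result;
    }
}
```

Each resulting batch would then go to a call like myHTable.get(batch) from the original post; 10m Gets in batches of 1000 means 10,000 such calls.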
> Now, this is with all data in the cache!
> When the data is not in the cache and the Gets are random it is many
> orders of magnitude slower, as the Gets are sprayed all over the disk. In
> that case sorting the Gets and issuing scans would indeed be much more
> efficient.
>
>
> The Gets in a batch are already sorted on the client, but as N. says it is
> hard to determine when to turn many Gets into a Scan with filters
> automatically. Without statistics/histograms I'd even wager a guess that
> it would be impossible to do.
> Imagine you issue 10000 random Gets, but your table has 10bn rows, in that
> case it is almost certain that the Gets are faster than a scan.
> Now imagine the Gets only cover a small key range. With statistics we could
> tell whether it would be beneficial to turn this into a scan.
>
> It's not that hard to add statistics to HBase. Would do it as part of the
> compactions, and record the histograms in some table.
>
>
> You can always do that yourself. If you suspect you are touching most rows
> in a table/region, just issue a scan with an appropriate filter (may have to
> implement your own filter, though). Maybe we could add a version of RowFilter
> that matches against multiple keys.
>
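The core of such a multi-key filter is a lookup against a sorted set of row keys. The sketch below shows only that matching logic in plain Java; it does not implement the HBase filter interface, and MultiRowKeyMatcher, matches and nextWantedKey are hypothetical names. A real filter would wrap something like this and would also have to serialize the key set to the server.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical matcher: the lookup a multi-key row filter would need, without any HBase types.
final class MultiRowKeyMatcher {
    // Compare row keys as unsigned bytes, lexicographically.
    private static final Comparator<byte[]> UNSIGNED_BYTES = Arrays::compareUnsigned;

    private final TreeSet<byte[]> wanted = new TreeSet<>(UNSIGNED_BYTES);

    MultiRowKeyMatcher(Collection<byte[]> rowKeys) {
        wanted.addAll(rowKeys);
    }

    // True if this row is one of the wanted keys.
    boolean matches(byte[] row) {
        return wanted.contains(row);
    }

    // Smallest wanted key >= row, or null if none is left; a scan could skip ahead to it.
    byte[] nextWantedKey(byte[] row) {
        return wanted.ceiling(row);
    }
}
```

Lookups are O(log n) in the number of wanted keys, so the matching itself scales; the natural limit is more likely the cost of shipping and holding a very large key set.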
>
> -- Lars
>
>
>
> ________________________________
> From: Varun Sharma <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Monday, February 18, 2013 1:57 AM
> Subject: Optimizing Multi Gets in hbase
>
> Hi,
>
> I am trying to do batched get(s) on a cluster. Here is the code:
>
> List<Get> gets = ...
> // Prepare my gets with the rows i need
> myHTable.get(gets);
>
> I have two questions about the above scenario:
> i) Is this the optimal way to do this?
> ii) I have a feeling that if there are multiple gets in this case, on the
> same region, then each one of those shall instantiate separate scan(s) over