HBase, mail # user - Optimizing Multi Gets in hbase


Varun Sharma 2013-02-18, 09:57
Anoop Sam John 2013-02-18, 10:49
Viral Bajaria 2013-02-18, 10:49
Nicolas Liochon 2013-02-18, 10:56
ramkrishna vasudevan 2013-02-18, 11:07
Michael Segel 2013-02-18, 12:52
lars hofhansl 2013-02-19, 01:48
Varun Sharma 2013-02-19, 06:45
lars hofhansl 2013-02-19, 08:02
Nicolas Liochon 2013-02-19, 08:37
Varun Sharma 2013-02-19, 15:52
Re: Optimizing Multi Gets in hbase
Nicolas Liochon 2013-02-19, 17:28
IMHO, the easiest thing to do would be to write a filter.
You need to order the rows, then you can use hints to navigate to the next
row (SEEK_NEXT_USING_HINT).
The main drawback I see is that the filter will be invoked on all region
servers, including the ones that don't need it. But this would also mean
you have a very specific query pattern (which could be the case, I just
don't know), and you can still use the startRow / stopRow of the scan, and
create multiple scans if necessary. I'm also interested in Lars' opinion on
this.

Nicolas
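
A minimal sketch of the filter Nicolas describes, written against the
0.94-era Filter API (the class name SortedRowsFilter is invented here, and
real code would also need to serialize the row set so it can ship to the
region servers):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.SortedSet;
    import java.util.TreeSet;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.filter.FilterBase;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical filter: includes only a sorted set of wanted rows and
    // uses SEEK_NEXT_USING_HINT to jump between them instead of reading
    // every row in the region.
    public class SortedRowsFilter extends FilterBase {

      private final SortedSet<byte[]> rows =
          new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);

      public SortedRowsFilter() {}  // needed for deserialization

      public SortedRowsFilter(SortedSet<byte[]> wanted) {
        this.rows.addAll(wanted);
      }

      @Override
      public ReturnCode filterKeyValue(KeyValue kv) {
        byte[] row = kv.getRow();
        if (rows.contains(row)) {
          return ReturnCode.INCLUDE;             // one of the requested rows
        }
        if (rows.tailSet(row).isEmpty()) {
          return ReturnCode.NEXT_ROW;            // past the last wanted row
        }
        return ReturnCode.SEEK_NEXT_USING_HINT;  // jump to the next wanted row
      }

      @Override
      public KeyValue getNextKeyHint(KeyValue currentKV) {
        // First requested row sorting after the current one; non-empty here
        // because filterKeyValue checked the tail set before hinting.
        byte[] next = rows.tailSet(currentKV.getRow()).first();
        return KeyValue.createFirstOnRow(next);
      }

      // Writable serialization of the row set is omitted in this sketch.
      public void write(DataOutput out) throws IOException {}
      public void readFields(DataInput in) throws IOException {}
    }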

On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:

> I have another question: if I am running a scan wrapped around multiple
> rows in the same region, in the following way:
>
> Scan scan = new Scan(getWithMultipleRowsInSameRegion);
>
> Now, how does execution occur? Is it just a sequential scan across the
> entire region, or does it seek to the HFile blocks containing the actual
> values? What I truly mean is, let's say the multi get is on the following
> rows:
>
> Row1 : HFileBlock1
> Row2 : HFileBlock20
> Row3 : Does not exist
> Row4 : HFileBlock25
> Row5 : HFileBlock100
>
> The efficient way to do this would be to determine the correct blocks
> using the index and then search within the blocks for, say, Row1. Then,
> seek to HFileBlock20 and look for Row2. Eliminate Row3 and then keep on
> seeking to + searching within HFileBlocks as needed.
>
> I am wondering if a scan wrapped around a Get with multiple rows would do
> the same?
>
> Thanks
> Varun
>
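
The client side of the filter approach sketched earlier might look roughly
like this (a sketch only: `conf` is assumed to be an existing Configuration,
"mytable" is invented, and SortedRowsFilter is the hypothetical filter from
above). Sorting the rows and bounding the scan keeps it inside the region:

    // Sort the wanted rows, bound the scan by the first/last of them,
    // and let the filter seek across the HFile blocks in between.
    SortedSet<byte[]> rows = new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);
    rows.add(Bytes.toBytes("Row1"));
    rows.add(Bytes.toBytes("Row2"));
    rows.add(Bytes.toBytes("Row4"));
    rows.add(Bytes.toBytes("Row5"));

    Scan scan = new Scan();
    scan.setStartRow(rows.first());
    // stopRow is exclusive, so append a zero byte to include the last row.
    scan.setStopRow(Bytes.add(rows.last(), new byte[] { 0 }));
    scan.setFilter(new SortedRowsFilter(rows));

    HTable table = new HTable(conf, "mytable");
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // each Result here is one of the requested rows
      }
    } finally {
      scanner.close();
    }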
> On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon <[EMAIL PROTECTED]>
> wrote:
>
> > Looking at the code, it seems possible to do this server side within the
> > multi invocation: we could group the gets by region and do a single scan.
> > We could also add some heuristics if necessary...
> >
> >
> >
> > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> > > I should qualify that statement, actually.
> > >
> > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> > > returned.
> > >
> > > As James Taylor pointed out to me privately: A fairer comparison would
> > > have been to run a scan with a filter that lets x% of the rows pass
> > > (i.e. the selectivity of the scan would be x%) and compare that to a
> > > multi Get of the same x% of the rows.
> > >
> > > There we found that a Scan+Filter is more efficient than issuing multi
> > > Gets if x is >= 1-2%.
> > >
> > >
> > > Or in other words, translating many Gets into a Scan+Filter is
> > > beneficial if the Scan would return at least 1-2% of the rows to the
> > > client. For example:
> > > If you are looking for fewer than 10-20k rows in 1m rows, using multi
> > > Gets is likely more efficient.
> > > If you are looking for more than 10-20k rows in 1m rows, using a
> > > Scan+Filter is likely more efficient.
> > >
> > >
> > > Of course this is predicated on whether you have an efficient way to
> > > represent the rows you are looking for in a filter, so that would
> > > probably shift this slightly more towards Gets (just imagine a Filter
> > > that has to encode 100k random row keys to be matched; since Filters
> > > are instantiated per store there is another natural limit there).
> > >
> > >
> > > As I said below, the crux of the matter is having some histograms of
> > > your data, so that such a decision could be made automatically.
> > >
> > >
> > > -- Lars
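
A back-of-the-envelope version of that decision rule, treating Lars'
measured 1-2% crossover as the threshold (everything else here is
illustrative):

    // Illustrative only: pick a read strategy from estimated selectivity.
    // The ~1% threshold is the crossover Lars measured above.
    public class ReadStrategy {
      /** true => Scan+Filter, false => multi Get. */
      public static boolean preferScanWithFilter(long rowsWanted,
                                                 long rowsInRange) {
        double selectivity = (double) rowsWanted / (double) rowsInRange;
        return selectivity >= 0.01;
      }
    }

With Lars' numbers, 10-20k wanted rows out of 1m gives 1-2%, right at the
crossover; below that, multi Gets likely win.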
> > >
> > >
> > >
> > > ________________________________
> > >  From: lars hofhansl <[EMAIL PROTECTED]>
> > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> > > Sent: Monday, February 18, 2013 5:48 PM
> > > Subject: Re: Optimizing Multi Gets in hbase
> > >
> > > As it happens, we did some tests around this last week.
> > > Turns out doing Gets in batches instead of a scan still gives you 1/3
> > > of the performance.
> > >
> > > I.e. when you have a table with, say, 10m rows and scanning takes N
> > > seconds, then calling 10m Gets in batches of 1000 takes ~3N.
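
For reference, the batched-Get side of that comparison presumably looks
something like this (a sketch; `table`, `wantedRows`, and `process` are
stand-ins, and the usual java.util / org.apache.hadoop.hbase.client imports
are assumed):

    // Sketch: issue Gets in batches of 1000 (one multi-get RPC per batch).
    static void batchedGets(HTable table, List<byte[]> wantedRows)
        throws IOException {
      List<Get> batch = new ArrayList<Get>(1000);
      for (byte[] row : wantedRows) {
        batch.add(new Get(row));
        if (batch.size() == 1000) {
          process(table.get(batch));  // one RPC round for the whole batch
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        process(table.get(batch));    // final partial batch
      }
    }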
Varun Sharma 2013-02-19, 18:19
lars hofhansl 2013-02-19, 18:27
Nicolas Liochon 2013-02-19, 18:42
Nicolas Liochon 2013-02-19, 18:46