HBase, mail # user - Optimizing Multi Gets in hbase


Re: Optimizing Multi Gets in hbase
Nicolas Liochon 2013-02-19, 18:46
As well, an advantage of going only to the servers needed is the famous
MTTR: there is less chance of going to a dead server or to a region that
has just moved.
On Tue, Feb 19, 2013 at 7:42 PM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:

> Interesting; in the client we're already grouping the multiget by location.
> So we could have the filter as HBase core code, and then we could use it
> in the client for the multiget: compared to my initial proposal, we don't
> have to change anything in the server code and we reuse the filtering
> framework. The filter can also be used independently.
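A minimal sketch of the client-side "group by location" step being described, assuming the 0.94-era HTable API; the class name GetGrouper and the variable names are illustrative, not code from this thread:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;

public class GetGrouper {
  /** Buckets the gets of a multi-get by the "host:port" of the server hosting each row. */
  public static Map<String, List<Get>> groupByServer(HTable table, List<Get> gets)
      throws IOException {
    Map<String, List<Get>> byServer = new HashMap<String, List<Get>>();
    for (Get get : gets) {
      // Locate the region, and therefore the region server, for this row key.
      HRegionLocation loc = table.getRegionLocation(get.getRow());
      String server = loc.getHostname() + ":" + loc.getPort();
      List<Get> bucket = byServer.get(server);
      if (bucket == null) {
        bucket = new ArrayList<Get>();
        byServer.put(server, bucket);
      }
      bucket.add(get);
    }
    return byServer;
  }
}

Each bucket can then be sent to its server as a smaller multi-get, so servers that host none of the requested rows are never contacted.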
>
> Is there any issue with this? The reseek seems to be quite smart in the
> way it handles the bloom filters; I don't know if it behaves differently in
> this case vs. a simple get.
>
>
> On Tue, Feb 19, 2013 at 7:27 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
>> I was thinking along the same lines: doing a skip scan via filter
>> hinting. The problem is, as you say, that the Filter is instantiated
>> everywhere and it might be of significant size (it has to hold all the row
>> keys you are looking for).
>>
>>
>> RegionScanner now has a reseek method, so it is possible to do this via a
>> coprocessor. Coprocessors are also loaded per region (but at least not per
>> store), and one can use the shared coproc state I added to alleviate the
>> memory concern.
>>
>> Thinking about this in terms of multiple scans is interesting. One could
>> identify clusters of close row keys in the Gets and issue a Scan for each
>> cluster.
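A rough sketch of the "one Scan per cluster of close row keys" idea. Since the thread does not define "close", the fixed-length shared prefix used here is only an assumed stand-in heuristic, and the class name GetClustering is invented:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class GetClustering {
  /**
   * Builds one Scan per cluster of "close" row keys. The keys must already be
   * sorted; "close" is approximated here by a shared fixed-length prefix.
   */
  public static List<Scan> scansForClusters(List<byte[]> sortedRows, int prefixLength) {
    List<Scan> scans = new ArrayList<Scan>();
    int clusterStart = 0;
    for (int i = 1; i <= sortedRows.size(); i++) {
      boolean endOfCluster = i == sortedRows.size()
          || !samePrefix(sortedRows.get(i - 1), sortedRows.get(i), prefixLength);
      if (endOfCluster) {
        byte[] first = sortedRows.get(clusterStart);
        byte[] last = sortedRows.get(i - 1);
        // Appending a 0x00 byte yields the smallest key greater than "last",
        // so the (exclusive) stop row still lets the scan return "last" itself.
        scans.add(new Scan(first, Bytes.add(last, new byte[] { 0 })));
        clusterStart = i;
      }
    }
    return scans;
  }

  private static boolean samePrefix(byte[] a, byte[] b, int prefixLength) {
    int len = Math.min(prefixLength, Math.min(a.length, b.length));
    return Bytes.compareTo(a, 0, len, b, 0, len) == 0;
  }
}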
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Nicolas Liochon <[EMAIL PROTECTED]>
>> To: user <[EMAIL PROTECTED]>
>> Sent: Tuesday, February 19, 2013 9:28 AM
>> Subject: Re: Optimizing Multi Gets in hbase
>>
>> Imho, the easiest thing to do would be to write a filter.
>> You need to order the rows; then you can use hints to navigate to the next
>> row (SEEK_NEXT_USING_HINT).
>> The main drawback I see is that the filter will be invoked on all region
>> servers, including the ones that don't need it. But that would also mean
>> you have a very specific query pattern (which could be the case, I just
>> don't know), and you can still use the startRow / stopRow of the scan and
>> create multiple scans if necessary. I'm also interested in Lars' opinion on
>> this.
>>
>> Nicolas
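For concreteness, a minimal sketch of such a hinting filter against the 0.94-era, KeyValue-based Filter API; MultiRowHintFilter is an invented name, and the Writable serialization a real 0.94 filter would also need is omitted for brevity:

import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiRowHintFilter extends FilterBase {
  private final List<byte[]> sortedRows;  // the wanted row keys, sorted ascending
  private int idx = 0;                    // index of the next wanted row

  public MultiRowHintFilter(List<byte[]> sortedRows) {
    this.sortedRows = sortedRows;
  }

  @Override
  public ReturnCode filterKeyValue(KeyValue kv) {
    // Skip over wanted rows that sort before the row we are currently on.
    while (idx < sortedRows.size() && compareToRow(sortedRows.get(idx), kv) < 0) {
      idx++;
    }
    if (idx == sortedRows.size()) {
      return ReturnCode.NEXT_ROW;            // nothing left to look for
    }
    if (compareToRow(sortedRows.get(idx), kv) == 0) {
      return ReturnCode.INCLUDE;             // this row is one of the wanted ones
    }
    return ReturnCode.SEEK_NEXT_USING_HINT;  // jump ahead to the next wanted row
  }

  @Override
  public KeyValue getNextKeyHint(KeyValue currentKV) {
    // Reseek target: the first possible KeyValue of the next wanted row.
    return KeyValue.createFirstOnRow(sortedRows.get(idx));
  }

  @Override
  public boolean filterAllRemaining() {
    return idx >= sortedRows.size();         // stop once every wanted row is passed
  }

  private int compareToRow(byte[] row, KeyValue kv) {
    return Bytes.compareTo(row, 0, row.length,
        kv.getBuffer(), kv.getRowOffset(), kv.getRowLength());
  }
}

The filter returns SEEK_NEXT_USING_HINT whenever the scanner sits before the next wanted row, which is what lets the region server jump ahead instead of reading every row between two Gets.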
>>
>>
>>
>> On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma <[EMAIL PROTECTED]>
>> wrote:
>>
>> > I have another question: if I am running a scan wrapped around multiple
>> > rows in the same region, in the following way:
>> >
>> > Scan scan = new Scan(getWithMultipleRowsInSameRegion);
>> >
>> > Now, how does execution occur? Is it just a sequential scan across the
>> > entire region, or does it seek to the HFile blocks containing the actual
>> > values? What I truly mean is, let's say the multi-get is on the following
>> > rows:
>> >
>> > Row1 : HFileBlock1
>> > Row2 : HFileBlock20
>> > Row3 : Does not exist
>> > Row4 : HFileBlock25
>> > Row5 : HFileBlock100
>> >
>> > The efficient way to do this would be to determine the correct blocks
>> > using the index and then search within the blocks for, say, Row1. Then
>> > seek to HFileBlock20 and look for Row2. Eliminate Row3 and then keep
>> > seeking to + searching within HFileBlocks as needed.
>> >
>> > I am wondering if a scan wrapped around a Get with multiple rows would
>> > do the same?
>> >
>> > Thanks
>> > Varun
>> >
>> > On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon <[EMAIL PROTECTED]>
>> > wrote:
>> >
>> > > Looking at the code, it seems possible to do this server side within
>> > > the multi invocation: we could group the gets by region and do a single
>> > > scan. We could also add some heuristics if necessary...
>> > >
>> > >
>> > >
>> > > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > I should qualify that statement, actually.
>> > > >
>> > > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
>> > > > returned.