HBase >> mail # user >> Optimizing Multi Gets in hbase


Varun Sharma 2013-02-18, 09:57
Anoop Sam John 2013-02-18, 10:49
Viral Bajaria 2013-02-18, 10:49
Nicolas Liochon 2013-02-18, 10:56
ramkrishna vasudevan 2013-02-18, 11:07
Michael Segel 2013-02-18, 12:52
lars hofhansl 2013-02-19, 01:48
Varun Sharma 2013-02-19, 06:45
lars hofhansl 2013-02-19, 08:02
Nicolas Liochon 2013-02-19, 08:37
Varun Sharma 2013-02-19, 15:52
Nicolas Liochon 2013-02-19, 17:28
Varun Sharma 2013-02-19, 18:19
lars hofhansl 2013-02-19, 18:27
Re: Optimizing Multi Gets in hbase
Interesting. In the client we already group the multiget by location.
So we could have the filter as HBase core code and then use it in
the client for the multiget. Compared to my initial proposal, we don't have
to change anything in the server code, and we reuse the filtering framework.
The filter can also be used independently.

Is there any issue with this? The reseek seems to be quite smart in the way
it handles the bloom filters, but I don't know whether it behaves differently
in this case vs. a simple get.
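
To make the hinting idea concrete, here is a minimal sketch of a skip scan in plain Java. It only simulates the behaviour: the sorted map stands in for a region, `ceilingEntry` stands in for a reseek, and the class and method names (`SkipScanSketch`, `multiGet`, `seeks`) are made up for illustration and are not HBase APIs. It assumes the wanted row keys are kept sorted, as the filter would require.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.NavigableSet;

// Plain-Java simulation of a skip scan; none of these names are HBase APIs.
class SkipScanSketch {
    /** Number of seeks done by the last call, to show how few positions are touched. */
    static int seeks;

    /** Returns the values of the wanted rows that exist in the sorted store. */
    static List<String> multiGet(NavigableMap<String, String> store,
                                 NavigableSet<String> wanted) {
        List<String> out = new ArrayList<>();
        seeks = 0;
        if (wanted.isEmpty()) {
            return out;
        }
        // Start at the first wanted key, much like a scan startRow.
        Map.Entry<String, String> cur = store.ceilingEntry(wanted.first());
        seeks++;
        while (cur != null) {
            String row = cur.getKey();
            if (wanted.contains(row)) {
                out.add(cur.getValue());        // the row is one we asked for
            }
            // Hint: the smallest wanted key strictly greater than the current row.
            String hint = wanted.higher(row);
            if (hint == null) {
                break;                          // no wanted keys left, stop early
            }
            cur = store.ceilingEntry(hint);     // jump straight to the hint
            seeks++;
        }
        return out;
    }
}
```

With 99 rows in the store and 5 wanted keys (one of them missing), this sketch touches 5 positions instead of walking every row. Whether the real reseek path gives a comparable saving depends on block layout and bloom filters, which this simulation does not model.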
On Tue, Feb 19, 2013 at 7:27 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> I was thinking along the same lines. Doing a skip scan via filter hinting.
> The problem is as you say that the Filter is instantiated everywhere and it
> might be of significant size (have to maintain all row keys you are looking
> for).
>
>
> RegionScanner now has a reseek method, so it is possible to do this via a
> coprocessor. Coprocessors are also loaded per region (but at least not for
> each store), and one can use the shared coproc state I added to alleviate the
> memory concern.
>
> Thinking about this in terms of multiple scans is interesting. One could
> identify clusters of close row keys in the Gets and issue a Scan for each
> cluster.
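
As a rough illustration of the clustering idea, the sketch below groups row keys into ranges, one per scan. It is only a model under loud assumptions: row keys are represented as longs, "close" means a numeric gap of at most `maxGap`, and `GetClusterSketch`, `Range`, and `cluster` are hypothetical names. Real row keys are byte arrays, so a real heuristic would need its own notion of distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: row keys are modeled as longs and "close" as a numeric gap.
class GetClusterSketch {
    /** Inclusive [start, stop] range that one scan would cover. */
    static final class Range {
        final long start;
        long stop;

        Range(long start, long stop) {
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[" + start + "," + stop + "]";
        }
    }

    /** Sorts the keys and merges neighbours whose gap is at most maxGap. */
    static List<Range> cluster(long[] rowKeys, long maxGap) {
        long[] sorted = rowKeys.clone();
        Arrays.sort(sorted);
        List<Range> ranges = new ArrayList<>();
        Range current = null;
        for (long key : sorted) {
            if (current != null && key - current.stop <= maxGap) {
                current.stop = key;             // extend the current cluster
            } else {
                current = new Range(key, key);  // gap too large, start a new cluster
                ranges.add(current);
            }
        }
        return ranges;
    }
}
```

Each resulting range would map to the startRow / stopRow of one scan. Choosing `maxGap` is the heuristic part: too small and it degenerates into one scan per Get, too large and each scan reads many rows nobody asked for.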
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Nicolas Liochon <[EMAIL PROTECTED]>
> To: user <[EMAIL PROTECTED]>
> Sent: Tuesday, February 19, 2013 9:28 AM
> Subject: Re: Optimizing Multi Gets in hbase
>
> Imho, the easiest thing to do would be to write a filter.
> You need to order the rows, then you can use hints to navigate to the next
> row (SEEK_NEXT_USING_HINT).
> The main drawback I see is that the filter will be invoked on all region
> servers, including the ones that don't need it. But this would also mean
> you have a very specific query pattern (which could be the case, I just
> don't know), and you can still use the startRow / stopRow of the scan, and
> create multiple scans if necessary. I'm also interested in Lars' opinion on
> this.
>
> Nicolas
>
>
>
> On Tue, Feb 19, 2013 at 4:52 PM, Varun Sharma <[EMAIL PROTECTED]> wrote:
>
> > I have another question. If I am running a scan wrapped around multiple
> > rows in the same region, in the following way:
> >
> > Scan scan = new Scan(getWithMultipleRowsInSameRegion);
> >
> > how does execution occur? Is it just a sequential scan across the
> > entire region, or does it seek to the HFile blocks containing the actual
> > values? What I mean is, let's say the multi get is on the following rows:
> >
> > Row1 : HFileBlock1
> > Row2 : HFileBlock20
> > Row3 : Does not exist
> > Row4 : HFileBlock25
> > Row5 : HFileBlock100
> >
> > The efficient way to do this would be to determine the correct blocks
> > using the index and then search within the blocks for, say, Row1. Then seek
> > to HFileBlock20 and look for Row2. Eliminate Row3, and then keep on
> > seeking to and searching within HFileBlocks as needed.
> >
> > I am wondering if a scan wrapped around a Get with multiple rows would do
> > the same?
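
For illustration, the index lookup described above can be modeled as a binary search over the first row key of each block. This is a simplified sketch, not the actual HFile reader code: `BlockIndexSketch` and `blockFor` are hypothetical names, and the index is assumed to be a sorted array of first keys.

```java
import java.util.Arrays;

// Simplified model of a block index; not the actual HFile reader code.
class BlockIndexSketch {
    /**
     * firstKeys[i] is the first row key stored in block i, in sorted order.
     * Returns the index of the only block that could contain the row, or -1
     * if the row sorts before the first block.
     */
    static int blockFor(String[] firstKeys, String row) {
        int pos = Arrays.binarySearch(firstKeys, row);
        if (pos >= 0) {
            return pos;                // the row is the first key of a block
        }
        int insertion = -pos - 1;      // index of the first key greater than row
        return insertion - 1;          // the block that starts just before the row
    }
}
```

A row that does not exist still maps to a candidate block here. It can only be eliminated by searching inside that block, or earlier by something like a bloom filter.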
> >
> > Thanks
> > Varun
> >
> > On Tue, Feb 19, 2013 at 12:37 AM, Nicolas Liochon <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Looking at the code, it seems possible to do this server side within
> > > the multi invocation: we could group the gets by region and do a single
> > > scan. We could also add some heuristics if necessary...
> > >
> > >
> > >
> > > On Tue, Feb 19, 2013 at 9:02 AM, lars hofhansl <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > I should qualify that statement, actually.
> > > >
> > > > I was comparing scanning 1m KVs to getting 1m KVs when all KVs are
> > > > returned.
> > > >
> > > > As James Taylor pointed out to me privately: A fairer comparison
> > > > would have been to run a scan with a filter that lets x% of the rows
> > > > pass (i.e. the selectivity of the scan would be x%) and compare that
> > > > to a multi Get of the same x% of the rows.
> > > >
> > > > There we found that a Scan+Filter is more efficient than issuing
Nicolas Liochon 2013-02-19, 18:46