|
Varun Sharma
2013-02-18, 09:57
Anoop Sam John
2013-02-18, 10:49
Viral Bajaria
2013-02-18, 10:49
Nicolas Liochon
2013-02-18, 10:56
ramkrishna vasudevan
2013-02-18, 11:07
Michael Segel
2013-02-18, 12:52
lars hofhansl
2013-02-19, 01:48
Varun Sharma
2013-02-19, 06:45
lars hofhansl
2013-02-19, 08:02
Nicolas Liochon
2013-02-19, 08:37
Varun Sharma
2013-02-19, 15:52
Nicolas Liochon
2013-02-19, 17:28
Varun Sharma
2013-02-19, 18:19
lars hofhansl
2013-02-19, 18:27
Nicolas Liochon
2013-02-19, 18:42
Nicolas Liochon
2013-02-19, 18:46
|
-
Optimizing Multi Gets in hbase
Varun Sharma 2013-02-18, 09:57
Hi,

I am trying to do batched get(s) on a cluster. Here is the code:

    List<Get> gets = ...
    // Prepare my gets with the rows I need
    myHTable.get(gets);

I have two questions about the above scenario:
i) Is this the optimal way to do this?
ii) I have a feeling that if there are multiple gets on the same region, each one of them will instantiate a separate scan over the region, even though a single scan would be sufficient. Am I mistaken here?

Thanks
Varun
-
RE: Optimizing Multi Gets in hbase
Anoop Sam John 2013-02-18, 10:49
It will instantiate one scan operation per Get.

-Anoop-
-
Re: Optimizing Multi Gets in hbase
Viral Bajaria 2013-02-18, 10:49
Hi Varun,

Are your gets around sequential keys? If so, you might benefit from doing scans with a start and stop row. If they are not sequential, I don't think there is a better way, given how you describe the problem.

Besides that, some questions that come to mind:
- How many GET(s) are you issuing simultaneously?
- Are they hitting the same region and hotspotting it?
- Are these GET(s) on the same rowkey but trying to get different column families?

Thanks,
Viral
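The start/stop suggestion can be sketched without any HBase dependency. This is a minimal sketch under stated assumptions: `ScanBounds` and its methods are hypothetical names, and the keys are assumed to be already sorted in byte order. HBase treats a scan's stop row as exclusive, so the sketch appends a zero byte to the largest key so that the last requested row still falls inside the range.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: derives scan bounds that cover a sorted list of row keys.
final class ScanBounds {
    /** Start row is inclusive, so the smallest requested key is used as-is. */
    static byte[] startRow(List<byte[]> sortedKeys) {
        return sortedKeys.get(0).clone();
    }

    /** Stop row is exclusive, so extend the largest key by one 0x00 byte. */
    static byte[] stopRow(List<byte[]> sortedKeys) {
        byte[] last = sortedKeys.get(sortedKeys.size() - 1);
        return Arrays.copyOf(last, last.length + 1); // copyOf pads with zero bytes
    }

    public static void main(String[] args) {
        List<byte[]> keys = Arrays.asList("row-010".getBytes(), "row-042".getBytes());
        System.out.println(Arrays.toString(startRow(keys)));
        System.out.println(Arrays.toString(stopRow(keys)));
    }
}
```

The two arrays would be what you hand to a Scan as its start and stop row. Note that such a scan returns every row between the bounds, so it only pays off when the requested keys are dense within that range.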
-
Re: Optimizing Multi Gets in hbase
Nicolas Liochon 2013-02-18, 10:56
i) Yes, or at least, often yes.
ii) You're right. It's difficult to guess how much it would improve performance (there are a lot of caching effects), but using a single scan could be an interesting optimisation imho.

Nicolas
-
Re: Optimizing Multi Gets in hbase
ramkrishna vasudevan 2013-02-18, 11:07
If the scan is happening on the same region, then going for a Scan would be a better option.

Regards
RAm
-
Re: Optimizing Multi Gets in hbase
Michael Segel 2013-02-18, 12:52
So you'd have to do a little bit of homework up front.

Suppose you have to pull some data from 30K rows out of 10 million. If they are in sort order, you could determine the regions and then think about doing a couple of scans in parallel. But that may be more work than just doing the set of gets. It would be interesting to benchmark the performance...

I wonder if a coprocessor could help speed this up? I mean, use the coprocessor to do all the gets per region, rather than doing a full region scan and then filtering against the list for that region.

Again, this would be for a very specific type of query...
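The "determine the regions" step can be illustrated with a small, library-free model. The sketch below is only an assumption-laden illustration: `RegionGrouper` is a hypothetical name, region boundaries are represented as a sorted set of start keys (with the empty string standing in for the first region), and keys are compared as Java strings rather than as raw bytes the way HBase compares them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model: buckets row keys by the region whose start key precedes them.
final class RegionGrouper {
    static Map<String, List<String>> groupByRegion(TreeSet<String> regionStartKeys,
                                                   List<String> rowKeys) {
        Map<String, List<String>> byRegion = new TreeMap<>();
        for (String row : rowKeys) {
            // The owning region is the one with the greatest start key <= row.
            String regionStart = regionStartKeys.floor(row);
            if (regionStart == null) {
                throw new IllegalArgumentException("No region covers row " + row);
            }
            byRegion.computeIfAbsent(regionStart, k -> new ArrayList<>()).add(row);
        }
        return byRegion;
    }

    public static void main(String[] args) {
        TreeSet<String> regions = new TreeSet<>(Arrays.asList("", "g", "p"));
        List<String> rows = Arrays.asList("apple", "grape", "kiwi", "zebra");
        System.out.println(groupByRegion(regions, rows));
    }
}
```

Each bucket could then be served by its own scan or batch of gets, and the buckets are independent, so they could be processed in parallel.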
-
Re: Optimizing Multi Gets in hbase
lars hofhansl 2013-02-19, 01:48
As it happens, we did some tests around this last week.
Turns out doing Gets in batches instead of a scan still gives you 1/3 of the performance.

I.e. when you have a table with, say, 10m rows and scanning takes N seconds, then calling 10m Gets in batches of 1000 takes ~3N, which is pretty impressive.

Now, this is with all data in the cache!
When the data is not in the cache and the Gets are random, it is many orders of magnitude slower, as the Gets are sprayed all over the disk. In that case sorting the Gets and issuing scans would indeed be much more efficient.

The Gets in a batch are already sorted on the client, but as N. says, it is hard to determine when to turn many Gets into a Scan with filters automatically. Without statistics/histograms I'd even wager a guess that it would be impossible to do.
Imagine you issue 10000 random Gets, but your table has 10bn rows; in that case it is almost certain that the Gets are faster than a scan.
Now imagine the Gets only cover a small key range. With statistics we could tell whether it would be beneficial to turn this into a scan.

It's not that hard to add statistics to HBase. We would do it as part of the compactions, and record the histograms in some table.

You can always do that yourself. If you suspect you are touching most rows in a table/region, just issue a scan with an appropriate filter (you may have to implement your own filter, though). Maybe we could add a version of RowFilter that matches against multiple keys.

-- Lars
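For readers who want to reproduce the "batches of 1000" setup, splitting a large list of requests into fixed-size batches could look like the sketch below. `Batches.partition` is a hypothetical helper, not an HBase API, and the element type is left generic so that it could hold Get objects or plain row keys.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list into consecutive batches of at most batchSize items.
final class Batches {
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            items.add(i);
        }
        System.out.println(partition(items, 1000).size());
    }
}
```

Each batch would then be passed to one batched get call, with the input sorted beforehand so that neighbouring keys land in the same batch.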
-
Re: Optimizing Multi Gets in hbase
Varun Sharma 2013-02-19, 06:45
I am actually more concerned about multiple gets within a region. I think that if random rows within a region are accessed, it should always be one scan instead of one scan per get (just like we do for the BulkDeleteEndpoint). Wouldn't that always be faster?
-
Re: Optimizing Multi Gets in hbase
lars hofhansl 2013-02-19, 08:02
I should qualify that statement, actually.

I was comparing scanning 1m KVs to getting 1m KVs when all KVs are returned.

As James Taylor pointed out to me privately: a fairer comparison would have been to run a scan with a filter that lets x% of the rows pass (i.e. the selectivity of the scan would be x%) and compare that to a multi Get of the same x% of the rows.

There we found that a Scan+Filter is more efficient than issuing multi Gets if x is >= 1-2%.

Or in other words, translating many Gets into a Scan+Filter is beneficial if the Scan would return at least 1-2% of the rows to the client.
For example:
if you are looking for fewer than 10-20k rows in 1m rows, using multi Gets is likely more efficient.
if you are looking for more than 10-20k rows in 1m rows, using a Scan+Filter is likely more efficient.

Of course this is predicated on whether you have an efficient way to represent the rows you are looking for in a filter, so that would probably shift this slightly more towards Gets (just imagine a Filter that has to encode 100k random row keys to be matched; since Filters are instantiated per store, there is another natural limit there).

As I said below, the crux of the matter is having some histograms of your data, so that such a decision could be made automatically.

-- Lars
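The 1-2% rule of thumb can be written down as a small decision function. This is a sketch only: `AccessPlanner`, `Plan`, and the threshold parameter are hypothetical names, the row counts are assumed to come from statistics you maintain yourself, and the 0.02 used in `main` is simply the upper end of the range quoted above, not a tuned value.

```java
// Hypothetical heuristic: pick an access path from the expected selectivity.
final class AccessPlanner {
    enum Plan { MULTI_GET, SCAN_WITH_FILTER }

    static Plan choose(long requestedRows, long totalRows, double selectivityThreshold) {
        if (totalRows <= 0) {
            throw new IllegalArgumentException("totalRows must be positive");
        }
        // Selectivity is the fraction of the table (or region) that would be returned.
        double selectivity = (double) requestedRows / totalRows;
        return selectivity >= selectivityThreshold ? Plan.SCAN_WITH_FILTER : Plan.MULTI_GET;
    }

    public static void main(String[] args) {
        System.out.println(choose(5_000, 1_000_000, 0.02));
        System.out.println(choose(50_000, 1_000_000, 0.02));
    }
}
```

The function deliberately ignores the cost of shipping a large key set inside a filter, which, as noted above, would push the break-even point further towards Gets.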
-
Re: Optimizing Multi Gets in hbase
Nicolas Liochon 2013-02-19, 08:37
Looking at the code, it seems possible to do this server side within the multi invocation: we could group the gets by region, and do a single scan. We could also add some heuristics if necessary...
-
Re: Optimizing Multi Gets in hbase
Varun Sharma 2013-02-19, 15:52
I have another question. Suppose I am running a scan wrapped around multiple rows in the same region, in the following way:

    Scan scan = new Scan(getWithMultipleRowsInSameRegion);

Now, how does execution occur? Is it just a sequential scan across the entire region, or does it seek to the HFile blocks containing the actual values? What I truly mean is, let's say the multi get is on the following rows:

Row1 : HFileBlock1
Row2 : HFileBlock20
Row3 : Does not exist
Row4 : HFileBlock25
Row5 : HFileBlock100

The efficient way to do this would be to determine the correct blocks using the index and then search within the blocks for, say, Row1. Then seek to HFileBlock20 and look for Row2. Eliminate Row3, and then keep on seeking to and searching within HFileBlocks as needed.

I am wondering if a scan wrapped around a Get with multiple rows would do the same?

Thanks
Varun
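The index-driven access pattern described in the question can be modelled in a few lines. This is a simplified illustration, not how HFile code is actually structured: `BlockIndexModel` is a hypothetical name, the block index is reduced to a sorted array holding the first key of each block, and keys are plain strings.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a block index: one "first key" per block, in sorted order.
final class BlockIndexModel {
    /** Index of the block whose key range would hold row, or -1 if row sorts before block 0. */
    static int blockFor(String[] blockFirstKeys, String row) {
        int pos = Arrays.binarySearch(blockFirstKeys, row);
        // For a missing key, binarySearch returns -(insertionPoint) - 1;
        // the candidate block is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Distinct blocks, in order, that must be read to look up all the sorted rows. */
    static Set<Integer> blocksToRead(String[] blockFirstKeys, List<String> sortedRows) {
        Set<Integer> blocks = new LinkedHashSet<>();
        for (String row : sortedRows) {
            int block = blockFor(blockFirstKeys, row);
            if (block >= 0) {
                blocks.add(block);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        String[] index = {"a", "f", "m", "t"};
        System.out.println(blocksToRead(index, Arrays.asList("b", "g", "h", "z")));
    }
}
```

One limitation of the model is worth stating: the index alone cannot tell that a row such as Row3 does not exist. A missing row still maps to a candidate block, and its absence only becomes known after looking inside that block.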
-
Re: Optimizing Multi Gets in hbase
Nicolas Liochon 2013-02-19, 17:28
Imho, the easiest thing to do would be to write a filter.
You need to order the rows, then you can use hints to navigate to the next row (SEEK_NEXT_USING_HINT).
The main drawback I see is that the filter will be invoked on all region servers, including the ones that don't need it. But this would also mean you have a very specific query pattern (which could be the case, I just don't know), and you can still use the startRow / stopRow of the scan, and create multiple scans if necessary. I'm also interested in Lars' opinion on this.

Nicolas
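The hint-driven navigation can be simulated outside HBase to see how many rows a scan would touch. The sketch below is a model of the idea, not a Filter implementation: `SkipScanModel` and `Outcome` are hypothetical names, the table is a sorted in-memory list of string keys, a match stands in for an include decision, and the jump to the next wanted key stands in for SEEK_NEXT_USING_HINT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical simulation of a skip scan over a sorted list of row keys.
final class SkipScanModel {
    /** The rows that matched and the number of table rows that were examined. */
    static final class Outcome {
        final List<String> matched = new ArrayList<>();
        int rowsExamined = 0;
    }

    static Outcome scan(List<String> sortedTableRows, TreeSet<String> wantedRows) {
        Outcome out = new Outcome();
        int i = 0;
        while (i < sortedTableRows.size()) {
            String current = sortedTableRows.get(i);
            out.rowsExamined++;
            if (wantedRows.contains(current)) {
                out.matched.add(current); // wanted row found: keep it and move on
                i++;
                continue;
            }
            // The hint is the next wanted key after the current row.
            String hint = wantedRows.ceiling(current);
            if (hint == null) {
                break; // no wanted keys remain, so the scan can stop
            }
            i = lowerBound(sortedTableRows, hint, i + 1); // seek forward to the hint
        }
        return out;
    }

    /** First index at or after from whose row is >= key (binary search). */
    static int lowerBound(List<String> rows, String key, int from) {
        int lo = from;
        int hi = rows.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (rows.get(mid).compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        List<String> table = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            table.add(String.format("r%02d", i));
        }
        Outcome out = scan(table, new TreeSet<>(Arrays.asList("r03", "r10", "r17")));
        System.out.println(out.matched + " after examining " + out.rowsExamined + " rows");
    }
}
```

In the model, the wanted keys live in a sorted set so that the next hint is a single lookup. That is also where the memory concern comes from: the whole key set has to be available wherever the filter runs.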
-
Re: Optimizing Multi Gets in hbase
Varun Sharma 2013-02-19, 18:19
The other suggestion sounds better to me, where the multi call is modified to run the Get(s) with this new filter, or to just initiate a scan with all the get(s). The client automatically groups the multi calls by region server and only calls the respective region servers, so that would eliminate calling all region servers. This might be an interesting benchmark to run.
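The skip scan discussed in this message (order the wanted rows, then let a hint jump the scanner to the next one) can be sketched in plain Java. This is only a model under stated assumptions: a `TreeMap` stands in for the sorted rows of one region, `ceilingEntry` stands in for a seek, and the class and method names (`SkipScanSketch`, `scan`, `seeks`) are hypothetical. No HBase classes are used, and it is not how HBase implements its filters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java model of a hint-driven skip scan; hypothetical names, no HBase API. */
final class SkipScanSketch {

    /** Number of seeks performed by the most recent call to scan(). */
    static int seeks;

    /**
     * Returns the values of the wanted rows that exist in the sorted store,
     * jumping from one wanted row to the next instead of visiting every row.
     */
    static List<String> scan(NavigableMap<String, String> store,
                             NavigableSet<String> wanted) {
        List<String> results = new ArrayList<>();
        seeks = 0;
        if (wanted.isEmpty()) {
            return results;
        }
        // Start at the first wanted row, similar to setting a scan start row.
        Map.Entry<String, String> current = store.ceilingEntry(wanted.first());
        seeks++;
        while (current != null) {
            String row = current.getKey();
            if (wanted.contains(row)) {
                results.add(current.getValue());   // analogous to INCLUDE
                current = store.higherEntry(row);  // plain step to the next row
            } else {
                // Analogous to SEEK_NEXT_USING_HINT: the hint is the next wanted row.
                String hint = wanted.higher(row);
                if (hint == null) {
                    break;                         // past the last wanted row
                }
                current = store.ceilingEntry(hint);
                seeks++;
            }
        }
        return results;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> store = new TreeMap<>();
        for (int i = 0; i < 100; i++) {
            store.put(String.format("row%03d", i), "v" + i);
        }
        NavigableSet<String> wanted =
                new TreeSet<>(Arrays.asList("row001", "row020", "row075"));
        System.out.println(scan(store, wanted) + " using " + seeks + " seeks");
    }
}
```

A wanted row that does not exist (Row3 in the example above) costs one seek that lands on the following row, after which the next hint moves on. The sorted set of wanted keys is exactly the state that would have to be carried by the filter, which is the memory concern raised in the thread.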
-
Re: Optimizing Multi Gets in hbase
lars hofhansl 2013-02-19, 18:27
I was thinking along the same lines: doing a skip scan via filter hinting. The problem, as you say, is that the Filter is instantiated everywhere and it might be of significant size (it has to maintain all the row keys you are looking for).

RegionScanner now has a reseek method, so it is possible to do this via a coprocessor. Coprocessors are also loaded per region (but at least not for each store), and one can use the shared coproc state I added to alleviate the memory concern.

Thinking about this in terms of multiple scans is interesting. One could identify clusters of close row keys in the Gets and issue a Scan for each cluster.

-- Lars
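The idea of identifying clusters of close row keys and issuing one Scan per cluster might look roughly like the sketch below. It assumes, purely for illustration, that row keys are numeric ids so that "close" can be expressed as a maximum gap. Real HBase row keys are byte arrays with no built-in distance, so the `maxGap` parameter and the `RowKeyClusters` class are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Groups numeric row ids into ranges that could each be served by one scan. */
final class RowKeyClusters {

    /**
     * Sorts the ids and starts a new cluster whenever the gap to the previous
     * id exceeds maxGap. Each result is a closed range {first, last}.
     */
    static List<long[]> cluster(long[] rowIds, long maxGap) {
        List<long[]> ranges = new ArrayList<>();
        if (rowIds.length == 0) {
            return ranges;
        }
        long[] sorted = rowIds.clone();   // do not reorder the caller's array
        Arrays.sort(sorted);
        long first = sorted[0];
        long last = sorted[0];
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] - last > maxGap) {
                ranges.add(new long[] {first, last});  // close the current cluster
                first = sorted[i];
            }
            last = sorted[i];
        }
        ranges.add(new long[] {first, last});          // close the final cluster
        return ranges;
    }

    public static void main(String[] args) {
        long[] ids = {5, 1, 3, 100, 104, 1000};
        for (long[] range : cluster(ids, 10)) {
            System.out.println(range[0] + " .. " + range[1]);
        }
    }
}
```

Each range would map to one scan bounded by a start and stop row. Since the stop row of a scan is exclusive, the stop row would have to be the key just after `last`. Choosing `maxGap` is the heuristic part: a larger gap means fewer scans but more unwanted rows read in between.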
-
Re: Optimizing Multi Gets in hbase
Nicolas Liochon 2013-02-19, 18:42
Interesting, in the client we're doing a group-by-location on the multiget.
So we could have the filter as HBase core code, and then we could use it in the client for the multiget: compared to my initial proposal, we don't have to change anything in the server code and we reuse the filtering framework. The filter can also be used independently.

Is there any issue with this? The reseek seems to be quite smart in the way it handles the bloom filters; I don't know if it behaves differently in this case vs. a simple get.
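The group-by-location step mentioned here can be modeled in a few lines. The sketch below is an assumption-laden illustration and not the HBase client code: regions are represented only by their start keys in a sorted map (the first region having the empty start key), server names are made-up strings, and `GroupGetsByServer` is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Groups requested rows by the server hosting their region; hypothetical model. */
final class GroupGetsByServer {

    /**
     * regionStartToServer maps each region's start key to the server hosting it.
     * A row belongs to the region with the greatest start key <= row.
     */
    static Map<String, List<String>> group(
            NavigableMap<String, String> regionStartToServer, List<String> rows) {
        Map<String, List<String>> byServer = new TreeMap<>();
        for (String row : rows) {
            Map.Entry<String, String> region = regionStartToServer.floorEntry(row);
            if (region == null) {
                throw new IllegalArgumentException("no region covers row " + row);
            }
            byServer.computeIfAbsent(region.getValue(), s -> new ArrayList<>()).add(row);
        }
        // A hint-based filter needs its rows in order, so sort each group.
        for (List<String> group : byServer.values()) {
            Collections.sort(group);
        }
        return byServer;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> regions = new TreeMap<>();
        regions.put("", "rs1");    // first region starts at the empty key
        regions.put("g", "rs2");
        regions.put("p", "rs3");
        List<String> rows = Arrays.asList("apple", "zebra", "banana", "quince");
        System.out.println(group(regions, rows));
    }
}
```

Servers that host none of the requested rows never appear in the result, which is the point made in the thread: only the servers that are actually needed get called, and each would receive its own sorted list of rows to hand to the filter.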
-
Re: Optimizing Multi Gets in hbase
Nicolas Liochon 2013-02-19, 18:46
As well, an advantage of going only to the servers needed is the famous MTTR: there is less chance of going to a dead server or to a region that just moved.
|