I recently came across the pattern of adding a salting prefix to the
row keys to prevent hotspotting. Still trying to wrap my head around
it and I have a few questions.
- Is there ever a reason to salt to more buckets than there are region
servers? The only reason why I think that may be beneficial is to
anticipate future growth???
- Is it beneficial to always hash against a known number of buckets
(ie never change the size) that way for any individual row key you can
always determine the prefix?
- Are there any good use cases of this pattern out in the wild?
Well kept reading on this subject and realized my second question may
not be appropriate since this prefix salting pattern assumes that the
prefix is random. I thought it was actually based off a hash that
could be predetermined so you could alwasy, if needed, get to the
exact row key with one get. Would there be something wrong with doing
this.. ie, using a modulo of the hash of the key?
On Sat, May 17, 2014 at 8:28 PM, Software Dev <[EMAIL PROTECTED]> wrote:
No, there's nothing wrong with your thinking. That's exactly what Phoenix
does - use the modulo of the hash of the key. It's important that you can
calculate the prefix byte so that you can still do fast point lookups.
Using a modulo that's bigger than the number of region servers can make
sense as well (up to the overall number of cores in your cluster). You
can't change the modulo without rewriting the data, so factoring in future
growth makes sense.
On Sat, May 17, 2014 at 8:50 PM, Software Dev <[EMAIL PROTECTED]>wrote:
I think I should dust off my schema design talk… clearly the talks given by some of the vendors don’t really explain things …
(Hmmm. Strata London?)
See my reply below…. Note I used SHA-1. MD-5 should also give you roughly the same results.
On May 18, 2014, at 4:28 AM, Software Dev <[EMAIL PROTECTED]> wrote:
If you add a salt, you’re prepending a random number to a row in order to avoid hot spotting. It amazes me that Sematext never went back and either removed the blog or fixed it and now the bad idea is getting propagated. Adding a random value to give you a bit of randomness now means that you can’t do a range scan, or fetch the specific row with a single get() so you’re going to end up boiling the ocean to get your data. You’re better off using hive/spark/shark than hbase.
As James tries to point out, you take the hash of the row so that you can easily retrieve the value. But rather than prepend a 160 bit hash, you can easily achieve the same thing by just truncating the hash to the first byte in order to get enough randomness to avoid hot spotting. Of course, the one question you should ask is why don’t you just take the hash as the row key and then have a 160 bit row key (40 bytes in length)? Then store the actual key as a column in the table.
And then there’s a bigger question… why are you worried about hot spotting? Are you adding rows where the row key is sequential? Or are you worried about when you first start loading rows, that you are hot spotting, but the underlying row key is random enough that once the first set of rows are added, HBase splitting regions will be enough?
Think about how HBase splits regions.
Don’t take the modulo, just truncate to the first byte. Taking the modulo is again a dumb idea, but not as dumb as using a salt.
Keep in mind that the first byte of the hash is going to be 0-f in a character representation. (4 bits of the 160bit key) So you have 16 values to start with.
That should be enough.
Your question doesn’t make sense.
Deduping data sets.
NOTE: Many people worry about hot spotting when they really don’t have to do so. Hot spotting that occurs on a the initial load of a table is OK. Its when you have a sequential row key that you run in to problems with hot spotting and regions being only half filled.
You may be missing the point. The primary reason for the salt prefix
pattern is to avoid hotspotting when inserting time series data AND at
the same time provide a way to perform range scans.
The data being inserted will be a constant stream of time ordered data
so yes, hotspotting will be an issue
That's not accurate. To perform a range scan you would just need to
open up N scanners where N is the size of the buckets/random prefixes
Well the only reason why I would think using a salt would be
beneficial is to limit the number of scanners when performing a range
scan. See above comment. And yes, performing a range scan will be our
primary read pattern.
On Sun, May 18, 2014 at 2:36 AM, Michael Segel
<[EMAIL PROTECTED]> wrote:
James, thanks for the input. Not too familiar with Phoenix although it
looks like a great contrib. Unfortunately our main client is ruby
using the thrift api. Using the thrift api also makes parallel scans
tough, if not impossible.
On Sat, May 17, 2014 at 9:31 PM, James Taylor <[EMAIL PROTECTED]> wrote:
No, you’re missing the point.
Its not a good idea or design.
Is your data mutable or static?
To your point. Everytime you want to do a simple get() you have to open up n get() statements. On your range scans you will have to do n range scans, then join and sort the result sets. The fact that each result set is in sort order will help a little, but still not that clean.
On May 18, 2014, at 4:58 PM, Software Dev <[EMAIL PROTECTED]> wrote:
In our measurements, scanning is improved by performing against n
range scans rather than 1 (since you are effectively striping the
reads). This is even better when you don't necessary care about the
order of every row, but want every row in a given range (then you can
just get whatever row is available from a buffer in the client).
On Sun, May 18, 2014 at 1:07 PM, Michael Segel
<[EMAIL PROTECTED]> wrote:
@Software Dev - might be feasible to implement a Thrift client that speaks
Phoenix JDBC. I believe this is similar to what Hive has done.
On Sun, May 18, 2014 at 1:19 PM, Mike Axiak <[EMAIL PROTECTED]> wrote:
You do realize that in the general case you want to return the result set in sort order.
So you will have to put the resulting range scans in sort order.
If you’re saying that you don’t care about the order of the row sets… then why are you using a sequential row key which causes hot spotting in the first place?
On May 18, 2014, at 9:19 PM, Mike Axiak <[EMAIL PROTECTED]> wrote:
1) You can still query in sorted order, in which case N scans is
beneficial. (In our tests: ~25% faster for N=2, going up to about ~50%
faster for N=16.)
2) Many times you would issue a scan without necessarily caring about
individual record order. (e.g.: let me perform some operation on all
events in this hour), while still requiring the ordering can work in
the general case.
On Mon, May 19, 2014 at 3:24 AM, Michael Segel
<[EMAIL PROTECTED]> wrote:
You have n different scans and you then have to put the rows in sort order from each scan in to a single result set.
While in each scan, the RS is in sort order, the overall set of RS needs to be merged in to one RS and that’s where you start to have issues.
And again… depending on what you want, in the general case, you need the RS in sort order.
On May 19, 2014, at 1:24 PM, Mike Axiak <[EMAIL PROTECTED]> wrote:
On Mon, May 19, 2014 at 8:53 AM, Michael Segel
<[EMAIL PROTECTED]> wrote:
What issues? As I said, in multiple tests we saw performance
improvements across the board with this strategy.
You run n scans in parallel.
You want a single result set in sort order.
How do you do that?
That’s the extra work that you don’t have when you have a single result set.
This goes in to why the work done for secondary indexing to be associated with the base table won’t scale or work when you have to consider joins.
Remember to think beyond your specific use case and think about a general use case. You said that you thought about not caring about the RS order when in the general case you have to consider it.
Think of it this way…
In many RDBMSs you have two ways to handle parallelism.
You can partition your data in a round robin fashion, or you can partition your data against a range.
In one use case, the client used a date range partition. That is that they created a partition based on the month of the data verus just storing it on a round robin fashion.
In one you get a high degree of parallelism because you’re going against the data that’s spread across the nodes in the database.
In the other, your data is segmented so you’re only going after a subset of your data that’s local on to a single system.
Which is better?
Which is more efficient?
On May 19, 2014, at 2:00 PM, Mike Axiak <[EMAIL PROTECTED]> wrote: