Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> Rowkey hashing to avoid hotspotting

Copy link to this message
Re: Rowkey hashing to avoid hotspotting
Reading hot spotting?
Hmmm there's a cache and I don't see any real use cases where you would have it occur naturally.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 17, 2012, at 10:53 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> The most common reason for RS hotspotting during writing data in HBase is
> writing rows with monotonically increasing/decreasing row keys. E.g. if you
> put timestamp in the first part of your key, then you are likely to have
> monotonically increasing row keys. You can find more info about this issue
> and how to solve it here: [1] and also you may want to look at already
> implemented salting solution [2].
> As for RS hotspotting during reading - it is hard to predict without
> knowing what it the most common data access patterns. E.g. putting model #
> in first part of a key may seem like a good distribution, but if your web
> site used mostly by Mercedes owners, the majority of the read load may be
> directed to just few regions. Again, salting can help a lot here.
> +1 to what Cristofer said on other things, esp: use partial key scans were
> possible instead of filters and pre-split your table.
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
> [1] http://bit.ly/HnKjbc
> [2] https://github.com/sematext/HBaseWD
> On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
>> Hi Cristofer,
>> Thanks for elaborate response!!!
>> I have no much information about production data as I work with partial
>> data. But based on discussion with my project partners, I have some answers
>> for you.
>> Number of model numbers and serial numbers will be finite. Not so many...
>> As far as I know,there is no predefined rule for model number or serial
>> number creation.
>> I have two access pattern. I count the number of rows for a specific model
>> number. I use rowkey filter for this. Also I filter the rows based on
>> model, serial number and some other columns. I scan the table with column
>> value filter for this case.
>> I will evaluate salting as you have explained.
>> Regards,
>> Anand.C
>> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
>> [EMAIL PROTECTED]> wrote:
>>> Hi Anand,
>>> As usual, the answer is that 'it depends'  :)
>>> I think that the main question here is: why are you afraid that this
>> setup
>>> would lead to region server hotspotting? Is because you don't know how
>> your
>>> production data will seems?
>>> Based on what you told about your rowkey, you will query mostly by
>>> providing model no. + serial no., but:
>>> 1 - How is your rowkey distribution? There are tons of different
>>> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
>>> serialNumbers? Few of both?
>>> 2 - Putting modelNumber in front of your rowkey means that your data will
>>> be sorted by rowkey. So, what is the rule that determinates a modelNumber
>>> creation? Is it a sequential number that will be increased by time? If
>> so,
>>> are newer members accessed a lot more than older members? If not, what
>> will
>>> drive this number? Is it an encoding rule?
>>> 3 - Do you expect more write/read load over a few of these modelNumbers
>>> and/or serialNumbers? Will it be similar to a Pareto Distribution?
>>> Distributed over what?
>>> Also, two other things got my attention here...
>>> 1 - Why are you filtering with regex? If your queries are over model no.
>> +
>>> serial no., why don't you just scan starting by your
>>> modelNumber+SerialNumber, and stoping on your next
>>> modelNumber+SerialNumber? Or is there another access pattern that doesn't
>>> apply to your composited rowkey?
>>> 2 - Why do you have to add a timestamp to ensure uniqueness?
>>> Now, answering your question without more info about your data, you can
>>> apply hash in two ways:
>>> 1 - Generating a hash (MD5 is the most common as far as I read about) and