HBase >> mail # user >> Re: Rowkey hashing to avoid hotspotting


Re: Rowkey hashing to avoid hotspotting
Read hotspotting?
Hmm, there's a cache, and I don't see any real use case where it would occur naturally.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 17, 2012, at 10:53 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> The most common reason for RS hotspotting when writing data to HBase is
> writing rows with monotonically increasing/decreasing row keys. E.g. if you
> put a timestamp in the first part of your key, then you are likely to have
> monotonically increasing row keys. You can find more info about this issue
> and how to solve it here: [1]. You may also want to look at an already
> implemented salting solution [2].
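
For illustration, here is a minimal sketch of the salting idea, not the HBaseWD API. It assumes a one-byte bucket prefix derived from an MD5 hash of the original key; `NUM_BUCKETS`, `salted_key` and `all_bucket_prefixes` are hypothetical names, and the bucket count of 16 is arbitrary.

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count; tune to the number of regions/servers


def salted_key(row_key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix row_key with a one-byte bucket id derived from its hash.

    The prefix is deterministic, so a point lookup can recompute it from
    the original key.
    """
    bucket = hashlib.md5(row_key).digest()[0] % num_buckets
    return bytes([bucket]) + row_key


def all_bucket_prefixes(num_buckets: int = NUM_BUCKETS):
    """Prefixes a range read has to fan out over, one scan per bucket."""
    return [bytes([b]) for b in range(num_buckets)]


if __name__ == "__main__":
    # Monotonically increasing keys (timestamps) end up in different buckets.
    timestamps = [str(1342540000 + i).encode() for i in range(100)]
    used = {salted_key(ts)[0] for ts in timestamps}
    print(f"{len(used)} of {NUM_BUCKETS} buckets used")
```

The cost is on the read side: a scan over a range of original keys has to be run once per bucket prefix and the results merged.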
>
> As for RS hotspotting during reading - it is hard to predict without
> knowing what the most common data access patterns are. E.g. putting model #
> in the first part of a key may seem like a good distribution, but if your web
> site is used mostly by Mercedes owners, the majority of the read load may be
> directed to just a few regions. Again, salting can help a lot here.
>
> +1 to what Cristofer said on other things, esp: use partial key scans where
> possible instead of filters, and pre-split your table.
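
As a rough sketch of what pre-splitting could look like for keys whose first byte is spread roughly uniformly over 0..255 (for example a hash byte), the helper below computes evenly spaced region boundaries. `split_points` is a hypothetical name, and the uniform first byte is an assumption; the resulting byte strings are the kind of value one would supply as split keys at table creation.

```python
from typing import List


def split_points(num_regions: int, key_space: int = 256) -> List[bytes]:
    """Return num_regions - 1 one-byte boundaries that divide the key space evenly."""
    if not 1 <= num_regions <= key_space:
        raise ValueError("num_regions must be between 1 and key_space")
    # Boundary i is the first key of region i, so region 0 needs no boundary.
    return [bytes([i * key_space // num_regions]) for i in range(1, num_regions)]


if __name__ == "__main__":
    for boundary in split_points(4):
        print(boundary.hex())
```

If the leading byte is a small bucket id instead (say 0..15), the same function can be called with `key_space` set to the bucket count.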
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> [1] http://bit.ly/HnKjbc
> [2] https://github.com/sematext/HBaseWD
>
> On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
> [EMAIL PROTECTED]> wrote:
>
>> Hi Cristofer,
>>
>> Thanks for the elaborate response!
>>
>> I don't have much information about the production data, as I work with
>> partial data. But based on discussions with my project partners, I have
>> some answers for you.
>>
>> The number of model numbers and serial numbers will be finite, and not very
>> large. As far as I know, there is no predefined rule for model number or
>> serial number creation.
>>
>> I have two access patterns. First, I count the number of rows for a specific
>> model number; I use a rowkey filter for this. Second, I filter the rows
>> based on model, serial number and some other columns; I scan the table with
>> a column value filter for this case.
>>
>> I will evaluate salting as you have explained.
>>
>> Regards,
>> Anand.C
>>
>> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Hi Anand,
>>>
>>> As usual, the answer is that 'it depends'  :)
>>>
>>> I think that the main question here is: why are you afraid that this setup
>>> would lead to region server hotspotting? Is it because you don't know what
>>> your production data will look like?
>>>
>>> Based on what you told us about your rowkey, you will query mostly by
>>> providing model no. + serial no., but:
>>> 1 - How are your rowkeys distributed? Are there tons of different
>>> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
>>> serialNumbers? Few of both?
>>> 2 - Putting modelNumber at the front of your rowkey means that your data
>>> will be sorted by modelNumber first. So, what is the rule that determines
>>> how a modelNumber is created? Is it a sequential number that increases
>>> over time? If so, are newer members accessed a lot more than older
>>> members? If not, what will drive this number? Is it an encoding rule?
>>> 3 - Do you expect more write/read load on a few of these modelNumbers
>>> and/or serialNumbers? Will it be similar to a Pareto distribution?
>>> Distributed over what?
>>>
>>> Also, two other things got my attention here...
>>> 1 - Why are you filtering with a regex? If your queries are over model no.
>>> + serial no., why don't you just scan starting at your
>>> modelNumber+serialNumber and stopping at your next
>>> modelNumber+serialNumber? Or is there another access pattern that doesn't
>>> fit your composite rowkey?
>>> 2 - Why do you have to add a timestamp to ensure uniqueness?
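
The start/stop-row scan suggested in point 1 can be sketched as follows. This only computes the two boundary keys; the scan itself would use whatever client is at hand. `prefix_scan_range`, the `|` separator and the sample values `C200` and `SN001` are made up for illustration and do not come from the thread.

```python
from typing import Optional, Tuple


def prefix_scan_range(prefix: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (start_row, stop_row) covering every key that begins with prefix.

    stop_row is exclusive. None means "no upper bound", which happens when
    the prefix is empty or consists only of 0xff bytes.
    """
    # Trailing 0xff bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return prefix, None
    # Increment the last remaining byte to get the smallest key past the prefix.
    stop_row = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, stop_row


if __name__ == "__main__":
    # All rows for one hypothetical model number, e.g. to count them.
    print(prefix_scan_range(b"C200|"))
    # All rows for one model number + serial number, whatever the timestamp suffix.
    print(prefix_scan_range(b"C200|SN001|"))
```

Because rows are stored sorted by key, such a range touches only the regions that hold the prefix, whereas a filter over a full table scan still reads every row.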
>>>
>>> Now, answering your question without more info about your data, you can
>>> apply a hash in two ways:
>>> 1 - Generating a hash (MD5 is the most common, as far as I have read) and