HBase >> mail # user >> Rowkey hashing to avoid hotspotting


Re: Rowkey hashing to avoid hotspotting
Reading hot spotting?
Hmmm there's a cache and I don't see any real use cases where you would have it occur naturally.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 17, 2012, at 10:53 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> The most common reason for RS hotspotting when writing data to HBase is
> writing rows with monotonically increasing/decreasing row keys. E.g. if you
> put a timestamp in the first part of your key, then you are likely to have
> monotonically increasing row keys. You can find more info about this issue
> and how to solve it here: [1]. You may also want to look at an already
> implemented salting solution [2].
>
> As for RS hotspotting during reading - it is hard to predict without
> knowing what the most common data access patterns are. E.g. putting model #
> in the first part of a key may seem like a good distribution, but if your web
> site is used mostly by Mercedes owners, the majority of the read load may be
> directed to just a few regions. Again, salting can help a lot here.
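
A minimal sketch of the salting idea mentioned above, assuming a one-byte bucket prefix derived from an MD5 hash of the original key. This is not the HBaseWD API; the names `salted_key`, `bucket_scan_prefixes` and `NUM_BUCKETS` are hypothetical and only illustrate the technique.

```python
import hashlib
from typing import List

NUM_BUCKETS = 16  # hypothetical bucket count, chosen only for illustration


def salted_key(row_key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix row_key with a one-byte bucket derived from a hash of the key.

    The salt is deterministic, so a point read can recompute it from the
    original key, while consecutive keys land in different buckets.
    """
    bucket = hashlib.md5(row_key).digest()[0] % num_buckets
    return bytes([bucket]) + row_key


def bucket_scan_prefixes(key_prefix: bytes, num_buckets: int = NUM_BUCKETS) -> List[bytes]:
    """Return one scan prefix per bucket.

    Rows sharing an original prefix are spread over all buckets, so a
    range read has to issue one scan per bucket and merge the results.
    """
    return [bytes([b]) + key_prefix for b in range(num_buckets)]
```

The trade-off is visible in `bucket_scan_prefixes`: writes are spread out, but every range read fans out into one scan per bucket.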
>
> +1 to what Cristofer said on other things, esp: use partial key scans where
> possible instead of filters and pre-split your table.
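
As a rough illustration of pre-splitting, the sketch below computes evenly spaced split points over the first byte of the row key. It assumes the first byte is roughly uniformly distributed (for example a hashed or salted prefix); `uniform_split_points` is a hypothetical helper, not part of HBase.

```python
from typing import List


def uniform_split_points(num_regions: int) -> List[bytes]:
    """Return num_regions - 1 one-byte split keys spaced evenly over 0..255.

    Assumes the first byte of every row key is roughly uniform, e.g. a
    hash-derived prefix. With skewed first bytes the regions would be uneven.
    """
    if not 1 <= num_regions <= 256:
        raise ValueError("num_regions must be between 1 and 256")
    return [bytes([(i * 256) // num_regions]) for i in range(1, num_regions)]
```

For four regions this yields the boundaries 0x40, 0x80 and 0xC0, which could then be passed as split keys when the table is created.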
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> [1] http://bit.ly/HnKjbc
> [2] https://github.com/sematext/HBaseWD
>
> On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
> [EMAIL PROTECTED]> wrote:
>
>> Hi Cristofer,
>>
>> Thanks for the elaborate response!!!
>>
>> I do not have much information about the production data, as I work with
>> partial data. But based on discussions with my project partners, I have
>> some answers for you.
>>
>> The number of model numbers and serial numbers will be finite. Not so many...
>> As far as I know, there is no predefined rule for model number or serial
>> number creation.
>>
>> I have two access patterns. First, I count the number of rows for a specific
>> model number. I use a rowkey filter for this. Second, I filter the rows based
>> on model, serial number and some other columns. I scan the table with a
>> column value filter for this case.
>>
>> I will evaluate salting as you have explained.
>>
>> Regards,
>> Anand.C
>>
>> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Hi Anand,
>>>
>>> As usual, the answer is that 'it depends'  :)
>>>
>>> I think that the main question here is: why are you afraid that this
>>> setup would lead to region server hotspotting? Is it because you don't
>>> know what your production data will look like?
>>>
>>> Based on what you told us about your rowkey, you will query mostly by
>>> providing model no. + serial no., but:
>>> 1 - How is your rowkey distributed? Are there tons of different
>>> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
>>> serialNumbers? Few of both?
>>> 2 - Putting modelNumber in front of your rowkey means that your data will
>>> be sorted by modelNumber first. So, what is the rule that determines how a
>>> modelNumber is created? Is it a sequential number that increases over
>>> time? If so, are newer members accessed a lot more than older members? If
>>> not, what will drive this number? Is it an encoding rule?
>>> 3 - Do you expect more write/read load over a few of these modelNumbers
>>> and/or serialNumbers? Will it be similar to a Pareto Distribution?
>>> Distributed over what?
>>>
>>> Also, two other things got my attention here...
>>> 1 - Why are you filtering with regex? If your queries are over model no.
>>> + serial no., why don't you just scan starting at your
>>> modelNumber+SerialNumber, and stopping at your next
>>> modelNumber+SerialNumber? Or is there another access pattern that doesn't
>>> apply to your composite rowkey?
>>> 2 - Why do you have to add a timestamp to ensure uniqueness?
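
The start/stop-row scan suggested in point 1 can be sketched as follows. The key layout `model|serial|timestamp` with a `|` delimiter is an assumption made for illustration, and `prefix_stop_row` and `scan_prefix` are hypothetical helpers that simulate a range scan over sorted keys rather than calling HBase.

```python
import bisect
from typing import List


def prefix_stop_row(prefix: bytes) -> bytes:
    """Return the smallest key greater than every key starting with prefix.

    Trailing 0xFF bytes cannot be incremented, so they are dropped first.
    An empty result means "scan to the end of the table".
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def scan_prefix(sorted_keys: List[bytes], prefix: bytes) -> List[bytes]:
    """Simulate a scan over [prefix, stop_row) on lexicographically sorted keys."""
    start = bisect.bisect_left(sorted_keys, prefix)
    stop_row = prefix_stop_row(prefix)
    end = bisect.bisect_left(sorted_keys, stop_row) if stop_row else len(sorted_keys)
    return sorted_keys[start:end]


# Hypothetical rows laid out as model|serial|timestamp.
keys = sorted([
    b"C200|SN001|t1",
    b"C200|SN001|t2",
    b"C200|SN002|t1",
    b"E350|SN001|t1",
])
rows_for_unit = scan_prefix(keys, b"C200|SN001")  # one model + serial number
rows_for_model = scan_prefix(keys, b"C200|")      # all rows of one model
```

Because the range is bounded by the key prefix, only the rows in that range are touched, whereas a regex or column value filter still has to examine every row in the scanned range.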
>>>
>>> Now, answering your question without more info about your data, you can
>>> apply a hash in two ways:
>>> 1 - Generating a hash (MD5 is the most common as far as I have read) and