HBase >> mail # user >> Is it necessary to set MD5 on rowkey?


bigdata 2012-12-18, 09:20
Doug Meil 2012-12-18, 13:40
Damien Hardy 2012-12-18, 09:33
Michael Segel 2012-12-18, 13:52
bigdata 2012-12-18, 15:20
Alex Baranau 2012-12-18, 17:12
Re: Is it necessary to set MD5 on rowkey?
Quick answer...

Look at the salt.
It's just a number from a round-robin counter.
There is no tie between the salt and the row key.

So when you want to fetch a single row, how do you do it?
...
;-)
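Mike's objection can be made concrete: with a round-robin salt, the prefix is unrelated to the key itself, so a point lookup cannot recompute the salt and must probe every possible salted key. A minimal sketch (illustrative Python, not HBase client code; the two-digit prefix layout and names are assumptions for the example):

```python
import itertools

NUM_SALTS = 16  # illustrative number of salt buckets

# Round-robin salting: the salt depends on write order, NOT on the key.
_counter = itertools.count()

def salted_key_round_robin(row_key: str) -> str:
    salt = next(_counter) % NUM_SALTS
    return f"{salt:02d}-{row_key}"

# Because the salt cannot be recomputed from the key alone, a single-row
# fetch has to try every possible salt prefix:
def candidate_keys(row_key: str) -> list[str]:
    return [f"{s:02d}-{row_key}" for s in range(NUM_SALTS)]
```

Note how two writes of the same row key can land under different prefixes, which is exactly why a single Get no longer works.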

On Dec 18, 2012, at 11:12 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> Hello,
>
> @Mike:
>
> I'm the author of that post :).
>
> Quick reply to your last comment:
>
> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
> idea" in a more specific way than "Fetching data takes more effort"? That
> would be helpful for anyone who is looking into using this approach.
>
> 2) The approach described in the post also says you can prefix with the
> hash, you probably missed that.
>
> 3) I believe your answer, "use MD5 or SHA-1" doesn't help bigdata guy.
> Please re-read the question: the intention is to distribute the load while
> still being able to do "partial key scans". The blog post linked above
> explains one possible solution for that, while your answer doesn't.
>
> @bigdata:
>
> Basically when it comes to solving two issues: distributing writes and
> having ability to read data sequentially, you have to balance between being
> good at both of them. Very good presentation by Lars:
> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012,
> slide 22. You will see how this is correlated. In short:
> * having an md5/other hash prefix on the key does better w.r.t. distributing
> writes, while compromising the ability to do range scans efficiently
> * having a very limited number of 'salt' prefixes still allows range
> scans (less efficient than normal range scans, of course, but still good
> enough in many cases), while providing a worse distribution of writes
>
> In the latter case, by choosing the number of possible 'salt' prefixes
> (which could be derived from hashed values, etc.) you can balance between
> write-distribution efficiency and the ability to run fast range scans.
>
> Hope this helps
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
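The balance Alex describes, a small number of salt buckets derived deterministically from a hash of the key, can be sketched as follows. The bucket count and key layout are illustrative assumptions, not HBase API:

```python
import hashlib

NUM_BUCKETS = 8  # tunable: fewer buckets = cheaper scans, worse write spread

def bucket_prefix(row_key: str) -> str:
    # Deterministic salt derived from the key itself, so a point
    # lookup needs only ONE Get, unlike a round-robin salt.
    first_byte = hashlib.md5(row_key.encode()).digest()[0]
    return f"{first_byte % NUM_BUCKETS}-{row_key}"

def range_scan_prefixes(start: str, stop: str) -> list[tuple[str, str]]:
    # A logical range scan fans out into one scan per bucket.
    return [(f"{b}-{start}", f"{b}-{stop}") for b in range(NUM_BUCKETS)]
```

Raising NUM_BUCKETS moves the design toward the full-hash end of the spectrum; lowering it moves toward plain sequential keys.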
>
> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <[EMAIL PROTECTED]>wrote:
>
>>
>> Hi,
>>
>> First, the use of a 'Salt' is a very, very bad idea, and I would really
>> hope that the author of that blog takes it down.
>> While it may solve an initial problem in terms of region hot spotting, it
>> creates another problem when it comes to fetching data: fetching takes
>> more effort.
>>
>> With respect to using a hash (MD5 or SHA-1), you are creating a more random
>> key that is unique to the record. Some would argue that with MD5 or SHA-1
>> you could mathematically have a collision; however, you could then
>> append the key to the hash to guarantee uniqueness. You could also
>> take the hash, truncate it to the first byte, and then
>> append the record key. This should give you enough randomness to avoid hot
>> spotting after the initial region completes, and you could pre-split
>> any number of regions. (The first byte has 256 possible values, so you can
>> program the split.)
>>
>>
>> Having said that... yes, you lose the ability to perform a sequential scan
>> of the data.  At least to a point.  It depends on your schema.
>>
>> Note that you need to think about how you are primarily going to access
>> the data.  You can then determine the best way to store the data to gain
>> the best performance. For some applications... the region hot spotting
>> isn't an important issue.
>>
>> Note YMMV
>>
>> HTH
>>
>> -Mike
>>
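Mike's truncated-hash scheme, one MD5 byte prepended to the record key plus pre-split regions on that byte, might look like the sketch below. Helper names are hypothetical; this is not HBase client code:

```python
import hashlib

def hashed_key(row_key: str) -> bytes:
    # One-byte MD5 prefix + the original key: the prefix spreads writes,
    # and appending the key guarantees uniqueness even if prefixes collide.
    prefix = hashlib.md5(row_key.encode()).digest()[:1]
    return prefix + row_key.encode()

def split_points(num_regions: int) -> list[bytes]:
    # Pre-split on the first byte: 256 possible values, evenly divided,
    # yields num_regions - 1 split points.
    step = 256 // num_regions
    return [bytes([i * step]) for i in range(1, num_regions)]
```

Because the prefix is derived from the key, a point lookup recomputes it and issues a single Get; only range scans over the original key order are lost.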
>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> There is a middle ground between sequential keys (hot-spotting risk) and md5
>>> (heavy scans):
>>> * you can use composite keys with a leading field that segregates data
>>> (hostname, product name, metric name), like OpenTSDB
>>> * or use a salt with a limited number of values (for example,
>>> substr(md5(rowid),0,1) = 16 values),
>>>   so that a scan is a combination of 16 filters, one per salt value
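Damien's substr(md5(rowid),0,1) variant gives exactly 16 salt values (one hex character), and a full logical scan then fans out into 16 per-prefix scans. An illustrative sketch in plain Python (the key layout is an assumption, not the HBase client API):

```python
import hashlib

def salt(row_id: str) -> str:
    # First hex character of the md5 digest -> one of 16 possible values.
    return hashlib.md5(row_id.encode()).hexdigest()[0]

def salted(row_id: str) -> str:
    return salt(row_id) + "-" + row_id

def scan_ranges(start: str, stop: str) -> list[tuple[str, str]]:
    # A logical scan over [start, stop) becomes 16 scans, one per salt value.
    return [(f"{c}-{start}", f"{c}-{stop}") for c in "0123456789abcdef"]
```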
Alex Baranau 2012-12-18, 17:36
Michael Segel 2012-12-18, 23:29
lars hofhansl 2012-12-19, 18:37
Michael Segel 2012-12-19, 19:46
lars hofhansl 2012-12-19, 20:51
Michael Segel 2012-12-19, 21:02
David Arthur 2012-12-19, 21:26
Nick Dimiduk 2012-12-19, 22:15
Andrew Purtell 2012-12-19, 22:28
David Arthur 2012-12-19, 23:04
Alex Baranau 2012-12-19, 23:07
Michael Segel 2012-12-20, 01:09
Michael Segel 2012-12-20, 01:02
Jean-Marc Spaggiari 2012-12-20, 01:11
Michael Segel 2012-12-20, 01:23
Jean-Marc Spaggiari 2012-12-20, 01:35
Michel Segel 2012-12-20, 01:47
lars hofhansl 2012-12-20, 02:06
Michael Segel 2012-12-20, 13:20
Nick Dimiduk 2012-12-20, 18:15
Michael Segel 2012-12-20, 20:15
k8 robot 2013-02-06, 01:46