HBase >> mail # user >> Is it necessary to set MD5 on rowkey?


Michael Segel 2012-12-20, 01:09
Re: Is it necessary to set MD5 on rowkey?
Ok...

So you use a random byte or two at the front of the row.
How do you then use get() to find the row?
How do you do a partial scan()?

Do you start to see the problem?
The only way to get to the row is to do a full table scan. That kills HBase and you would be better off going with a partitioned Hive table.

Using a hash of the key or a portion of the hash is not a salt.
That's not what I have a problem with. Each time you want to fetch the key, you just hash it, truncate the hash and then prepend it to the key. You will then be able to use get().
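As a rough illustration of the hash, truncate and prepend step described above, here is a minimal Python sketch. The function names, the choice of MD5 and the 2-byte prefix width are assumptions made for the example, not something specified in this thread, and no HBase client call is shown.

```python
import hashlib
import os

PREFIX_LEN = 2  # assumed prefix width in bytes (hypothetical choice)


def hashed_rowkey(key: bytes, prefix_len: int = PREFIX_LEN) -> bytes:
    """Prepend the first prefix_len bytes of MD5(key) to the key.

    The prefix depends only on the key, so a reader can rebuild the
    full row key from the natural key alone and issue a single get().
    """
    return hashlib.md5(key).digest()[:prefix_len] + key


def random_salted_rowkey(key: bytes, prefix_len: int = PREFIX_LEN) -> bytes:
    """Prepend random bytes to the key.

    The prefix is not derivable from the key, so a reader cannot
    rebuild the stored row key without trying every possible prefix.
    """
    return os.urandom(prefix_len) + key


natural_key = b"http://example.com/index.html"
written = hashed_rowkey(natural_key)   # key used at write time
looked_up = hashed_rowkey(natural_key) # key recomputed at read time
```

Because `written` and `looked_up` are always identical, the row can be fetched directly, which is not the case for the random prefix.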

Using a salt would imply using some form of modulo math to get a round-robin prefix, or a random number generator.

That's the issue.

Does that make sense?
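For comparison, the key layout David describes in the quoted message below, `<2 bytes of hash(domain)><domain><path>`, can be sketched the same way. This is only an illustration under assumptions: MD5 as the hash, UTF-8 encoding, and the helper names (`domain_prefix`, `url_rowkey`, `prefix_stop`, `domain_scan_range`) are all hypothetical. It computes the start and stop keys a range scan over one domain would need, without calling any HBase API.

```python
import hashlib


def domain_prefix(domain: str) -> bytes:
    """First 2 bytes of MD5(domain), used as a deterministic prefix."""
    return hashlib.md5(domain.encode("utf-8")).digest()[:2]


def url_rowkey(domain: str, path: str) -> bytes:
    """Build <2 bytes of hash(domain)><domain><path>."""
    return domain_prefix(domain) + domain.encode("utf-8") + path.encode("utf-8")


def prefix_stop(prefix: bytes) -> bytes:
    """Smallest byte string greater than every key starting with prefix.

    Trailing 0xff bytes are dropped before incrementing the last byte.
    An empty result means there is no upper bound.
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def domain_scan_range(domain: str) -> tuple[bytes, bytes]:
    """Start (inclusive) and stop (exclusive) keys covering one domain."""
    start = domain_prefix(domain) + domain.encode("utf-8")
    return start, prefix_stop(start)


start, stop = domain_scan_range("example.com")
key = url_rowkey("example.com", "/docs/index.html")
in_range = start <= key < stop  # rows for this domain sort inside the range
```

Since the layout has no delimiter between domain and path, a plain prefix range like this would also cover any row whose domain merely begins with `example.com` and happens to share the same 2-byte prefix. Whether that matters depends on the data.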

On Dec 19, 2012, at 3:26 PM, David Arthur <[EMAIL PROTECTED]> wrote:

> Let's say you want to decompose a url into domain and path to include in your row key.
>
> You could of course just use the url as the key, but you will see hotspotting since most will start with "http". To mitigate this, you could add a random byte or two at the beginning (random salt) to improve distribution of keys, but you break single record Gets (and Scans arguably). Another approach is to use a hash-based salt: hash the whole key and use a few of those bytes as a salt. This fixes Gets but Scans are still not effective.
>
> One approach I've taken is to hash only a part of the key. Consider the following key structure
>
> <2 bytes of hash(domain)><domain><path>
>
> With this you get 16 bits for a hash-based salt. The salt is deterministic so Gets work fine, and for a single domain the salt is the same so you can easily do Scans across a domain. If you had some further structure to your key that you wished to scan across, you could do something like:
>
> <2 bytes of hash(domain)><domain><2 bytes of hash(path)><path>
>
> It really boils down to identifying your access patterns and read/write requirements and constructing a row key accordingly.
>
> HTH,
> David
>
> On 12/18/12 6:29 PM, Michael Segel wrote:
>> Alex,
>> And that's the point. A salt, as you explain it, is a number added to the key to ensure a better distribution, and that means you will have inefficiencies in terms of scans and gets.
>>
>> Using a hash as either the full key, or taking the hash, truncating it and appending the key may screw up scans, but your get() is intact.
>>
>> There are other options like inverting the numeric key ...
>>
>> And of course doing nothing.
>>
>> Using a salt as part of the design pattern is bad.
>>
>> With respect to the OP, I was discussing the use of hash and some alternatives to how to implement the hash of a key.
>> Again, doing nothing may also make sense, if you understand the risks and you know how your data is going to be used.
>>
>>
>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
>>
>>> Mike,
>>>
>>> Please read the *full post* before judging. In particular, the "Hash-based
>>> distribution" section. You can find the same in HBaseWD small README file
>>> [1] (not sure if you read it at all before commenting on the lib). Round
>>> robin is mainly for explaining the concept/idea (though not only for that).
>>>
>>> Thank you,
>>> Alex Baranau
>>> ------
>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>>> Solr
>>>
>>> [1] https://github.com/sematext/HBaseWD
>>>
>>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
>>> <[EMAIL PROTECTED]> wrote:
>>>
>>>> Quick answer...
>>>>
>>>> Look at the salt.
>>>> It's just a number from a round-robin counter.
>>>> There is no tie between the salt and row.
>>>>
>>>> So when you want to fetch a single row, how do you do it?
>>>> ...
>>>> ;-)
>>>>
>>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> @Mike:
>>>>>
>>>>> I'm the author of that post :).
>>>>>
>>>>> Quick reply to your last comment:
>>>>>
>>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad