
Re: Is it necessary to set MD5 on rowkey?

Maybe I'm missing something.
Why don't you walk me through an example of using a salt?
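
For readers following the thread, the two prefixing schemes debated in the quoted messages below can be sketched roughly as follows. This is only an illustrative Python sketch, not HBaseWD's actual code: the bucket count, the one-byte prefix layout, and every function name here are assumptions made up for the example.

```python
import hashlib
from itertools import count

NUM_BUCKETS = 4  # hypothetical, deliberately small number of prefixes

def hash_prefixed_key(key: bytes, buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix derived from the key itself, so a point get can recompute it."""
    bucket = hashlib.md5(key).digest()[0] % buckets
    return bytes([bucket]) + key

_counter = count()  # round-robin state, unrelated to the key's content

def round_robin_key(key: bytes, buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix taken from a counter: spreads writes, but is not tied to the key."""
    bucket = next(_counter) % buckets
    return bytes([bucket]) + key

def candidate_keys(key: bytes, buckets: int = NUM_BUCKETS) -> list:
    """With a round-robin prefix, a single-row fetch has to try every bucket."""
    return [bytes([b]) + key for b in range(buckets)]

def scan_ranges(start: bytes, stop: bytes, buckets: int = NUM_BUCKETS) -> list:
    """A range scan over original keys becomes one (start, stop) scan per bucket."""
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(buckets)]
```

In this sketch the hash-derived prefix keeps a get() to a single lookup, the round-robin prefix turns it into up to NUM_BUCKETS lookups, and either prefix turns one range scan into NUM_BUCKETS scans that could be run in parallel.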
On Dec 19, 2012, at 12:37 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> I would disagree here.
> It depends on what you are doing and blanket statements about "this is very, very bad" typically do not help.
> Salting (even round robin) is very nice to distribute write load *and* it gives you a natural way to parallelize scans assuming scans are of reasonable size.
> If the typical use case is point gets then hashing or inverting keys would be preferable. As usual: It depends.
> -- Lars
> ________________________________
> From: Michael Segel <[EMAIL PROTECTED]>
> Sent: Tuesday, December 18, 2012 3:29 PM
> Subject: Re: Is it necessary to set MD5 on rowkey?
> Alex,
> And that's the point. Salting, as you explain it, means that the number you add to the key to ensure a better distribution introduces inefficiencies in both scans and gets.
> Using a hash as either the full key, or taking the hash, truncating it and appending the key may screw up scans, but your get() is intact.
> There are other options like inverting the numeric key ...
> And of course doing nothing.
> Using a salt as part of the design pattern is bad.
> With respect to the OP, I was discussing the use of hash and some alternatives to how to implement the hash of a key.
> Again, doing nothing may also make sense, if you understand the risks and you know how your data is going to be used.
> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
>> Mike,
>> Please read the *full post* before judging, in particular the "Hash-based
>> distribution" section. You can find the same in the small HBaseWD README file
>> [1] (not sure if you read it at all before commenting on the lib). Round
>> robin is mainly for explaining the concept/idea (though not only for that).
>> Thank you,
>> Alex Baranau
>> ------
>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>> Solr
>> [1] https://github.com/sematext/HBaseWD
>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>>> Quick answer...
>>> Look at the salt.
>>> It's just a number from a round robin counter.
>>> There is no tie between the salt and the row.
>>> So when you want to fetch a single row, how do you do it?
>>> ...
>>> ;-)
>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
>>>> Hello,
>>>> @Mike:
>>>> I'm the author of that post :).
>>>> Quick reply to your last comment:
>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
>>>> idea" in a more specific way than "Fetching data takes more effort"? That
>>>> would be helpful for anyone who is looking into using this approach.
>>>> 2) The approach described in the post also says you can prefix with the
>>>> hash, you probably missed that.
>>>> 3) I believe your answer, "use MD5 or SHA-1" doesn't help bigdata guy.
>>>> Please re-read the question: the intention is to distribute the load while
>>>> still being able to do "partial key scans". The blog post linked above
>>>> explains one possible solution for that, while your answer doesn't.
>>>> @bigdata:
>>>> Basically, when it comes to solving two issues, distributing writes and
>>>> having the ability to read data sequentially, you have to strike a balance
>>>> between being good at each of them. See the very good presentation by Lars:
>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012
>>>> (slide 22). You will see how the two are correlated. In short:
>>>> * having an md5/other hash prefix on the key does better w.r.t. distributing
>>>> writes, while compromising the ability to do range scans efficiently
>>>> * having a very limited number of 'salt' prefixes still allows you to do range