HBase >> mail # user >> Is it necessary to set MD5 on rowkey?


Re: Is it necessary to set MD5 on rowkey?
Ok,

Maybe I'm missing something.
Why don't you walk me through the use of a salt example.
On Dec 19, 2012, at 12:37 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> I would disagree here.
> It depends on what you are doing and blanket statements about "this is very, very bad" typically do not help.
>
> Salting (even round robin) is very nice to distribute write load *and* it gives you a natural way to parallelize scans assuming scans are of reasonable size.
>
> If the typical use case is point gets then hashing or inverting keys would be preferable. As usual: It depends.
>
> -- Lars
>
>
>
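To make the scan argument above concrete, here is a minimal sketch of how a small, fixed set of salt prefixes turns one range scan into one scan per bucket. It is an illustration only: the sorted Python list stands in for an HBase table, and `NUM_BUCKETS`, `salted`, `write_all`, `scan_bucket` and `scan_range` are hypothetical names, not HBase or HBaseWD APIs. The per-bucket scans run sequentially here; against a real cluster they could be issued in parallel.

```python
import bisect
import heapq
from itertools import count

NUM_BUCKETS = 4  # assumption: a small, fixed number of salt prefixes


def salted(bucket: int, key: str) -> str:
    """Prefix the original key with a two-digit bucket number."""
    return f"{bucket:02d}|{key}"


def write_all(keys):
    """Assign buckets round robin; return the 'table' as a sorted key list."""
    counter = count()
    return sorted(salted(next(counter) % NUM_BUCKETS, k) for k in keys)


def scan_bucket(table, bucket, start, stop):
    """Range scan [start, stop) inside one bucket, yielding original keys."""
    lo = bisect.bisect_left(table, salted(bucket, start))
    hi = bisect.bisect_left(table, salted(bucket, stop))
    for row in table[lo:hi]:
        yield row.split("|", 1)[1]


def scan_range(table, start, stop):
    """Fan out one scan per bucket and merge the sorted partial results."""
    parts = [scan_bucket(table, b, start, stop) for b in range(NUM_BUCKETS)]
    return list(heapq.merge(*parts))


# Sequential keys, which would otherwise all hit the same region.
keys = [f"event-{i:04d}" for i in range(20)]
table = write_all(keys)
result = scan_range(table, "event-0005", "event-0010")
print(result)
```

Consecutive writes land in different buckets, while a range scan costs `NUM_BUCKETS` short scans instead of one, which is why this only pays off when the number of prefixes stays small.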
> ________________________________
> From: Michael Segel <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Tuesday, December 18, 2012 3:29 PM
> Subject: Re: Is it necessary to set MD5 on rowkey?
>
> Alex,
> And that's the point. A salt, as you explain it, is a number added to the key to ensure a better distribution, and that conceptually implies inefficiencies in both scans and gets.
>
> Using a hash as the full key, or truncating the hash and appending the key to it, may screw up scans, but your get() is intact.
>
> There are other options like inverting the numeric key ...
>
> And of course doing nothing.
>
> Using a salt as part of the design pattern is bad.
>
> With respect to the OP, I was discussing the use of hash and some alternatives to how to implement the hash of a key.
> Again, doing nothing may also make sense, if you understand the risks and you know how your data is going to be used.
>
>
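The hash variant described above (truncate the hash, then append the key) can be sketched as follows. This is an illustration under stated assumptions: a Python dict stands in for the table, and `PREFIX_LEN`, `hashed_key`, `put` and `get` are hypothetical names chosen for the example, not HBase APIs.

```python
import hashlib

PREFIX_LEN = 4  # assumption: number of hex characters kept from the hash

table = {}  # stands in for the HBase table: row key -> value


def hashed_key(key: str) -> str:
    """Prefix the key with a truncated MD5 of the key itself."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:PREFIX_LEN]}|{key}"


def put(key, value):
    table[hashed_key(key)] = value


def get(key):
    # The prefix is recomputed from the key, so a single lookup is enough.
    return table.get(hashed_key(key))


for i in range(5):
    put(f"user-{i}", f"value-{i}")

# Order in which a full scan would return the rows, shown as original keys.
stored_order = [row.split("|", 1)[1] for row in sorted(table)]
print(get("user-3"))
print(stored_order)
```

Because the prefix depends only on the key, `get` stays a single lookup. The stored order follows the hash rather than the original key, so it will generally not match the original key order, which is the cost to range scans mentioned above.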
> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
>
>> Mike,
>>
>> Please read the *full post* before judging, in particular the "Hash-based
>> distribution" section. You can find the same in the small HBaseWD README file
>> [1] (not sure if you read it at all before commenting on the lib). Round
>> robin is mainly for explaining the concept/idea (though not only for that).
>>
>> Thank you,
>> Alex Baranau
>> ------
>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>> Solr
>>
>> [1] https://github.com/sematext/HBaseWD
>>
>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
>> <[EMAIL PROTECTED]> wrote:
>>
>>> Quick answer...
>>>
>>> Look at the salt.
>>> It's just a number from a round robin counter.
>>> There is no tie between the salt and the row.
>>>
>>> So when you want to fetch a single row, how do you do it?
>>> ...
>>> ;-)
>>>
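The question above can be made concrete with a small sketch. With a round-robin salt the reader cannot derive the bucket from the key, so a point get has to probe the buckets one by one. A Python dict stands in for the table, and `NUM_BUCKETS`, `put` and `get` are hypothetical names for this example, not HBase APIs.

```python
from itertools import count

NUM_BUCKETS = 4  # assumption: number of distinct salt values
_counter = count()
table = {}  # stands in for the HBase table: row key -> value


def put(key, value):
    """Write with a round-robin salt that is unrelated to the key."""
    bucket = next(_counter) % NUM_BUCKETS
    table[f"{bucket:02d}|{key}"] = value


def get(key):
    """Probe each bucket until the row turns up; return (value, probes)."""
    for probes, bucket in enumerate(range(NUM_BUCKETS), start=1):
        row = f"{bucket:02d}|{key}"
        if row in table:
            return table[row], probes
    return None, NUM_BUCKETS


for i in range(8):
    put(f"user-{i}", f"value-{i}")

print(get("user-3"))   # ('value-3', 4): found only in the last bucket
print(get("user-99"))  # (None, 4): a miss always costs every probe
```

A hit costs up to `NUM_BUCKETS` lookups and a miss always costs all of them. Rewriting the same key can also leave copies in different buckets, since nothing ties the salt to the row.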
>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> @Mike:
>>>>
>>>> I'm the author of that post :).
>>>>
>>>> Quick reply to your last comment:
>>>>
>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
>>>> idea" in a more specific way than "Fetching data takes more effort"? It would
>>>> be helpful for anyone who is looking into using this approach.
>>>>
>>>> 2) The approach described in the post also says you can prefix with the
>>>> hash, you probably missed that.
>>>>
>>>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help bigdata.
>>>> Please re-read the question: the intention is to distribute the load while
>>>> still being able to do "partial key scans". The blog post linked above
>>>> explains one possible solution for that, while your answer doesn't.
>>>>
>>>> @bigdata:
>>>>
>>>> Basically, when it comes to solving two issues, distributing writes and
>>>> having the ability to read data sequentially, you have to balance between
>>>> being good at both of them. Very good presentation by Lars, slide 22:
>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012
>>>> You will see how this is correlated. In short:
>>>> * having an md5/other hash prefix of the key does better w.r.t. distributing
>>>> writes, while compromising the ability to do range scans efficiently
>>>> * having very limited number of 'salt' prefixes still allows to do range