HBase, mail # user - Is it necessary to set MD5 on rowkey?


Re: Is it necessary to set MD5 on rowkey?
lars hofhansl 2012-12-19, 20:51
Doesn't Alex's blog post do that?
________________________________
 From: Michael Segel <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
Sent: Wednesday, December 19, 2012 11:46 AM
Subject: Re: Is it necessary to set MD5 on rowkey?
 
Ok,

Maybe I'm missing something.
Why don't you walk me through an example of using a salt?
On Dec 19, 2012, at 12:37 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> I would disagree here.
> It depends on what you are doing, and blanket statements such as "this is very, very bad" typically do not help.
>
> Salting (even round robin) is a very nice way to distribute write load, *and* it gives you a natural way to parallelize scans, assuming the scans are of reasonable size.
>
> If the typical use case is point gets, then hashing or inverting keys would be preferable. As usual: it depends.
>
> -- Lars
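
The two properties claimed for salting above, spreading writes and letting a scan be split per bucket, can be sketched as follows. This is a minimal in-memory illustration in Python, not HBase client code: the dict stands in for a table, and `NUM_BUCKETS`, `salted_key`, `scan_bucket`, and `scan` are hypothetical names chosen for the example. The per-bucket scans run sequentially here; in a real client each could run on its own thread.

```python
import heapq
from itertools import count

NUM_BUCKETS = 4     # hypothetical bucket count
_counter = count()  # round-robin counter shared by all writers in this sketch


def salted_key(key: str) -> str:
    """Prefix the key with a round-robin salt."""
    salt = next(_counter) % NUM_BUCKETS
    return f"{salt}|{key}"


def scan_bucket(table: dict, salt: int, start: str, stop: str):
    """Range scan inside one bucket; yields (original_key, value) in key order."""
    lo, hi = f"{salt}|{start}", f"{salt}|{stop}"
    for row in sorted(k for k in table if lo <= k < hi):
        yield row.split("|", 1)[1], table[row]


def scan(table: dict, start: str, stop: str):
    """Run one scan per bucket and merge the already-sorted results."""
    per_bucket = [scan_bucket(table, s, start, stop) for s in range(NUM_BUCKETS)]
    return list(heapq.merge(*per_bucket))


table = {}
for i in range(8):
    table[salted_key(f"event-{i:03d}")] = i  # sequential keys land in different buckets

rows = scan(table, "event-002", "event-006")
```

Each bucket returns its rows in key order, so merging the per-bucket results restores the global order. The cost is that every range scan becomes `NUM_BUCKETS` scans.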
>
>
>
> ________________________________
> From: Michael Segel <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Tuesday, December 18, 2012 3:29 PM
> Subject: Re: Is it necessary to set MD5 on rowkey?
>
> Alex,
> And that's the point. A salt, as you explain it conceptually, is a number added to the key to ensure a better distribution, which means you will have inefficiencies in both scans and gets.
>
> Using a hash as the full key, or truncating the hash and appending the key to it, may screw up scans, but your get() stays intact.
>
> There are other options like inverting the numeric key ...
>
> And of course doing nothing.
>
> Using a salt as part of the design pattern is bad.
>
> With respect to the OP, I was discussing the use of a hash and some alternative ways to implement the hash of a key.
> Again, doing nothing may also make sense, if you understand the risks and you know how your data is going to be used.
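
The hash-based alternatives named above can be sketched as follows. This is a minimal in-memory illustration in Python, not HBase client code: `PREFIX_LEN`, `hashed_key`, and `reversed_key` are hypothetical names, the dict stands in for a table, and reversing the zero-padded digits is only one possible reading of "inverting the numeric key".

```python
import hashlib

PREFIX_LEN = 4  # hypothetical: number of hex characters kept from the digest


def hashed_key(key: str) -> str:
    """Truncated MD5 of the key, followed by the key itself."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:PREFIX_LEN]}|{key}"


def reversed_key(numeric_id: int, width: int = 10) -> str:
    """Zero-pad and reverse the digits so the fastest-changing digit leads."""
    return str(numeric_id).zfill(width)[::-1]


table = {}
for user in ("user-1001", "user-1002", "user-1003"):
    table[hashed_key(user)] = {"name": user}

# A get needs only the original key: the prefix is recomputed, not looked up.
row = table[hashed_key("user-1002")]
```

Because the prefix is a pure function of the key, a point get stays a single lookup. Adjacent original keys are no longer adjacent in the table, which is why range scans suffer.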
>
>
> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
>
>> Mike,
>>
>> Please read the *full post* before judging, in particular the "Hash-based
>> distribution" section. You can find the same in the small HBaseWD README
>> file [1] (not sure if you read it at all before commenting on the lib).
>> Round robin is mainly for explaining the concept/idea (though not only for
>> that).
>>
>> Thank you,
>> Alex Baranau
>> ------
>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>> Solr
>>
>> [1] https://github.com/sematext/HBaseWD
>>
>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
>> <[EMAIL PROTECTED]> wrote:
>>
>>> Quick answer...
>>>
>>> Look at the salt.
>>> It's just a number from a round robin counter.
>>> There is no tie between the salt and the row.
>>>
>>> So when you want to fetch a single row, how do you do it?
>>> ...
>>> ;-)
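
The question above can be made concrete with a minimal in-memory sketch in Python. It is not HBase client code: the dict stands in for a table, and `NUM_BUCKETS`, `put`, and `get` are hypothetical names chosen for the example. Because the salt comes from a counter and not from the key, a reader cannot recompute it and has to try every bucket.

```python
from itertools import count

NUM_BUCKETS = 4     # hypothetical bucket count
_counter = count()  # round-robin counter; unrelated to the row being written


def put(table: dict, key: str, value) -> None:
    """Store the row under a salt taken from the round-robin counter."""
    salt = next(_counter) % NUM_BUCKETS
    table[f"{salt}|{key}"] = value


def get(table: dict, key: str):
    """Probe bucket after bucket, since the salt cannot be derived from the key."""
    probes = 0
    for salt in range(NUM_BUCKETS):
        probes += 1
        candidate = f"{salt}|{key}"
        if candidate in table:
            return table[candidate], probes
    return None, probes


table = {}
for name in ("alpha", "bravo", "charlie", "delta"):
    put(table, name, name.upper())

value, probes = get(table, "delta")
missing = get(table, "echo")
```

In the worst case, and always for a key that does not exist, a single-row fetch costs `NUM_BUCKETS` lookups instead of one.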
>>>
>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> @Mike:
>>>>
>>>> I'm the author of that post :).
>>>>
>>>> Quick reply to your last comment:
>>>>
>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
>>>> idea" in a more specific way than "Fetching data takes more effort"? That
>>>> would be helpful for anyone who is looking into using this approach.
>>>>
>>>> 2) The approach described in the post also says you can prefix with the
>>>> hash; you probably missed that.
>>>>
>>>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help bigdata.
>>>> Please re-read the question: the intention is to distribute the load
>>>> while still being able to do "partial key scans". The blog post linked
>>>> above explains one possible solution for that, while your answer doesn't.
>>>>
>>>> @bigdata:
>>>>
>>>> Basically, when it comes to solving two issues, distributing writes and
>>>> having the ability to read data sequentially, you have to balance between
>>>> being good at both of them. There is a very good presentation by Lars:
>>>>
>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012