HBase, mail # user - Is it necessary to set MD5 on rowkey?


Re: Is it necessary to set MD5 on rowkey?
Michael Segel 2012-12-19, 21:02
I think you missed the point.
You seem to think that salting is ok.
I want you to walk through an example so that we can discuss it. ;-)
On Dec 19, 2012, at 2:51 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Doesn't Alex's blog post do that?
>
>
>
>
> ________________________________
> From: Michael Segel <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> Sent: Wednesday, December 19, 2012 11:46 AM
> Subject: Re: Is it necessary to set MD5 on rowkey?
>
> Ok,
>
> Maybe I'm missing something.
> Why don't you walk me through an example of using a salt?
>
>
> On Dec 19, 2012, at 12:37 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
>> I would disagree here.
>> It depends on what you are doing, and blanket statements about "this is very, very bad" typically do not help.
>>
>> Salting (even round robin) is very nice for distributing write load, *and* it gives you a natural way to parallelize scans, assuming the scans are of reasonable size.
>>
>> If the typical use case is point gets then hashing or inverting keys would be preferable. As usual: It depends.
>>
>> -- Lars
>>
>>
>>
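
The scan-parallelization point above can be illustrated with a minimal sketch. It uses a sorted in-memory list as a stand-in for an HBase table; the bucket count, the `NN|key` layout, and the function names are assumptions made for this illustration, not anything taken from HBase or HBaseWD.

```python
import heapq
from bisect import bisect_left

NUM_BUCKETS = 4  # assumed bucket count for this illustration


def salted(bucket: int, key: str) -> str:
    # Hypothetical layout: two-digit bucket prefix, separator, original key.
    return f"{bucket:02d}|{key}"


def scan_bucket(table, bucket, start, stop):
    """Range scan [start, stop) inside one bucket of a sorted list of salted keys."""
    lo = bisect_left(table, salted(bucket, start))
    hi = bisect_left(table, salted(bucket, stop))
    # Strip the prefix so callers see the original keys.
    return [row.split("|", 1)[1] for row in table[lo:hi]]


def scan_all(table, start, stop):
    """Issue one scan per bucket (independent, so they could run in parallel)
    and merge the already-sorted per-bucket results back into key order."""
    per_bucket = [scan_bucket(table, b, start, stop) for b in range(NUM_BUCKETS)]
    return list(heapq.merge(*per_bucket))


# Sequential keys written with a round-robin salt.
keys = [f"event{n:04d}" for n in range(20)]
table = sorted(salted(i % NUM_BUCKETS, k) for i, k in enumerate(keys))
result = scan_all(table, "event0005", "event0010")
```

A logical range scan turns into `NUM_BUCKETS` smaller scans, which is the source of both the parallelism and the extra client-side merge work.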
>> ________________________________
>> From: Michael Segel <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]
>> Sent: Tuesday, December 18, 2012 3:29 PM
>> Subject: Re: Is it necessary to set MD5 on rowkey?
>>
>> Alex,
>> And that's the point. A salt, as you explain it, is a number added to the key to ensure a better distribution, and that means you will have inefficiencies in terms of both scans and gets.
>>
>> Using a hash as either the full key, or taking the hash, truncating it and appending the key may screw up scans, but your get() is intact.
>>
>> There are other options like inverting the numeric key ...
>>
>> And of course doing nothing.
>>
>> Using a salt as part of the design pattern is bad.
>>
>> With respect to the OP, I was discussing the use of hash and some alternatives to how to implement the hash of a key.
>> Again, doing nothing may also make sense, if you understand the risks and you know how your data is going to be used.
>>
>>
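
The "truncated hash plus key" option described above can be sketched as follows. A plain dict stands in for the table, and `PREFIX_LEN`, `prefixed_key`, `put`, and `get` are hypothetical names chosen for this illustration. The point is only that the prefix is recomputable from the key, so a get remains a single lookup.

```python
import hashlib

PREFIX_LEN = 2  # assumed: keep the first two bytes of the MD5 digest


def prefixed_key(key: bytes) -> bytes:
    """Truncated hash of the key, followed by the original key."""
    return hashlib.md5(key).digest()[:PREFIX_LEN] + key


store = {}  # stand-in for the table: row key -> value


def put(key: bytes, value: str) -> None:
    store[prefixed_key(key)] = value


def get(key: bytes):
    # The prefix is derived from the key itself, so no probing is needed.
    return store.get(prefixed_key(key))


put(b"user123", "profile data")
found = get(b"user123")
```

Because adjacent original keys receive unrelated prefixes, a range scan over the original key order is no longer a single contiguous scan, which is the scan cost mentioned above.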
>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[EMAIL PROTECTED]> wrote:
>>
>>> Mike,
>>>
>>> Please read the *full post* before judging, in particular the "Hash-based
>>> distribution" section. You can find the same in the small HBaseWD README file
>>> [1] (not sure if you read it at all before commenting on the lib). Round
>>> robin is mainly for explaining the concept/idea (though not only for that).
>>>
>>> Thank you,
>>> Alex Baranau
>>> ------
>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>>> Solr
>>>
>>> [1] https://github.com/sematext/HBaseWD
>>>
>>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
>>> <[EMAIL PROTECTED]> wrote:
>>>
>>>> Quick answer...
>>>>
>>>> Look at the salt.
>>>> It's just a number from a round robin counter.
>>>> There is no tie between the salt and row.
>>>>
>>>> So when you want to fetch a single row, how do you do it?
>>>> ...
>>>> ;-)
>>>>
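
The single-row fetch question above can be made concrete with a small sketch, assuming a round-robin salt and a dict keyed by `(salt, key)` as a stand-in for the table. All names and the bucket count are hypothetical. Since the salt cannot be derived from the key, a reader has to try each bucket in turn.

```python
from itertools import count

NUM_BUCKETS = 4        # assumed bucket count
_counter = count()     # round-robin counter, unrelated to the row contents
store = {}             # stand-in for the table: (salt, key) -> value


def put(key: str, value: int) -> None:
    salt = next(_counter) % NUM_BUCKETS
    store[(salt, key)] = value


def get(key: str):
    """Return (value, probes). The salt is unknown, so probe every bucket."""
    probes = 0
    for salt in range(NUM_BUCKETS):
        probes += 1
        if (salt, key) in store:
            return store[(salt, key)], probes
    return None, probes


for i, name in enumerate(["a", "b", "c", "d"]):
    put(name, i)

value_d, probes_d = get("d")       # "d" landed in the last bucket
value_miss, probes_miss = get("zz")
```

In this sketch a miss always costs `NUM_BUCKETS` probes, and writing the same key twice can leave copies in two different buckets, since nothing ties the salt to the row.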
>>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> @Mike:
>>>>>
>>>>> I'm the author of that post :).
>>>>>
>>>>> Quick reply to your last comment:
>>>>>
>>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
>>>>> idea" in a more specific way than "Fetching data takes more effort"? It would
>>>>> be helpful for anyone who is looking into using this approach.
>>>>>
>>>>> 2) The approach described in the post also says you can prefix with the
>>>>> hash, you probably missed that.
>>>>>
>>>>> 3) I believe your answer, "use MD5 or SHA-1" doesn't help bigdata guy.
>>>>> Please re-read the question: the intention is to distribute the load while
>>>>> still being able to do "partial key scans". The blog post linked above
>>>>> explains one possible solution for that, while your answer doesn't.
>>>>>
>>>>