HBase, mail # user - Is it necessary to set MD5 on rowkey?


Re: Is it necessary to set MD5 on rowkey?
Michael Segel 2012-12-20, 20:15
Nick,

Yes, there is an implied definition of the term 'salting' which those with a CS or Software Engineering background will take to heart.
However, it goes beyond this definition.

Per Lars and Alex, they are talking about bucketing the data. Again, this is not a good idea.
As you point out, even using a modulo function (e.g. the % operator, 45 % 10 = 5), you are still creating a certain nondeterministic value in the key.

While this buckets the results into 10 buckets, you can no longer do a single partial scan or a single get() to find that row.

This is what is implied in 'salting'.
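
To make the cost concrete, here is a minimal sketch of the salted read path described above. It is not from the original message: the bucket count, the `NN|key` prefix format, and the function names are all assumptions chosen for illustration, and the sketch only builds key strings rather than talking to HBase.

```python
import random

NUM_BUCKETS = 10  # hypothetical bucket count, matching the "10 buckets" above


def salted_key(key, rng):
    """Write path: prefix the key with a randomly chosen bucket number."""
    bucket = rng.randrange(NUM_BUCKETS)
    return f"{bucket:02d}|{key}"


def candidate_keys(key):
    """Read path: the bucket is unknown, so every prefix has to be tried."""
    return [f"{bucket:02d}|{key}" for bucket in range(NUM_BUCKETS)]


rng = random.Random(42)
stored = salted_key("user123", rng)
probes = candidate_keys("user123")

print(len(probes))       # 10: one get() per bucket
print(stored in probes)  # True: the row is found only by probing every prefix
```

Under these assumptions, every lookup of a known key turns into `NUM_BUCKETS` separate gets, because nothing in the original key tells the reader which prefix was used at write time.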

Now this is much different from taking the hash, truncating it, and then appending the key (hash(key).toString.substring[1,2] + key). [OK, not code, but you get the idea.]
Using this, I can still use a single get() to fetch the row if I know the key. Again, with salting you can't do that.
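
The truncated-hash idea can be sketched roughly as follows. This is an illustration only, not code from the thread: the choice of SHA-1 (MD5 would behave the same way here), the two-hex-character prefix, and the name `hashed_key` are all assumptions.

```python
import hashlib

PREFIX_HEX_CHARS = 2  # hypothetical: keep one byte (two hex characters) of the hash


def hashed_key(key):
    """Prefix the key with a truncated hash of the key itself."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return digest[:PREFIX_HEX_CHARS] + key


write_key = hashed_key("user123")  # computed when the row is written
read_key = hashed_key("user123")   # recomputed later from the key alone

print(write_key == read_key)  # True: the same key always yields the same rowkey
```

Because the prefix depends only on the key, a reader who knows the key can rebuild the full rowkey and issue one get(), while the prefix still spreads writes across the key space.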

What I find troublesome is that there are better design patterns which solve the same problem.

HTH

-Mike

On Dec 20, 2012, at 12:15 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> I think there's some hair-splitting going on here. The term "salting," by
> strict definition [0] from the cryptographic context, means the
> introduction of randomness to produce a one-way encoding of a value. The
> technique for rowkey design described here does not include the
> introduction of said randomness, nor is it a one-way operation. Instead, it
> divides the input more or less evenly across a fixed number of buckets.
>
> If you've introduced one-way randomness, you have no way to reproduce the
> rowkey after it's been inserted. Thus, accessing a specific row would be
> O(n) where n is the number of rows in the table -- you'd have to scan the
> table looking for the desired row. Use of a bucketing technique, on the
> other hand, adds a small amount of computation to calculate the true
> rowkey, making access to a single row an O(1) + C operation vs. the number of rows in
> the table.
>
> -n
>
> [0]: http://en.wikipedia.org/wiki/Salt_(cryptography)
>
> On Thu, Dec 20, 2012 at 5:20 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
>> Lars,
>>
>> Ok... he's talking about buckets.
>>
>> So when you have N buckets, what is the least number of get()s you need
>> to fetch the single row?
>> (Hint: The answer is N)
>>
>> How many scans? (N again)
>>
>> Do you disagree?
>>
>>
>> On Dec 19, 2012, at 8:06 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>
>>> Mike, please think about what you write before you write it.
>>> You will most definitely not need a full table scan (much less a *FULL*
>> *TABLE* *SCAN* ;-) ).
>>>
>>> Read Alex's blog post again, it's a good post (IMHO). He is talking
>> about buckets.
>>>
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>> From: Michael Segel <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]
>>> Sent: Wednesday, December 19, 2012 5:23 PM
>>> Subject: Re: Is it necessary to set MD5 on rowkey?
>>>
>>> Ok,
>>>
>>> Let's try this one more time...
>>>
>>> If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order to
>> retrieve the row.
>>> If you do something like a salt that uses only a preset of N
>> combinations, you will have to do N get()s in order to fetch the row.
>>>
>>> This is bad. VERY BAD.
>>>
>>> If you hash the row, you will get a consistent value each time you hash
>> the key.  If you use SHA-1, the odds of a collision are mathematically
>> possible, however highly improbable. So people have recommended that they
>> append the key to the hash to form the new key. Here, you might as well
>> truncate the hash to just the most significant byte or two and then append
>> the key. This will give you enough of an even distribution that you can
>> avoid hot spotting.
>>>
>>> So if I use the hash, I can effectively still get the row of data back
>> with a single get(). Otherwise it's a full table scan.
>>>
>>> Do you see the difference?
>>>
>>>
>>> On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari <