HBase, mail # user - Is it necessary to set MD5 on rowkey?

Re: Is it necessary to set MD5 on rowkey?
Michael Segel 2012-12-20, 20:15

Yes, there is an implied definition of the term 'salting' which those with a CS or software engineering background will take to heart.
However, it goes beyond this definition.

Lars and Alex are talking about bucketing the data. Again, this is not a good idea.
As you point out, even using a modulo function (e.g. the % operator: 45 % 10 = 5), you are still creating a certain nondeterministic value in the key.

While this splits the results into 10 buckets, you can no longer do a single partial scan or a single get() to find that row.

This is what is implied by 'salting'.
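
To make the cost concrete, here is a minimal Python sketch of salting in the sense described above, where the bucket prefix is chosen independently of the key. The bucket count, the "bucket-key" layout, and the function names are assumptions for illustration only; none of this is HBase API.

```python
import random

NUM_BUCKETS = 10  # assumed bucket count, matching the "10 buckets" example


def salted_rowkey(key, rng):
    """Write path: prefix the key with a bucket chosen independently of the key."""
    bucket = rng.randrange(NUM_BUCKETS)
    return f"{bucket}-{key}"


def candidate_rowkeys(key):
    """Read path: the bucket cannot be recomputed, so every prefix is a candidate."""
    return [f"{b}-{key}" for b in range(NUM_BUCKETS)]


rng = random.Random(0)  # seeded only so the sketch is repeatable
stored = salted_rowkey("user123", rng)
probes = candidate_rowkeys("user123")
```

Under these assumptions the reader has to issue one get() per entry in `probes` (N in total) to be sure of finding `stored`.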

Now this is much different from taking the hash, truncating it, and then appending the key: hash(key).toString.substring[1,2] + key. [OK, not code, but you get the idea.]
Using this, I can still use a single get() to fetch the row if I know the key. Again, with salting you can't do that.
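
By contrast, the hash-prefix idea can be sketched roughly as follows. This is only an illustration in Python: MD5, the one-byte prefix, and the function name are assumptions, and the truncated hash is placed in front of the original key as in the pseudocode above.

```python
import hashlib

PREFIX_BYTES = 1  # assumption: keep only the first byte of the hash


def hashed_rowkey(key):
    """Build a rowkey as <truncated hash of key> + <key>.

    The prefix depends only on the key, so a reader who knows the key
    can rebuild the exact rowkey and fetch it with one lookup.
    """
    prefix = hashlib.md5(key).digest()[:PREFIX_BYTES]
    return prefix + key


rowkey_at_write = hashed_rowkey(b"user123")
rowkey_at_read = hashed_rowkey(b"user123")  # recomputed later from the key alone
```

Because `rowkey_at_read` equals `rowkey_at_write`, a single get() on the recomputed rowkey is enough, while the hash prefix still spreads different keys across the keyspace.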

What I find troublesome is that there are better design patterns which solve the same problem.



On Dec 20, 2012, at 12:15 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> I think there's some hair-splitting going on here. The term "salting," by
> strict definition [0] from the cryptographic context, means the
> introduction of randomness to produce a one-way encoding of a value. The
> technique for rowkey design described here does not include the
> introduction of said randomness, nor is it a one-way operation. Instead, it
> divides the input more or less evenly across a fixed number of buckets.
> If you've introduced one-way randomness, you have no way to reproduce the
> rowkey after it's been inserted. Thus, accessing a specific row would be
> O(n) where n is the number of rows in the table -- you'd have to scan the
> table looking for the desired row. Use of a bucketing technique, on the
> other hand, adds a small amount of computation to calculate the true
> rowkey, making access to a single row a O(1) + C vs the number of rows in
> the table.
> -n
> [0]: http://en.wikipedia.org/wiki/Salt_(cryptography)
> On Thu, Dec 20, 2012 at 5:20 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>> Lars,
>> Ok... he's talking about buckets.
>> So when you have N buckets, what is the least number of get()s you need
>> to fetch the single row?
>> (Hint: The answer is N)
>> How many scans? (N again)
>> Do you disagree?
>> On Dec 19, 2012, at 8:06 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>> Mike, please think about what you write before you write it.
>>> You will most definitely not need a full table scan (much less a *FULL*
>>> *TABLE* *SCAN* ;-) ).
>>> Read Alex's blog post again, it's a good post (IMHO). He is talking
>>> about buckets.
>>> -- Lars
>>> ________________________________
>>> From: Michael Segel <[EMAIL PROTECTED]>
>>> Sent: Wednesday, December 19, 2012 5:23 PM
>>> Subject: Re: Is it necessary to set MD5 on rowkey?
>>> Ok,
>>> Let's try this one more time...
>>> If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order to
>>> retrieve the row.
>>> If you do something like a salt that uses only a preset of N
>>> combinations, you will have to do N get()s in order to fetch the row.
>>> This is bad. VERY BAD.
>>> If you hash the row, you will get a consistent value each time you hash
>>> the key. If you use SHA-1, a collision is mathematically possible,
>>> though highly improbable. So people have recommended appending the
>>> key to the hash to form the new key. Here, you might as well
>>> truncate the hash to just the most significant byte or two and then
>>> append the key. This will give you enough of an even distribution that
>>> you can avoid hot spotting.
>>> So if I use the hash, I can effectively still get the row of data back
>>> with a single get(). Otherwise it's a full table scan.
>>> Do you see the difference?
>>> On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari <