HBase >> mail # user >> Is it necessary to set MD5 on rowkey?


Re: Is it necessary to set MD5 on rowkey?
Nick,

Yes, there is an implied definition of the term 'salting', which those with a CS or software engineering background will take to heart.
However, it goes beyond this definition.

Lars and Alex are talking about bucketing the data. Again, this is not a good idea.
As you point out, even using a modulo function (e.g. the % operator, 45 % 10 = 5), you are still creating a certain nondeterministic value in the key.

While this splits the results into 10 buckets, you can no longer do a single partial scan or a single get() to find that row.

This is what is implied in 'salting'.
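
To make that cost concrete, here is a minimal sketch in Python. It assumes the bucket prefix is picked at random at write time, which is the case I am objecting to. The names NUM_BUCKETS, salted_key and candidate_keys are made up for illustration and are not part of any HBase API.

```python
import random
from typing import List

NUM_BUCKETS = 10  # hypothetical bucket count, matching the 10 buckets above


def salted_key(key: str, rng: random.Random) -> str:
    """Write path: prefix the key with a randomly chosen bucket number."""
    bucket = rng.randrange(NUM_BUCKETS)
    return f"{bucket}-{key}"


def candidate_keys(key: str) -> List[str]:
    """Read path: the bucket cannot be recomputed from the key,
    so every bucket has to be tried (one get() per candidate)."""
    return [f"{b}-{key}" for b in range(NUM_BUCKETS)]
```

Reading a row back means issuing a get() for each entry of candidate_keys(key), so the number of lookups grows with the number of buckets.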

Now this is much different from taking the hash, truncating it, and then appending the key: hash(key).toString.substring[1,2]+key [OK, not code, but you get the idea].
Using this, I can still use a single get() to fetch the row if I know the key. Again, with salting you can't do that.
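
A rough Python version of that idea, as a sketch only: it assumes MD5 as the hash (per the subject line) and keeps the first two hex characters, i.e. one byte, as the prefix. The function name hashed_key and the prefix_len parameter are hypothetical.

```python
import hashlib


def hashed_key(key: str, prefix_len: int = 2) -> str:
    """Prefix the key with the first hex characters of its MD5 digest.

    The prefix depends only on the key, so a reader who knows the key
    can rebuild the full rowkey and fetch the row with one get().
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return digest[:prefix_len] + key
```

Because hashed_key(key) returns the same value every time it is called with the same key, the read path computes one rowkey and does one lookup.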

What I find troublesome is that there are better design patterns which solve the same problem.

HTH

-Mike

On Dec 20, 2012, at 12:15 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:

> I think there's some hair-splitting going on here. The term "salting," by
> strict definition [0] from the cryptographic context, means the
> introduction of randomness to produce a one-way encoding of a value. The
> technique for rowkey design described here does not include the
> introduction of said randomness, nor is it a one-way operation. Instead, it
> divides the input more or less evenly across a fixed number of buckets.
>
> If you've introduced one-way randomness, you have no way to reproduce the
> rowkey after it's been inserted. Thus, accessing a specific row would be
> O(n) where n is the number of rows in the table -- you'd have to scan the
> table looking for the desired row. Use of a bucketing technique, on the
> other hand, adds a small amount of computation to calculate the true
> rowkey, making access to a single row a O(1) + C vs the number of rows in
> the table.
>
> -n
>
> [0]: http://en.wikipedia.org/wiki/Salt_(cryptography)
>
> On Thu, Dec 20, 2012 at 5:20 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
>> Lars,
>>
>> Ok... he's talking about buckets.
>>
>> So when you have N buckets, what is the least number of get()s you need
>> to fetch the single row?
>> (Hint: The answer is N)
>>
>> How many scans? (N again)
>>
>> Do you disagree?
>>
>>
>> On Dec 19, 2012, at 8:06 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>>
>>> Mike, please think about what you write before you write it.
>>> You will most definitely not need a full table scan (much less a *FULL*
>>> *TABLE* *SCAN* ;-) ).
>>>
>>> Read Alex's blog post again, it's a good post (IMHO). He is talking
>>> about buckets.
>>>
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>> From: Michael Segel <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]
>>> Sent: Wednesday, December 19, 2012 5:23 PM
>>> Subject: Re: Is it necessary to set MD5 on rowkey?
>>>
>>> Ok,
>>>
>>> Let's try this one more time...
>>>
>>> If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order to
>>> retrieve the row.
>>> If you do something like a salt that uses only a preset of N
>>> combinations, you will have to do N get()s in order to fetch the row.
>>>
>>> This is bad. VERY BAD.
>>>
>>> If you hash the row, you will get a consistent value each time you hash
>>> the key. If you use SHA-1, a collision is mathematically possible but
>>> highly improbable. So people have recommended that they append the key
>>> to the hash to form the new key. Here, you might as well truncate the
>>> hash to just the most significant byte or two and then append the key.
>>> This will give you enough of an even distribution that you can avoid
>>> hot spotting.
>>>
>>> So if I use the hash, I can effectively still get the row of data back
>>> with a single get(). Otherwise it's a full table scan.
>>>
>>> Do you see the difference?
>>>
>>>
>>> On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari <