HBase, mail # user - Is it necessary to set MD5 on rowkey?


bigdata 2012-12-18, 09:20
Doug Meil 2012-12-18, 13:40
Damien Hardy 2012-12-18, 09:33
Michael Segel 2012-12-18, 13:52
bigdata 2012-12-18, 15:20
Alex Baranau 2012-12-18, 17:12
Michael Segel 2012-12-18, 17:24
Alex Baranau 2012-12-18, 17:36
Michael Segel 2012-12-18, 23:29
lars hofhansl 2012-12-19, 18:37
Michael Segel 2012-12-19, 19:46
lars hofhansl 2012-12-19, 20:51
Michael Segel 2012-12-19, 21:02
David Arthur 2012-12-19, 21:26
Nick Dimiduk 2012-12-19, 22:15
Andrew Purtell 2012-12-19, 22:28
David Arthur 2012-12-19, 23:04
Alex Baranau 2012-12-19, 23:07
Michael Segel 2012-12-20, 01:09
Michael Segel 2012-12-20, 01:02
Jean-Marc Spaggiari 2012-12-20, 01:11
Michael Segel 2012-12-20, 01:23
Jean-Marc Spaggiari 2012-12-20, 01:35
Michel Segel 2012-12-20, 01:47
lars hofhansl 2012-12-20, 02:06
Re: Is it necessary to set MD5 on rowkey?
Michael Segel 2012-12-20, 13:20
Lars,

Ok... he's talking about buckets.

So when you have N buckets, what is the least number of get()s you need to fetch a single row?
(Hint: The answer is N)

How many scans? (N again)

Do you disagree?
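
For readers following the thread, here is a minimal sketch of the two key layouts being argued over. It is illustrative only: the function names, the choice of MD5, the one-byte prefix, and the bucket count are all assumptions, and it builds plain byte strings rather than calling the HBase client. With a prefix derived from the key, a reader can recompute the stored key and issue one get(). With a prefix assigned round-robin or at random, the reader cannot know which bucket was used, so it has to try each of the N candidates.

```python
import hashlib
from typing import List

N_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def hashed_key(key: bytes, prefix_len: int = 1) -> bytes:
    """Prefix taken from a hash of the key itself.

    The same key always yields the same prefix, so a reader can
    rebuild the stored row key and fetch it with a single get().
    """
    return hashlib.md5(key).digest()[:prefix_len] + key


def candidate_keys(key: bytes, n_buckets: int = N_BUCKETS) -> List[bytes]:
    """Prefix assigned round-robin or at random at write time.

    The reader cannot recompute which bucket was used, so it must
    try one candidate row key per bucket (N lookups in the worst case).
    """
    return [bytes([bucket]) + key for bucket in range(n_buckets)]
```

Under these assumptions, `hashed_key(b"row1")` always returns the same single row key, while `candidate_keys(b"row1")` returns one key per bucket, each of which would need its own lookup.
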
On Dec 19, 2012, at 8:06 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Mike, please think about what you write before you write it.
> You will most definitely not need a full table scan (much less a *FULL* *TABLE* *SCAN* ;-) ).
>
> Read Alex's blog post again, it's a good post (IMHO). He is talking about buckets.
>
>
> -- Lars
>
>
>
> ________________________________
> From: Michael Segel <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, December 19, 2012 5:23 PM
> Subject: Re: Is it necessary to set MD5 on rowkey?
>
> Ok,
>
> Let's try this one more time...
>
> If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order to retrieve the row.
> If you do something like a salt that uses only a preset of N combinations, you will have to do N get()s in order to fetch the row.
>
> This is bad. VERY BAD.
>
> If you hash the row, you will get a consistent value each time you hash the key.  If you use SHA-1, a collision is mathematically possible, however highly improbable. So people have recommended appending the key to the hash to form the new key. Here, you might as well truncate the hash to just the most significant byte or two and then append the key. This will give you enough of an even distribution that you can avoid hot spotting.
>
> So if I use the hash, I can effectively still get the row of data back with a single get(). Otherwise it's a full table scan.
>
> Do you see the difference?
>
>
> On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
>
>> Hi Mike,
>>
>> If in your business case, the only thing you need when you retrieve
>> your data is to do full scan over MR jobs, then you can salt with
>> what-ever you want. Hash, random values, etc.
>>
>> If you know you have x regions, then you can simply do a round-robin
>> salting, or a random salting over those x regions.
>>
>> Then when you run your MR job, you discard the first bytes, and do
>> what you want with your data.
>>
>> So I also think that salting can still be useful. It all depends on what
>> you do with your data.
>>
>> Just my opinion.
>>
>> JM
>>
>> 2012/12/19, Michael Segel <[EMAIL PROTECTED]>:
>>> Ok...
>>>
>>> So you use a random byte or two at the front of the row.
>>> How do you then use get() to find the row?
>>> How do you do a partial scan()?
>>>
>>> Do you start to see the problem?
>>> The only way to get to the row is to do a full table scan. That kills HBase
>>> and you would be better off going with a partitioned Hive table.
>>>
>>> Using a hash of the key or a portion of the hash is not a salt.
>>> That's not what I have a problem with. Each time you want to fetch the key,
>>> you just hash it, truncate the hash and then prepend it to the key. You will
>>> then be able to use get().
>>>
>>> Using a salt would imply using some form of modulo math to get a round
>>> robin prefix.  Or a random number generator.
>>>
>>> That's the issue.
>>>
>>> Does that make sense?
>>>
>>>
>>>
>>> On Dec 19, 2012, at 3:26 PM, David Arthur <[EMAIL PROTECTED]> wrote:
>>>
>>>> Let's say you want to decompose a url into domain and path to include in
>>>> your row key.
>>>>
>>>> You could of course just use the url as the key, but you will see
>>>> hotspotting since most will start with "http". To mitigate this, you could
>>>> add a random byte or two at the beginning (random salt) to improve
>>>> distribution of keys, but you break single record Gets (and Scans
>>>> arguably). Another approach is to use a hash-based salt: hash the whole
>>>> key and use a few of those bytes as a salt. This fixes Gets but Scans are
>>>> still not effective.
>>>>
>>>> One approach I've taken is to hash only a part of the key. Consider the
>>>> following key structure
Nick Dimiduk 2012-12-20, 18:15
Michael Segel 2012-12-20, 20:15
k8 robot 2013-02-06, 01:46