HBase, mail # user - Is it necessary to set MD5 on rowkey?


Re: Is it necessary to set MD5 on rowkey?
Jean-Marc Spaggiari 2012-12-20, 01:35
I have to disagree that a *FULL* *TABLE* *SCAN* is needed in order to retrieve the row.

If I know that I have one byte salting between 1 and 10, I will have
to do 10 gets to get the row. And they will most probably all be on
different RS, so it will not be more than 1 get per server. This will
take almost the same time as doing a simple get.
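
As a rough illustration of the 10-get lookup above (a minimal Python sketch, not HBase client code; `salted_candidates` and the key `user42` are made-up names, and it assumes a one-byte salt in 1..10 prepended to the original key):

```python
def salted_candidates(row_key: bytes, buckets: int = 10) -> list:
    """Return every key the row could be stored under when a
    one-byte salt in 1..buckets was prepended at write time."""
    return [bytes([salt]) + row_key for salt in range(1, buckets + 1)]


# Hypothetical key; the reader does not know which salt was used.
candidates = salted_candidates(b"user42")
```

Each candidate would then be sent as its own get, and at most one of them finds the row.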

I understand your point that salting introduces some problems, but
on the other side, it's easy and can still be useful. A hash will allow
you direct access with one call, but you still need to calculate the
hash. So what's faster? Calculate the hash and do one call to one
server? Or go directly with one call to multiple servers? It all
depends on the way you access your data.
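
For comparison, a minimal sketch of the hash approach (again plain Python, not HBase client code; `hashed_key` is a made-up helper, and MD5 with a one-byte prefix is only an assumption taken from the thread subject):

```python
import hashlib


def hashed_key(row_key: bytes, prefix_len: int = 1) -> bytes:
    """Prepend the first prefix_len bytes of the key's MD5 digest.
    The prefix depends only on the key, so a reader can rebuild it."""
    digest = hashlib.md5(row_key).digest()
    return digest[:prefix_len] + row_key


# Writer and reader compute the same stored key from the same input.
stored = hashed_key(b"user42")
lookup = hashed_key(b"user42")
```

Because the prefix can be recomputed from the key alone, a single get is enough; the cost is one hash calculation per lookup.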

Personally, I'm using hashes almost everywhere, but I still understand
that some people might be able to use salting for their specific
purposes.

JM

2012/12/19, Michael Segel <[EMAIL PROTECTED]>:
> Ok,
>
> Let's try this one more time...
>
> If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order to
> retrieve the row.
> If you do something like a salt that uses only a preset of N combinations,
> you will have to do N get()s in order to fetch the row.
>
> This is bad. VERY BAD.
>
> If you hash the row, you will get a consistent value each time you hash the
> key.  If you use SHA-1, a collision is mathematically possible,
> however highly improbable. So people have recommended that they append the
> key to the hash to form the new key. Here, you might as well truncate the
> hash to just the most significant byte or two and then append the key. This
> will give you enough of an even distribution that you can avoid hot
> spotting.
>
> So if I use the hash, I can effectively still get the row of data back with
> a single get(). Otherwise it's a full table scan.
>
> Do you see the difference?
>
>
> On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]>
> wrote:
>
>> Hi Mike,
>>
>> If, in your business case, the only thing you need when you retrieve
>> your data is to do a full scan in MR jobs, then you can salt with
>> whatever you want. Hash, random values, etc.
>>
>> If you know you have x regions, then you can simply do a round-robin
>> salting, or a random salting over those x regions.
>>
>> Then when you run your MR job, you discard the first bytes, and do
>> what you want with your data.
>>
>> So I also think that salting can still be useful. It all depends on what
>> you do with your data.
>>
>> Just my opinion.
>>
>> JM
>>
>> 2012/12/19, Michael Segel <[EMAIL PROTECTED]>:
>>> Ok...
>>>
>>> So you use a random byte or two at the front of the row.
>>> How do you then use get() to find the row?
>>> How do you do a partial scan()?
>>>
>>> Do you start to see the problem?
>>> The only way to get to the row is to do a full table scan. That kills
>>> HBase
>>> and you would be better off going with a partitioned Hive table.
>>>
>>> Using a hash of the key or a portion of the hash is not a salt.
>>> That's not what I have a problem with. Each time you want to fetch the
>>> key,
>>> you just hash it, truncate the hash and then prepend it to the key. You
>>> will
>>> then be able to use get().
>>>
>>> Using a salt would imply using some form of modulo math to get a round
>>> robin prefix.  Or a random number generator.
>>>
>>> That's the issue.
>>>
>>> Does that make sense?
>>>
>>>
>>>
>>> On Dec 19, 2012, at 3:26 PM, David Arthur <[EMAIL PROTECTED]> wrote:
>>>
>>>> Let's say you want to decompose a url into domain and path to include
>>>> in
>>>> your row key.
>>>>
>>>> You could of course just use the url as the key, but you will see
>>>> hotspotting since most will start with "http". To mitigate this, you
>>>> could
>>>> add a random byte or two at the beginning (random salt) to improve
>>>> distribution of keys, but you break single record Gets (and Scans
>>>> arguably). Another approach is to use a hash-based salt: hash the whole
>>>> key and use a few of those bytes as a salt. This fixes Gets but Scans