HBase, mail # user - Is it necessary to set MD5 on rowkey?

Re: Is it necessary to set MD5 on rowkey?
Jean-Marc Spaggiari 2012-12-20, 01:11
Hi Mike,

If, in your business case, the only thing you need when you retrieve
your data is to do full scans in MR jobs, then you can salt with
whatever you want: a hash, random values, etc.

If you know you have x regions, then you can simply do a round-robin
salting, or a random salting over those x regions.

Then when you run your MR job, you discard the first bytes, and do
what you want with your data.
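
As a rough sketch of that idea (not HBase client code), the snippet below prepends a round-robin salt byte on write and strips it again on read, the way a mapper would. The names NUM_REGIONS, SALT_LEN, salt_round_robin and strip_salt are made up for illustration, and a single salt byte is an assumption.

```python
import itertools

NUM_REGIONS = 4   # hypothetical "x": the number of regions to spread writes over
SALT_LEN = 1      # assumption: one salt byte, enough for up to 256 buckets

_counter = itertools.count()

def salt_round_robin(key: bytes) -> bytes:
    """Prepend a round-robin salt byte so consecutive writes land in different buckets."""
    bucket = next(_counter) % NUM_REGIONS
    return bytes([bucket]) + key

def strip_salt(row_key: bytes) -> bytes:
    """What the MR job would do: discard the first bytes and recover the original key."""
    return row_key[SALT_LEN:]

salted = [salt_round_robin(k) for k in (b"row-a", b"row-b", b"row-c", b"row-d", b"row-e")]
originals = [strip_salt(k) for k in salted]
```

The salt here depends on write order, not on the key, so this only suits the full-scan case described above.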

So I also think that salting can still be useful. It all depends on what
you do with your data.

Just my opinion.


2012/12/19, Michael Segel <[EMAIL PROTECTED]>:
> Ok...
> So you use a random byte or two at the front of the row.
> How do you then use get() to find the row?
> How do you do a partial scan()?
> Do you start to see the problem?
> The only way to get to the row is to do a full table scan. That kills HBase
> and you would be better off going with a partitioned Hive table.
> Using a hash of the key or a portion of the hash is not a salt.
> That's not what I have a problem with. Each time you want to fetch the key,
> you just hash it, truncate the hash and then prepend it to the key. You will
> then be able to use get().
> Using a salt would imply using some form of a modulo math to get a round
> robin prefix.  Or a random number generator.
> That's the issue.
> Does that make sense?
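
A minimal sketch of the hash-prefix approach described in the message above, using a plain dict as a stand-in for the table. MD5 and the two-byte prefix length are assumptions taken from the thread subject and the later examples; hashed_row_key is a hypothetical helper, not an HBase API.

```python
import hashlib

PREFIX_LEN = 2  # assumption: keep the first two bytes of the digest

def hashed_row_key(key: bytes) -> bytes:
    """Hash the key, truncate the hash, and prepend it to the key."""
    prefix = hashlib.md5(key).digest()[:PREFIX_LEN]
    return prefix + key

# A dict stands in for the table; a lookup by exact key plays the role of get().
table = {}
table[hashed_row_key(b"user123")] = "some value"

# The reader only knows the original key, but can recompute the same row key.
found = table.get(hashed_row_key(b"user123"))
```

Because the prefix is a pure function of the key, the reader can always rebuild the row key. A random or round-robin prefix cannot be recomputed this way, which is the objection raised above.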
> On Dec 19, 2012, at 3:26 PM, David Arthur <[EMAIL PROTECTED]> wrote:
>> Let's say you want to decompose a url into domain and path to include in
>> your row key.
>> You could of course just use the url as the key, but you will see
>> hotspotting since most will start with "http". To mitigate this, you could
>> add a random byte or two at the beginning (random salt) to improve
>> distribution of keys, but you break single record Gets (and Scans
>> arguably). Another approach is to use a hash-based salt: hash the whole
>> key and use a few of those bytes as a salt. This fixes Gets but Scans are
>> still not effective.
>> One approach I've taken is to hash only a part of the key. Consider the
>> following key structure
>> <2 bytes of hash(domain)><domain><path>
>> With this you get 16 bits for a hash-based salt. The salt is deterministic
>> so Gets work fine, and for a single domain the salt is the same so you can
>> easily do Scans across a domain. If you had some further structure to your
>> key that you wished to scan across, you could do something like:
>> <2 bytes of hash(domain)><domain><2 bytes of hash(path)><path>
>> It really boils down to identifying your access patterns and read/write
>> requirements and constructing a row key accordingly.
>> HTH,
>> David
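
The key layout <2 bytes of hash(domain)><domain><path> from the message above can be sketched as follows. A sorted Python list stands in for HBase's lexicographically ordered row keys, and prefix_scan is a hypothetical helper that mimics a scan over one domain; MD5 and the sample URLs are assumptions for illustration.

```python
import bisect
import hashlib

def domain_prefix(domain: str) -> bytes:
    """<2 bytes of hash(domain)><domain>: identical for every row of one domain."""
    d = domain.encode("utf-8")
    return hashlib.md5(d).digest()[:2] + d

def row_key(domain: str, path: str) -> bytes:
    """<2 bytes of hash(domain)><domain><path>"""
    return domain_prefix(domain) + path.encode("utf-8")

def prefix_scan(sorted_keys, prefix: bytes):
    """Return the contiguous run of keys that start with prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    hits = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        hits.append(key)
    return hits

keys = sorted([
    row_key("example.com", "/a"),
    row_key("example.com", "/b"),
    row_key("example.org", "/a"),
])
hits = prefix_scan(keys, domain_prefix("example.com"))
```

All rows of a domain share one prefix, so they sort next to each other and a single range scan finds them, while different domains are spread by their hash bytes.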
>> On 12/18/12 6:29 PM, Michael Segel wrote:
>>> Alex,
>>> And that's the point. Salt as you explain it conceptually implies that
>>> the number you are adding to the key to ensure a better distribution
>>> means that you will have inefficiencies in terms of scans and gets.
>>> Using a hash as either the full key, or taking the hash, truncating it
>>> and appending the key may screw up scans, but your get() is intact.
>>> There are other options like inverting the numeric key ...
>>> And of course doing nothing.
>>> Using a salt as part of the design pattern is bad.
>>> With respect to the OP, I was discussing the use of hash and some
>>> alternatives to how to implement the hash of a key.
>>> Again, doing nothing may also make sense, if you understand the risks
>>> and you know how your data is going to be used.
>>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[EMAIL PROTECTED]>
>>> wrote:
>>>> Mike,
>>>> Please read the *full post* before judging, in particular the
>>>> "Hash-based distribution" section. You can find the same in the small
>>>> HBaseWD README file [1] (not sure if you read it at all before
>>>> commenting on the lib). Round robin is mainly for explaining the
>>>> concept/idea (though not only for that).
>>>> Thank you,
>>>> Alex Baranau
>>>> ------
>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch