HBase >> mail # user >> Region Splits

The downside of hashing is not that it's unpredictable, but that it's
non-reversible (which is why you need to append the original key).
Reversing should be fine, just make sure that you perform a byte-order
reversal so that you have uniform distribution.
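The point about non-reversibility can be sketched in plain Java: because a hash digest is one-way, the original id has to be appended after the fixed-length digest so it can be recovered at read time. A minimal illustration, assuming MD5 (16 bytes) as the hash; the helper names are hypothetical, not from any HBase API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class KeyDesign {
    // Hash-prefixed key: 16-byte MD5 digest followed by the original id.
    // The digest spreads writes, the appended id makes the key recoverable.
    static byte[] hashPrefixedKey(String id) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
            byte[] idBytes = id.getBytes(StandardCharsets.UTF_8);
            byte[] key = new byte[hash.length + idBytes.length];
            System.arraycopy(hash, 0, key, 0, hash.length);
            System.arraycopy(idBytes, 0, key, hash.length, idBytes.length);
            return key;
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // Recover the original id by skipping the fixed-length digest prefix.
    static String idFromKey(byte[] key) {
        return new String(key, 16, key.length - 16, StandardCharsets.UTF_8);
    }
}
```

This is also where the 16-byte overhead mentioned later in the thread comes from: the digest prefix is pure padding from the application's point of view.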

On 11/22/11 7:47 PM, "Mark" <[EMAIL PROTECTED]> wrote:

>Ok so this would be "short scans"?
>In my use case this would be unnecessary, so I think I'm going to run with
>the reversed id technique. I'm actually surprised I've never heard of
>anyone using this over the non-predictable hashing.
>On 11/22/11 5:35 PM, Sam Seigal wrote:
>> If you are prefixing your keys with predictable hashes, you can do
>> range scans - i.e. create a scanner for each prefix and then merge
>> results at the client. With unpredictable hashes and key reversals,
>> this might not be entirely possible.
>> I remember someone on the mailing list mentioning that Mozilla Socorro
>> uses a similar technique. I haven't had a chance to look at their code
>> yet, but that is something you might want to look at.
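The predictable-prefix idea can be simulated outside HBase: if the salt bucket is derived deterministically from the id, a client can fan out one scan per bucket prefix and merge the results in id order. A minimal sketch over a sorted set standing in for the table; the bucket count, key layout, and helper names are all assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableSet;

public class SaltedScan {
    static final int BUCKETS = 4;

    // Predictable salt: the bucket is computed from the id itself,
    // so it can be recomputed at read time for any id range.
    static String saltedKey(String id) {
        int bucket = Math.floorMod(id.hashCode(), BUCKETS);
        return bucket + "|" + id;
    }

    // Logical range scan: one per-bucket scan over [b|from, b|to),
    // then strip the salt and merge into id order at the client.
    static List<String> rangeScan(NavigableSet<String> table,
                                  String from, String to) {
        List<String> ids = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            for (String key : table.subSet(b + "|" + from, b + "|" + to)) {
                ids.add(key.substring(key.indexOf('|') + 1));
            }
        }
        Collections.sort(ids);
        return ids;
    }
}
```

With an unpredictable salt (or a reversed key), there is no small set of prefixes to enumerate, which is the limitation being described above.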
>> On Tue, Nov 22, 2011 at 5:11 PM, Mark <[EMAIL PROTECTED]> wrote:
>>> What do you mean by "short scans"?
>>> I understand that scans will not be possible with this method, but
>>> would they be if I hashed them? It seems like I'm in the same boat.
>>> On 11/22/11 5:00 PM, Amandeep Khurana wrote:
>>>> Mark
>>>> Key designs depend on expected access patterns and use cases. From a
>>>> theoretical stand point, what you are saying will work to distribute
>>>> writes but if you want to access a small range, you'll need to fan out
>>>> your reads and can't leverage short scans.
>>>> Amandeep
>>>> On Nov 22, 2011, at 4:55 PM, Mark <[EMAIL PROTECTED]> wrote:
>>>>> I just thought of something.
>>>>> In cases where the id is sequential, couldn't one simply reverse the
>>>>> id to get more of a uniform distribution?
>>>>> 510911 => 119015
>>>>> 510912 => 219015
>>>>> 510913 => 319015
>>>>> 510914 => 419015
>>>>> That seems like a reasonable alternative that doesn't require
>>>>> prefixing each row key with an extra 16 bytes. Am I wrong in thinking
>>>>> this could work?
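The mapping above is just a digit reversal, which can be checked with a one-line helper. One caveat worth noting as an assumption of the technique: the ids need a fixed width (zero-padded if necessary), otherwise ids of different lengths reverse into overlapping ranges.

```java
public class ReversedId {
    // Reverse the decimal digits of a fixed-width id. Variable-width ids
    // would need zero-padding first for the distribution to hold.
    static String reverse(String id) {
        return new StringBuilder(id).reverse().toString();
    }
}
```

Reversal is its own inverse, so reading a row back only requires reversing the stored key again, with no extra bytes appended.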
>>>>> On 11/22/11 12:46 PM, Nicolas Spiegelberg wrote:
>>>>>> If you increase the region size to 2GB, then all regions (current &
>>>>>> new) will avoid a split until their aggregate StoreFile size reaches
>>>>>> that limit.  Reorganizing the regions for a uniform growth pattern is
>>>>>> a schema design problem.  There is the capability to merge two
>>>>>> regions if you know that your data growth pattern is non-uniform.
>>>>>> StumbleUpon & other companies have more experience with those
>>>>>> than I do.
>>>>>> Note: With the introduction of HFileV2 in 0.92, you'll definitely
>>>>>> want to lean towards increasing the region size.  HFile scalability
>>>>>> code is more mature/stable than the region splitting code.  Plus,
>>>>>> automatic splitting is harder to optimize & debug when failures
>>>>>> occur.
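The region size being discussed is controlled by the `hbase.hregion.max.filesize` property, set in bytes in `hbase-site.xml`. A sketch of the 256MB-to-2GB change described below (the value is 2 GB expressed in bytes; confirm the property name against your HBase version's defaults):

```xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- 2 GB in bytes; regions split once their StoreFiles exceed this -->
  <value>2147483648</value>
</property>
```

As noted in the reply above, raising this affects when new splits happen; it does not by itself reorganize existing regions.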
>>>>>> On 11/22/11 12:20 PM, "Srikanth P. Shreenivas"
>>>>>> <[EMAIL PROTECTED]>     wrote:
>>>>>>> Thanks Nicolas for the clarification.  I had a follow-up query.
>>>>>>> What will happen if we increased the region size, say from current
>>>>>>> value
>>>>>>> of 256 MB to a new value of 2GB?
>>>>>>> Will existing regions continue to use only 256 MB space?
>>>>>>> Is there a way to reorganize the regions so that each region
>>>>>>> grows to 2GB size?
>>>>>>> Thanks,
>>>>>>> Srikanth
>>>>>>> -----Original Message-----
>>>>>>> From: Nicolas Spiegelberg [mailto:[EMAIL PROTECTED]]
>>>>>>> Sent: Tuesday, November 22, 2011 10:59 PM
>>>>>>> Subject: Re: Region Splits
>>>>>>> No.  The purpose of major compactions is to merge & dedupe