HBase, mail # user - Re: row filter - binary comparator at certain range


Re: row filter - binary comparator at certain range
Michael Segel 2013-10-22, 00:58
James,

It's evenly distributed; however, because it's a timestamp, every write is a 'tail end charlie' addition.
So when you split a region, the top half is never added to, and you end up with all regions half filled except for the last region in each 'modded' value.

I wouldn't say it's a bad thing if you plan for it.
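
To make the half-filled-regions point concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: `simulate`, `buckets` and `max_rows` are made-up names, `zlib.crc32` merely stands in for whatever stable hash is really used, and a "split" is modelled as cutting a full region into two equal halves. Because the timestamps only grow, each bucket only ever writes to its newest region.

```python
import zlib


def simulate(num_rows, buckets=8, max_rows=1000):
    """Load monotonically increasing timestamp keys into `buckets` key spaces.

    Returns {bucket: [row count of each region, oldest first]}.
    Illustrative model only; real HBase splits on store size, not row count.
    """
    regions = {b: [0] for b in range(buckets)}
    for i in range(num_rows):
        key = str(1382400000000 + i).encode()   # hypothetical timestamp row key
        bucket = zlib.crc32(key) % buckets      # stand-in for the real hash, modded
        sizes = regions[bucket]
        sizes[-1] += 1                          # new keys always land in the newest region
        if sizes[-1] >= max_rows:               # region is full: split it in the middle
            half = sizes[-1] // 2
            sizes[-1:] = [half, sizes[-1] - half]
    return regions


if __name__ == "__main__":
    for bucket, sizes in sorted(simulate(100_000).items()):
        closed, tail = sizes[:-1], sizes[-1]
        print(bucket, "closed regions:", len(closed), "tail region rows:", tail)
```

In this model every region except the newest one in each bucket stops at half of `max_rows`, which is why doubling the configured region size is one way to plan for it.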

On Oct 21, 2013, at 5:07 PM, James Taylor <[EMAIL PROTECTED]> wrote:

> We don't truncate the hash, we mod it. Why would you expect that data
> wouldn't be evenly distributed? We've not seen this to be the case.
>
>
>
> On Mon, Oct 21, 2013 at 1:48 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
>> What do you call hashing the row key?
>> Or hashing the row key and then appending the row key to the hash?
>> Or hashing the row key, truncating the hash value to some subset and then
>> appending the row key to the value?
>>
>> The problem is that there is specific meaning to the term salt. Re-using
>> it here will cause confusion because you're implying something you don't
>> mean to imply.
>>
>> You could say prepend a truncated hash of the key, however… is prepend a
>> real word? ;-) (I am sorry, I am not a grammar nazi, nor an English major.)
>>
>> So even outside of Phoenix, the concept is the same.
>> Even with a truncated hash, you will find that over time, all but the tail
>> N regions will only be half full.
>> This could be both good and bad.
>>
>> (Where N is your number 8 or 16 allowable hash values.)
>>
>> You've solved potentially one problem… but still have other issues that
>> you need to address.
>> I guess the simple answer is to double the region sizes and not care that
>> most of your regions will be 1/2 the max size (which is the size you really
>> want), while 8-16 regions will be up to twice as big.
>>
>>
>>
>> On Oct 21, 2013, at 3:26 PM, James Taylor <[EMAIL PROTECTED]> wrote:
>>
>>> What do you think it should be called, because
>>> "prepending-row-key-with-single-hashed-byte" doesn't have a very good ring
>>> to it. :-)
>>>
>>> Agree that getting the row key design right is crucial.
>>>
>>> The range of "prepending-row-key-with-single-hashed-byte" is declarative
>>> when you create your table in Phoenix, so you typically declare an upper
>>> bound based on your cluster size (not 255, but maybe 8 or 16). We've run
>>> the numbers and it's typically faster, but as with most things, not always.
>>>
>>> HTH,
>>> James
>>>
>>>
>>> On Mon, Oct 21, 2013 at 1:05 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>>>
>>>> Then it's not a SALT. And please don't use the term 'salt' because it has
>>>> a specific meaning outside of what you want it to mean. Just like saying
>>>> HBase has ACID because you write the entire row as an atomic element. But
>>>> I digress….
>>>>
>>>> Ok so to your point…
>>>>
>>>> 1 byte == 256 possible values.
>>>>
>>>> So which will be faster:
>>>>
>>>> creating a list of the 1 byte truncated hash of each possible timestamp in
>>>> your range, or doing 256 separate range scans with the start and stop range
>>>> key set?
>>>>
>>>> That will give you the results you want, however… I'd go back and have
>>>> them possibly rethink the row key if they can… assuming this is the base
>>>> access pattern.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Oct 21, 2013, at 11:37 AM, James Taylor <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Phoenix restricts salting to a single byte.
>>>>> Salting perhaps is misnamed, as the salt byte is a stable hash based on the
>>>>> row key.
>>>>> Phoenix's skip scan supports sub-key ranges.
>>>>> We've found salting in general to be faster (though there are cases where
>>>>> it's not), as it ensures better parallelization.
>>>>>
>>>>> Regards,
>>>>> James
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Oct 21, 2013 at 9:14 AM, Vladimir Rodionov <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> FuzzyRowFilter does not work on sub-key ranges.
>>>>>> Salting is bad for any scan operation, unfortunately. When salt prefix

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com
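
For readers following the thread, a minimal sketch of the "prepend one hashed byte, modded by the bucket count" idea and of what it costs at read time. This is not Phoenix's implementation: `salt_byte`, `salted_key` and `salted_scan_ranges` are hypothetical helper names, and `zlib.crc32` is only a stand-in for the real stable hash. The point is that a range over the original key turns into one start/stop range per bucket.

```python
import zlib


def salt_byte(row_key: bytes, buckets: int) -> int:
    """Stable one-byte prefix: hash of the row key, modded by the bucket count."""
    return zlib.crc32(row_key) % buckets        # crc32 is an assumption, not the real hash


def salted_key(row_key: bytes, buckets: int) -> bytes:
    """Prepend the prefix byte; the original key is kept intact after it."""
    return bytes([salt_byte(row_key, buckets)]) + row_key


def salted_scan_ranges(start: bytes, stop: bytes, buckets: int):
    """One [start, stop) range per bucket is needed to cover a key range."""
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(buckets)]


if __name__ == "__main__":
    buckets = 8                                  # e.g. 8 or 16, as discussed in the thread
    key = salted_key(b"20131021120000", buckets)
    ranges = salted_scan_ranges(b"20131021", b"20131022", buckets)
    hits = [r for r in ranges if r[0] <= key < r[1]]
    print(len(ranges), "scans; key falls in", len(hits), "of them")
```

With a small declared bucket count the fan-out is 8 or 16 parallel range scans rather than one per possible byte value, which is the trade-off being argued over above.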