HBase, mail # user - Re: row filter - binary comparator at certain range


Re: row filter - binary comparator at certain range
James Taylor 2013-10-21, 22:07
We don't truncate the hash, we mod it. Why would you expect that data
wouldn't be evenly distributed? We've not seen this to be the case.
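
A minimal sketch of the "mod, don't truncate" idea, for readers following along. The hash function (SHA-256), the helper names `salt_byte` and `salted_key`, and the bucket count of 16 are assumptions made for illustration; the thread does not say which hash Phoenix actually uses.

```python
import hashlib
from collections import Counter


def salt_byte(row_key: bytes, buckets: int) -> int:
    """Return a stable value in [0, buckets) derived from the row key.

    The hash is reduced with a modulo rather than truncated, so the
    number of buckets does not have to be a power of two.
    """
    if not 1 <= buckets <= 256:
        raise ValueError("buckets must fit in a single byte (1..256)")
    digest = hashlib.sha256(row_key).digest()  # hash choice is an assumption
    return int.from_bytes(digest[:4], "big") % buckets


def salted_key(row_key: bytes, buckets: int) -> bytes:
    """Prepend the one-byte salt to the original row key."""
    return bytes([salt_byte(row_key, buckets)]) + row_key


# Monotonically increasing keys (timestamp-like) are the classic
# hot-spotting case; count how many land in each of 16 buckets.
keys = [str(1382392800000 + i).encode() for i in range(16000)]
counts = Counter(salt_byte(k, 16) for k in keys)
for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

Because the salt depends only on the row key, a point lookup can recompute it and go straight to the right bucket.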

On Mon, Oct 21, 2013 at 1:48 PM, Michael Segel <[EMAIL PROTECTED]> wrote:

> What do you call hashing the row key?
> Or hashing the row key and then appending the row key to the hash?
> Or hashing the row key, truncating the hash value to some subset and then
> appending the row key to the value?
>
> The problem is that there is specific meaning to the term salt. Re-using
> it here will cause confusion because you're implying something you don't
> mean to imply.
>
> You could say "prepend a truncated hash of the key"; however… is "prepend" a
> real word? ;-) (I am sorry, I am not a grammar nazi, nor an English major.)
>
> So even outside of Phoenix, the concept is the same.
> Even with a truncated hash, you will find that over time, all but the tail
> N regions will only be half full.
> This could be both good and bad.
>
> (Where N is your number of allowable hash values, 8 or 16.)
>
> You've solved potentially one problem… but still have other issues that
> you need to address.
> I guess the simple answer is to double the region sizes and not care that
> most of your regions will be 1/2 the max size (which is the size you really
> want), while 8-16 regions will be up to twice as big.
>
>
>
> On Oct 21, 2013, at 3:26 PM, James Taylor <[EMAIL PROTECTED]> wrote:
>
> > What do you think it should be called, because
> > "prepending-row-key-with-single-hashed-byte" doesn't have a very good
> > ring to it. :-)
> >
> > Agree that getting the row key design right is crucial.
> >
> > The range of "prepending-row-key-with-single-hashed-byte" is declarative
> > when you create your table in Phoenix, so you typically declare an upper
> > bound based on your cluster size (not 255, but maybe 8 or 16). We've run
> > the numbers and it's typically faster, but as with most things, not always.
> >
> > HTH,
> > James
> >
> >
> > On Mon, Oct 21, 2013 at 1:05 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> >
> >> Then it's not a SALT. And please don't use the term 'salt' because it has
> >> a specific meaning outside of what you want it to mean. Just like saying
> >> HBase has ACID because you write the entire row as an atomic element. But
> >> I digress….
> >>
> >> Ok so to your point…
> >>
> >> 1 byte == 256 possible values.
> >>
> >> So which will be faster?
> >>
> >> Creating a list of the 1-byte truncated hash of each possible timestamp in
> >> your range, or doing 256 separate range scans with the start and stop range
> >> key set?
> >>
> >> That will give you the results you want, however… I'd go back and have
> >> them possibly rethink the row key if they can … assuming this is the base
> >> access pattern.
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >>
> >>
> >>
> >>
> >> On Oct 21, 2013, at 11:37 AM, James Taylor <[EMAIL PROTECTED]> wrote:
> >>
> >>> Phoenix restricts salting to a single byte.
> >>> Salting perhaps is misnamed, as the salt byte is a stable hash based on the
> >>> row key.
> >>> Phoenix's skip scan supports sub-key ranges.
> >>> We've found salting in general to be faster (though there are cases where
> >>> it's not), as it ensures better parallelization.
> >>>
> >>> Regards,
> >>> James
> >>>
> >>>
> >>>
> >>> On Mon, Oct 21, 2013 at 9:14 AM, Vladimir Rodionov
> >>> <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> FuzzyRowFilter does not work on sub-key ranges.
> >>>> Salting is bad for any scan operation, unfortunately. When salt prefix
> >>>> cardinality is small (1-2 bytes),
> >>>> one can try something similar to FuzzyRowFilter but with additional
> >>>> sub-key range support.
> >>>> If salt prefix cardinality is high (> 2 bytes) - do a full scan with your
> >>>> own Filter (for timestamp ranges).
> >>>>
> >>>> Best regards,
> >>>> Vladimir Rodionov
> >>>> Principal Platform Engineer
> >>>> Carrier IQ, www.carrieriq.com
> >>>> e-mail: [EMAIL PROTECTED]
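
The "one range scan per salt value" approach discussed in this thread can be sketched roughly as follows. This is an illustration under stated assumptions, not Phoenix's skip scan or any HBase API: the table is simulated as a sorted Python list, and `range_scan`, `salted_range_scan`, the SHA-256 hash, and the bucket count of 8 are all hypothetical names and choices.

```python
import bisect
import hashlib
import heapq

BUCKETS = 8  # hypothetical bucket count, declared when the table is created


def salt_byte(row_key: bytes, buckets: int = BUCKETS) -> int:
    """Stable one-byte salt: hash of the row key, reduced modulo buckets."""
    digest = hashlib.sha256(row_key).digest()  # hash choice is an assumption
    return int.from_bytes(digest[:4], "big") % buckets


def range_scan(table: list, start: bytes, stop: bytes) -> list:
    """Stand-in for a single range scan over [start, stop) on sorted keys."""
    lo = bisect.bisect_left(table, start)
    hi = bisect.bisect_left(table, stop)
    return table[lo:hi]


def salted_range_scan(table: list, start: bytes, stop: bytes,
                      buckets: int = BUCKETS) -> list:
    """Issue one range scan per salt value, then merge the sorted results."""
    per_bucket = []
    for salt in range(buckets):
        prefix = bytes([salt])
        rows = range_scan(table, prefix + start, prefix + stop)
        # Strip the salt byte so callers see the original row keys.
        per_bucket.append([row[1:] for row in rows])
    return list(heapq.merge(*per_bucket))


# Fixed-width timestamp keys, so byte order matches numeric order.
keys = [b"%013d" % ts for ts in range(1382392800000, 1382392801000)]
table = sorted(bytes([salt_byte(k)]) + k for k in keys)

result = salted_range_scan(table, b"1382392800100", b"1382392800200")
print(len(result), result[0], result[-1])
```

The number of scans grows with the bucket count, not with the size of the timestamp range, which is why a small declared upper bound (8 or 16 rather than every possible byte value) keeps the fan-out cheap. In a real cluster the per-bucket scans could also run in parallel.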