Re: Questions on FuzzyRowFilter
With 4 random bytes you get 2^32 possible prefixes, so your data will be
spread well enough across all the regions, but you won't easily be able to
use distributed scans to gather what you want.

Let's say you want to split (time+login) with a salted key and you expect to
be able to retrieve events from 20140429 pretty fast. Then I would split the
input data among 10 "spans", spread over 10 regions and 10 region servers
(i.e. `$random % 10'). To retrieve ordered data, I would parallelize Scans
over the 10 span groups (<00>-20140429, <01>-20140429...) and merge-sort
everything until I've got all the expected results.
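
To make the idea concrete, here is a rough sketch in plain Python rather than against the HBase client API. Everything in it is an assumption for illustration: the key layout `<salt>-<timestamp>-<login>`, the names `salted_key`, `scan_span` and `scan_day`, and the dict that stands in for the table. Each call to `scan_span` plays the role of one prefix Scan; a real client would run the 10 of them in parallel.

```python
import heapq
import random

NUM_SPANS = 10  # assumption: 10 salt buckets, one per region


def salted_key(timestamp, login, salt=None):
    """Build a row key of the assumed form <2-digit salt>-<timestamp>-<login>."""
    if salt is None:
        salt = random.randrange(NUM_SPANS)  # the "$random % 10" step
    return f"{salt:02d}-{timestamp}-{login}"


def scan_span(table, salt, day):
    """Stand-in for one HBase Scan limited to the prefix <salt>-<day>.

    Rows come back sorted by row key, as a region scan would return them.
    """
    prefix = f"{salt:02d}-{day}"
    return [(key, table[key]) for key in sorted(table) if key.startswith(prefix)]


def scan_day(table, day):
    """Scan every span for one day and merge-sort the results."""
    per_span = [scan_span(table, salt, day) for salt in range(NUM_SPANS)]
    # Compare on the key without its 3-character salt prefix ("NN-"),
    # so the merged stream is ordered by (timestamp, login).
    return list(heapq.merge(*per_span, key=lambda row: row[0][3:]))


# Toy data: a dict stands in for the table.
table = {}
for ts, login in [
    ("20140429T0900", "bob"),
    ("20140428T2300", "eve"),
    ("20140429T0815", "alice"),
    ("20140429T1730", "carol"),
]:
    table[salted_key(ts, login)] = f"event by {login}"

ordered = [key[3:] for key, _ in scan_day(table, "20140429")]
print(ordered)
```

Whatever salt each row happens to get, the merged result for 20140429 comes back in time order, because each span is already sorted and the merge only compares the unsalted part of the key.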

So in terms of performance, this looks "a little bit" faster than your 2^32
randomization.
On Fri, May 2, 2014 at 10:09 PM, Software Dev <[EMAIL PROTECTED]> wrote:
Adrien Mogenet
http://www.borntosegfault.com