HBase >> mail # user >> Can I specify the range inside of fuzzy rule in FuzzyRowFilter?


+
Alex Baranau 2012-08-17, 20:42
+
anil gupta 2012-08-17, 21:34
Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
What row keys are you skipping?

Using your example...
You have a start row of 00000000200 and an end key of \xFF\xFF\xFF\xFF\xFF\xFF00350.
Note that you could also write that end key as \xFF(1..6) 01, since it looks like you're trying to match the 00 in positions 7 and 8 of your numeric string.

I'm assuming that by ? you mean that a character must be present in that position, and that your row key is exactly 11 characters long.

While you may not return all the rows in that range, you still have to check each row key in it, unless I am missing something.
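
To illustrate the point, here is a minimal Python sketch (not HBase code) of a plain range scan over the example keys. The names `in_scan_range` and `action_in_range` are hypothetical, and the end key is treated as inclusive purely for illustration. It shows a key that falls inside the start/end range but whose last five digits are outside 00200..00350, so the row key still has to be inspected.

```python
START_ROW = b"00000000200"        # userid 000000 followed by actionid 00200
END_KEY = b"\xff" * 6 + b"00350"  # six 0xFF bytes followed by actionid 00350


def in_scan_range(row: bytes) -> bool:
    """Byte-wise (lexicographic) comparison, as a range scan would do."""
    return START_ROW <= row <= END_KEY


def action_in_range(row: bytes, low: int = 200, high: int = 350) -> bool:
    """Per-row check on the 5-digit actionid that follows the 6-digit userid."""
    return low <= int(row[6:]) <= high


rows = [b"00000100100", b"00000100250", b"00000200400"]
matches = [r for r in rows if in_scan_range(r) and action_in_range(r)]
```

All three sample keys lie inside the scan range, but only the middle one has an actionid in the wanted interval.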

So what am I missing?

On Aug 17, 2012, at 3:42 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> There was a question [1] in a JIRA comment on
> https://issues.apache.org/jira/browse/HBASE-6509; it makes
> more sense to answer it here.
>
> With the current FuzzyRowFilter I believe the only way to approach the
> problem is to add 150 fuzzy rules to the filter: ??????00200, ??????00201,
> ..., ??????00350.
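
As a rough illustration of the rule list described above, the following Python sketch builds the fuzzy-rule strings for a range of action ids. The helper name `action_range_rules` and the `?` wildcard notation (borrowed from the email) are assumptions for illustration only; the real filter takes byte arrays plus masks rather than strings. Note that counting both endpoints, 00200 through 00350 gives 151 rules, close to the roughly 150 mentioned.

```python
from typing import List


def action_range_rules(first: int, last: int,
                       user_digits: int = 6, action_digits: int = 5) -> List[str]:
    """One rule per action id: wildcards for the userid, fixed digits after."""
    prefix = "?" * user_digits
    return [f"{prefix}{action:0{action_digits}d}"
            for action in range(first, last + 1)]


rules = action_range_rules(200, 350)
```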
>
> As for the performance of this approach, I can say the following:
> * two "checks" happen for each processed row key (i.e. each row key we
> don't skip).
> * the first one performs a simple check of whether the given row key
> satisfies the fuzzy rule, and also determines whether there is a next row
> key to advance to (if this one doesn't satisfy). The check takes at most
> O(n), where n is the length of the fuzzy rule, i.e. it is done in one
> simple loop, which can be broken before all bytes are checked. For m rules
> this is O(m*n).
> * the second one calculates the next row key to provide as a hint for
> fast-forwarding. We again check all rules and take the smallest hint.
> This is also done in one loop, i.e. it is O(m*n) here as well.
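
The two per-row checks described above can be sketched in Python as follows. This is a simplified illustration, not the FuzzyRowFilter implementation: it assumes every row key has exactly the same length as the rule, represents a rule as a tuple in which `None` stands for `?`, and all function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Rule = Tuple[Optional[int], ...]  # None marks a "?" (any byte) position


def parse_rule(text: str) -> Rule:
    """Turn e.g. '??????00200' into byte values and None wildcards."""
    return tuple(None if ch == "?" else ord(ch) for ch in text)


def first_mismatch(row: bytes, rule: Rule) -> int:
    """Index of the first fixed position where row differs, else len(rule)."""
    for i, want in enumerate(rule):
        if want is not None and row[i] != want:
            return i  # the loop stops before all bytes are checked
    return len(rule)


def satisfies_any(row: bytes, rules: Sequence[Rule]) -> bool:
    """First check: O(m*n) for m rules of length n."""
    return any(first_mismatch(row, r) == len(r) for r in rules)


def next_for_rule(row: bytes, rule: Rule) -> Optional[bytes]:
    """Smallest same-length key that is greater than row and satisfies rule."""
    p = first_mismatch(row, rule)
    pivot, value = None, 0
    if p < len(rule) and rule[p] > row[p]:
        # Raising the mismatching fixed byte to the rule's value is enough.
        pivot, value = p, rule[p]
    else:
        # Otherwise bump the right-most wildcard byte before the mismatch.
        for i in range(p - 1, -1, -1):
            if rule[i] is None and row[i] < 0xFF:
                pivot, value = i, row[i] + 1
                break
    if pivot is None:
        return None  # no larger key can satisfy this rule
    # After the pivot: fixed bytes come from the rule, wildcards become 0x00.
    tail = bytes(0 if b is None else b for b in rule[pivot + 1:])
    return row[:pivot] + bytes([value]) + tail


def next_hint(row: bytes, rules: Sequence[Rule]) -> Optional[bytes]:
    """Second check: the smallest hint over all rules, again O(m*n)."""
    hints = [h for h in (next_for_rule(row, r) for r in rules) if h is not None]
    return min(hints) if hints else None


rules = [parse_rule("??????%05d" % a) for a in range(200, 351)]
```

With these rules, a key such as `b"00000100100"` fails the first check, and `next_hint` returns the key the scan could jump to instead of reading every row in between.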
>
> With 150 fuzzy rules of length 11, applying the filter is equivalent to a
> loop of simple checks over 150*11*2 = 3300 elements. This might look like
> a lot, but it can work quite fast. So I'd just try it.
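
The estimate above is a worst-case count of byte comparisons per processed row key. A small Python sketch of that arithmetic follows; the function name `checks_per_row` is hypothetical, and the factor of two reflects the two passes (match check and hint calculation) described in the email.

```python
def checks_per_row(num_rules: int, rule_length: int, passes: int = 2) -> int:
    """Upper bound on byte comparisons for one row key: m * n per pass."""
    return num_rules * rule_length * passes


worst_case = checks_per_row(num_rules=150, rule_length=11)
```

Because each per-rule loop can stop at the first mismatching byte, the real number of comparisons is usually lower than this bound.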
>
> As for an extension that would be more efficient, it makes sense to consider
> implementing it. Let me think more about it and get back to you with a JIRA
> issue :). But I'd suggest you try the existing FuzzyRowFilter first.
> The results (performance) would give us some food for thought, or may
> even turn out to be acceptable for you (hopefully).
>
>> Can I run this kind of filter on HBase0.92 without doing any significant
>> update to the cluster
>
> Until the next release, you'll have to use the FuzzyRowFilter as any other
> custom filter. Just grab the patch from HBASE-6509 and copy the filter. No
> need to patch & rebuild HBase.
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1]
>
> Anil Gupta added a comment - 18/Aug/12 04:37
> Hi Alex,
> I have a question related to this filter. I have a similar filtering
> requirement which would be an extension to FuzzyRowFilter.
> Suppose I have the following rowkey structure: userid_actionid, where
> userid is 6 digits and actionid is 5 digits. I would like to get all
> the rows with an actionid between 00200 and 00350. With the current
> FuzzyRowFilter I can search for all the rows with a particular actionid.
> Instead of searching for a particular actionid, I would like to search for
> a range of actionids.
> Does this use case sound like an extension to the current FuzzyRowFilter?
> Can I run this kind of filter on HBase0.92 without doing any significant
> update to the cluster? If I develop this kind of filter, what is needed to
> run it on all the RSs?
> Thanks,
> Anil
+
Alex Baranau 2012-08-18, 19:13
+
anil gupta 2012-08-18, 21:02
+
Alex Baranau 2012-08-20, 20:07
+
anil gupta 2012-08-22, 06:18
+
Alex Baranau 2012-08-22, 22:41