HBase, mail # user - Can I specify the range inside of fuzzy rule in FuzzyRowFilter?


Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?
anil gupta 2012-08-17, 21:34
Hi Alex,

Thanks for the answer. I have successfully compiled the FuzzyRowFilter class
against HBase 0.92. To try out FuzzyRowFilter, I'll need to make some changes
to my row key, so I'll get back to you with performance numbers after
loading the data and trying out FuzzyRowFilter for a particular value.

The range example I gave in my original post is very small. In my real use
case the range can span from 0 to 31536000. So, in my opinion, using the
current FuzzyRowFilter with one rule per value might not be a good idea. I
agree with you that an extension is the right way to solve this.

Here is my real use case:
I have a table in which I store events from customers, keyed by
customerid+timestamp.
Sample query: I want to get all the events that happened in the last month.
Currently possible solutions:
1. I can do this with a filter that checks the value of the "timestamp"
column. I think this will be highly inefficient.
2. The other way I can think of is to use RegexStringComparator with
RowFilter to get all the rows within a certain numeric range of timestamps.
In this case, too, every row key of the table will be checked.

So, the best way is to use something like FuzzyRowFilter with a
range. My range will always be numeric, and this could be really handy
for others who store a timestamp in the row key and want to run time-based
queries on the row key.
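
To make the idea concrete, here is a minimal sketch of the per-row-key test that a range-aware filter would need. It is plain Java with no HBase dependency. The class name, the 6-digit customer id, and the 10-digit zero-padded timestamp are assumptions made for illustration, not my actual schema and not part of FuzzyRowFilter. It shows only the predicate; the fast-forward hint that makes FuzzyRowFilter efficient is left out.

```java
import java.nio.charset.StandardCharsets;

class RowKeyRangeSketch {
    // Assumed layout (hypothetical): 6 ASCII digits of customer id,
    // followed by the timestamp as 10 zero-padded ASCII digits.
    static final int ID_LEN = 6;
    static final int TS_LEN = 10;

    /** True if the timestamp part of rowKey is a number in [lo, hi], inclusive. */
    static boolean timestampInRange(byte[] rowKey, long lo, long hi) {
        if (rowKey.length != ID_LEN + TS_LEN) {
            return false; // not the layout assumed above
        }
        long ts = 0;
        for (int i = ID_LEN; i < rowKey.length; i++) {
            int digit = rowKey[i] - '0';
            if (digit < 0 || digit > 9) {
                return false; // non-numeric byte in the timestamp part
            }
            ts = ts * 10 + digit;
        }
        return lo <= ts && ts <= hi;
    }

    public static void main(String[] args) {
        // Customer 000042, timestamp 15768000, tested against 0..31536000.
        byte[] key = "0000420015768000".getBytes(StandardCharsets.US_ASCII);
        System.out.println(timestampInRange(key, 0L, 31536000L));
    }
}
```

Applied on its own, this test still visits every row key, which is the drawback of the two solutions listed above. The gain would come from combining it with a next-row-key hint.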

Thanks,
Anil Gupta
On Fri, Aug 17, 2012 at 1:42 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:

> There was a question [1] in a JIRA comment
> (https://issues.apache.org/jira/browse/HBASE-6509); it makes
> more sense to answer it here.
>
> With the current FuzzyRowFilter I believe the only way to approach the
> problem is to add about 150 fuzzy rules to the filter, one per actionid:
> ??????00200, ??????00201, ..., ??????00350.
>
> As for the performance of this approach, I can say the following:
> * there are two "checks" happening for each processed row key (i.e. those
> row keys we don't skip)
> * the first one performs a simple check of whether the given row key
> satisfies the fuzzy rule and also determines whether there is a next row
> key to advance to (if this one doesn't satisfy it). The check takes at most
> O(n), where n is the length of the fuzzy rule, i.e. it is done in one simple
> loop, which can be exited before all bytes are checked. For m rules this
> will be O(m*n).
> * the second one calculates the next row key to provide as a hint for
> fast-forwarding. We again check all rules and find the smallest hint.
> This operation is also done in one loop, i.e. O(m*n) here as well.
>
> With 150 fuzzy rules of length 11, applying the filter is equivalent to a
> loop of simple checks through 150*11*2 = 3300 elements. This might look
> like a lot, but it can work quite fast. So I'd just try it.
>
> As for an extension that would be more efficient, it makes sense to
> consider implementing it. Let me think more about it and get back to you
> with a JIRA issue :). But I'd suggest you try the existing FuzzyRowFilter
> first. The results (performance) would give us some food for thought, or
> may even turn out to be acceptable for you (hopefully).
>
> > Can I run this kind of filter on HBase 0.92 without doing any significant
> > update to the cluster
>
> Until the next release, you'll have to use FuzzyRowFilter as you would any
> other custom filter. Just grab the patch from HBASE-6509 and copy the
> filter. No need to patch & rebuild HBase.
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1]
>
> Anil Gupta added a comment - 18/Aug/12 04:37
> Hi Alex,
> I have a question related to this filter. I have a similar filtering
> requirement which would be an extension to FuzzyRowFilter.
> Suppose I have the following row key structure: userid_actionid, where
> userid is 6 digits and actionid is 5 digits. I would like to get all
> the rows with an actionid between 00200 and 00350. With the current
> FuzzyRowFilter I can search for all the rows with a particular actionid.
> Instead of searching for a particular actionid, I would like to search for
> a range of actionids.

Thanks & Regards,
Anil Gupta
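
The one-rule-per-actionid approach from Alex's reply can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the HBASE-6509 code: the `FuzzyRuleSketch` and `Rule` names are hypothetical, and the mask convention (1 = any byte, 0 = byte must match) follows my understanding of the patch and should be checked against it. The sketch covers only the first "check" (does a row key satisfy some rule); computing the next-row-key hint is omitted.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FuzzyRuleSketch {
    static final int USER_ID_LEN = 6; // leading userid digits, left fuzzy

    /** A rule is a key plus a mask: mask 1 = any byte, mask 0 = must equal key. */
    static final class Rule {
        final byte[] key;
        final byte[] mask;

        Rule(byte[] key, byte[] mask) {
            this.key = key;
            this.mask = mask;
        }
    }

    /** Builds one rule per actionid in [from, to]: ??????00200, ??????00201, ... */
    static List<Rule> rulesForActionRange(int from, int to) {
        List<Rule> rules = new ArrayList<>();
        for (int action = from; action <= to; action++) {
            String text = "??????" + String.format(Locale.ROOT, "%05d", action);
            byte[] key = text.getBytes(StandardCharsets.US_ASCII);
            byte[] mask = new byte[key.length]; // zero-filled: every byte fixed
            for (int i = 0; i < USER_ID_LEN; i++) {
                mask[i] = 1; // userid positions accept any byte
            }
            rules.add(new Rule(key, mask));
        }
        return rules;
    }

    /** O(n) check of one rule; the loop exits at the first fixed-byte mismatch. */
    static boolean satisfies(byte[] rowKey, Rule rule) {
        if (rowKey.length < rule.key.length) {
            return false; // simplification of this sketch: short keys never match
        }
        for (int i = 0; i < rule.key.length; i++) {
            if (rule.mask[i] == 0 && rowKey[i] != rule.key[i]) {
                return false;
            }
        }
        return true;
    }

    /** O(m*n) check across all m rules. */
    static boolean satisfiesAny(byte[] rowKey, List<Rule> rules) {
        for (Rule rule : rules) {
            if (satisfies(rowKey, rule)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Rule> rules = rulesForActionRange(200, 350);
        byte[] rowKey = "12345600275".getBytes(StandardCharsets.US_ASCII);
        System.out.println(rules.size() + " rules, match: " + satisfiesAny(rowKey, rules));
    }
}
```

Counting both endpoints, 00200 through 00350 gives 151 rules, close to the 150 quoted in the thread. With HBase on the classpath, each key/mask pair would be handed to the FuzzyRowFilter constructor; the exact signature should be taken from the patch.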