HBase, mail # user - Custom Filter and SEEK_NEXT_USING_HINT issue


Re: Custom Filter and SEEK_NEXT_USING_HINT issue
Eugeny Morozov 2013-01-21, 08:16
Finally, the mystery has been solved.

A small remark before I explain everything.

The situation with only one region is exactly the same:
Fzzy: AAAA1Q7iQ9JA
Next fzzy: F7dtxwqVQ_Pw  <-- the value I'm trying to find.
Fzzy: F7dt8QWPSIDw
Somehow FuzzyRowFilter simply skipped my value here.
So, here is the explanation.
In the javadoc for FuzzyRowFilter, a question mark is used as the
placeholder for an unknown byte. Of course, any value can be used there
instead, including zero.
For quite some time we used text literals to encode our keys, like the ones
you've already seen: AAAA1Q7iQ9JA or F7dt8QWPSIDw. But that is the Base64
form of just 8 bytes, which takes 1.5 times more space, so we decided to
store the raw version: just byte[8]. Unfortunately, the symbol '?' (0x3F)
sits exactly in the middle of the ASCII table (http://www.asciitable.com/),
which means that with raw bytes FuzzyRowFilter skips half of the values in
some cases. At the same time, the question mark sorts just before every
letter that could be used in a literal key.
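
To illustrate the ordering problem, here is a minimal sketch in plain Java (no HBase dependency). It assumes, for illustration only, that a seek target carries the placeholder byte in every unknown position; the class name, the seekTarget helper, and the sample row bytes are all made up and are not taken from the filter's code.

```java
import java.util.Arrays;

class PlaceholderOrderDemo {
    // Compare two keys byte by byte as unsigned values (0x00..0xFF),
    // which is how HBase orders row keys.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Hypothetical helper: a seek target made of a known prefix followed by
    // the same filler byte in every unknown position.
    static byte[] seekTarget(byte[] knownPrefix, int length, byte filler) {
        byte[] key = Arrays.copyOf(knownPrefix, length);
        Arrays.fill(key, knownPrefix.length, length, filler);
        return key;
    }

    public static void main(String[] args) {
        byte[] prefix = {0x17, (byte) 0xB7};
        // Made-up raw 8-byte row; its third byte (0x10) is smaller than '?' (0x3F).
        byte[] row = {0x17, (byte) 0xB7, 0x10, 0x22, 0x33, 0x44, 0x55, 0x66};

        byte[] withQuestionMark = seekTarget(prefix, 8, (byte) '?');
        byte[] withZero = seekTarget(prefix, 8, (byte) 0);

        // true: the '?' target sorts after the row, so seeking to it jumps past the row.
        System.out.println(compareUnsigned(withQuestionMark, row) > 0);
        // true: the zero target sorts before the row, so the row stays reachable.
        System.out.println(compareUnsigned(withZero, row) <= 0);
    }
}
```

With Base64 literals every key character that is a letter is above 0x3F, so a '?' filler never sorts after such a character; with raw bytes, any value below 0x3F in an unknown position ends up behind the target.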

We do have integration tests, but by coincidence none of them contained
such an example.

So, as a piece of advice: always use zero instead of a question mark for
FuzzyRowFilter.
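
As a rough sketch of that advice, the key and mask arrays can be built so that unknown positions hold 0x00 rather than '?'. The class and helper names below are hypothetical, and the mask convention (0 = fixed byte, 1 = any byte) is assumed from the FuzzyRowFilter javadoc; the code only builds the arrays and does not call HBase.

```java
import java.util.Arrays;

class FuzzyPatternSketch {
    // Mask convention assumed from the FuzzyRowFilter javadoc:
    // 0 = byte must match, 1 = byte can be anything.
    static final byte FIXED = 0;
    static final byte ANY = 1;

    // Hypothetical helper: returns {key, mask} for rows of rowLength bytes
    // in which only the bytes starting at fixedOffset are known.
    static byte[][] pattern(int rowLength, int fixedOffset, byte[] fixedBytes) {
        byte[] key = new byte[rowLength];   // unknown positions stay 0x00, not '?'
        byte[] mask = new byte[rowLength];
        Arrays.fill(mask, ANY);
        System.arraycopy(fixedBytes, 0, key, fixedOffset, fixedBytes.length);
        Arrays.fill(mask, fixedOffset, fixedOffset + fixedBytes.length, FIXED);
        return new byte[][] {key, mask};
    }

    public static void main(String[] args) {
        // Example: 8-byte rows where only bytes 4 and 5 are known.
        byte[][] p = pattern(8, 4, new byte[] {0x0A, 0x0B});
        System.out.println(Arrays.toString(p[0])); // [0, 0, 0, 0, 10, 11, 0, 0]
        System.out.println(Arrays.toString(p[1])); // [1, 1, 1, 1, 0, 0, 1, 1]
    }
}
```

In an HBase client, the two arrays would presumably be wrapped in a Pair and passed in the list given to the FuzzyRowFilter constructor.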

Thanks to everyone!

P.S. The question about region scanning order still stands. I do not
understand why, with FuzzyRowFilter, the scan goes from one region to the
next until it stops at the value. I would expect that if scanning started on
all regions at once, the log files would contain at least one value per
region. Instead, I found one value per region only for the regions that come
before the one holding the value.
On Mon, Jan 21, 2013 at 4:22 AM, Michael Segel <[EMAIL PROTECTED]> wrote:

> If it's the same class and it's not a patch, then the first class loaded
> wins.
>
> So if you have a class Foo and HBase has a class Foo, your code will never
> see the light of day.
>
> Perhaps I'm stating the obvious, but it's something to think about when
> working with Hadoop.
>
> On Jan 19, 2013, at 3:36 AM, Eugeny Morozov <[EMAIL PROTECTED]>
> wrote:
>
> > Ted,
> >
> > that is correct.
> > It is HBase 0.92.x, and we use part of the patch from HBASE-6509.
> >
> > I use the filter as a custom filter; it lives in a separate jar file that
> > goes on HBase's classpath. I did not patch HBase.
> > Moreover, I do not use the protobuf descriptions that come with the filter
> > in the patch. I have only two classes: FuzzyRowFilter itself and its test
> > class.
> >
> > And it works perfectly on a small dataset of about 100 rows (1 region).
> > But when my dataset has more than 10 million rows (260 regions), it
> > somehow starts losing rows. I'm not sure, but it seems to me it is not the
> > fault of the filter.
> >
> >
> > On Sat, Jan 19, 2013 at 3:56 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> >> To my knowledge CDH-4.1.2 is based on HBase 0.92.x
> >>
> >> Looks like you were using the patch from HBASE-6509, which was
> >> integrated into trunk only.
> >> Please confirm.
> >>
> >> Copying Alex who wrote the patch.
> >>
> >> Cheers
> >>
> >> On Fri, Jan 18, 2013 at 3:28 PM, Eugeny Morozov
> >> <[EMAIL PROTECTED]> wrote:
> >>
> >>> Hi, folks!
> >>>
> >>> HBase, Hadoop, etc version is CDH-4.1.2
> >>>
> >>> I'm using a custom FuzzyRowFilter, which I got from
> >>> http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
> >>> and suddenly, after quite some time, we found that it started losing
> >>> data.
> >>>
> >>> Basically, the idea of FuzzyRowFilter is that it tries to find the key
> >>> that has been provided, and if there is no such key but more keys exist
> >>> in the table, it returns SEEK_NEXT_USING_HINT. In getNextKeyHint(...) it
> >>> builds the required key. As I understand it, HBase will then
> >>> fast-forward to that key, which should be similar or identical to
> >>> running a Scan with setStartRow.
> >>>
> >>> I'm trying to find the key F7dt8QWPSIDw. It is definitely in HBase: I'm
> >>> able to get it using Scan.setStartRow.
> >>> For FuzzyFilter I'm using empty Scan - I didn't specify start row, stop
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
[EMAIL PROTECTED]