HBase, mail # user - Custom Filter and SEEK_NEXT_USING_HINT issue


RE: Custom Filter and SEEK_NEXT_USING_HINT issue
Anoop Sam John 2013-01-21, 08:59
> I suppose if the scanning process had started at once on
> all regions, then I would find in the log files at least one value per
> region, but I have found one value per region only for those regions that
> reside before the particular one.

@Eugeny - FuzzyRowFilter, like any other filter, works on the server side. A scan from the client side is sequential: it starts at the first region (the region with the empty start key, or the region that contains the start key you set in your Scan). The client sends a request to the region server to scan that region. Once that region is exhausted, the client contacts the next region, and so on. There is no parallel scanning of multiple regions from the client side. [This is the case when using the HTable scan APIs.]

When MapReduce is used for scanning, the scans run in parallel across all the regions, with one mapper per region. A normal scan from the client side, however, goes through the regions sequentially, not in parallel.
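
To make the sequential behaviour concrete, here is a toy model in plain Java. It is not HBase client code: the class `SequentialScanDemo`, the helper names, and the sample regions are all made up for illustration. Regions are modelled as a sorted map from start key to the rows they hold, and the "client" contacts them one at a time in start-key order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SequentialScanDemo {
    // Hypothetical toy table: region start key -> row keys stored in that region.
    static TreeMap<String, List<String>> sampleRegions() {
        TreeMap<String, List<String>> regions = new TreeMap<>();
        regions.put("", Arrays.asList("a1", "a2"));   // first region has the empty start key
        regions.put("g", Arrays.asList("g1"));
        regions.put("m", Arrays.asList("m1", "m2"));
        regions.put("t", Arrays.asList("t1"));
        return regions;
    }

    // Returns the start keys of the regions the client contacted, in order.
    static List<String> regionsContacted(TreeMap<String, List<String>> regions,
                                         String startRow, String wantedRow) {
        List<String> contacted = new ArrayList<>();
        String first = regions.floorKey(startRow);    // region that contains startRow
        if (first == null) {
            first = regions.firstKey();
        }
        for (Map.Entry<String, List<String>> region : regions.tailMap(first, true).entrySet()) {
            contacted.add(region.getKey());           // one region at a time, never in parallel
            if (region.getValue().contains(wantedRow)) {
                break;                                // caller stops reading once the row is found
            }
        }
        return contacted;
    }

    public static void main(String[] args) {
        System.out.println(regionsContacted(sampleRegions(), "", "m1"));
    }
}
```

Running `main` prints `[, g, m]`: the region starting at "t" is never contacted. Under this model, server-side log output would appear only for the regions up to and including the one that holds the wanted row.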

-Anoop-
________________________________________
From: Eugeny Morozov [[EMAIL PROTECTED]]
Sent: Monday, January 21, 2013 1:46 PM
To: [EMAIL PROTECTED]
Cc: Alex Baranau
Subject: Re: Custom Filter and SEEK_NEXT_USING_HINT issue

Finally, the mystery has been solved.

Small remark before I explain everything.

The situation with only one region is exactly the same:
Fzzy: AAAA1Q7iQ9JA
Next fzzy: F7dtxwqVQ_Pw  <-- the value I'm trying to find.
Fzzy: F7dt8QWPSIDw
Somehow FuzzyRowFilter just skipped my value here.
So, the explanation.
In the javadoc for FuzzyRowFilter a question mark is used as the placeholder
for an unknown value. Of course it's possible to use anything, including
zero, instead of the question mark.
For quite some time we used literals to encode our keys, literals like the
ones you've already seen: AAAA1Q7iQ9JA or F7dt8QWPSIDw. But that's the Base64
form of just 8 bytes, which requires 1.5 times more space. So we decided to
store the raw version, just byte[8]. Unfortunately the symbol '?' (0x3F) sits
in the middle of the ASCII table (see http://www.asciitable.com/), so raw
bytes can sort before it, which means that with FuzzyRowFilter we skip half
of the values in some cases. At the same time, the question mark comes
before any letter that could be used in a key.

We do have integration tests; it's just a coincidence that they don't
contain such an example.

So, my advice: always use zero instead of the question mark for
FuzzyRowFilter.
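
A small sketch of why the fill byte matters, in plain Java. This is a simplified model, not the real FuzzyRowFilter code: it assumes the seek hint keeps whatever byte the fuzzy key carries in the unknown positions, and the class `FuzzyFillDemo`, its helpers, and the raw 8-byte key are made up for illustration. Row keys are compared as unsigned bytes, the same order HBase sorts rows in.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FuzzyFillDemo {
    // Unsigned lexicographic comparison of two row keys.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    // Hypothetical hint: the fixed prefix followed by the fill byte in every unknown position.
    static byte[] hint(byte[] fixedPrefix, byte fill, int width) {
        byte[] key = new byte[width];
        Arrays.fill(key, fill);
        System.arraycopy(fixedPrefix, 0, key, 0, fixedPrefix.length);
        return key;
    }

    // A scanner that seeks forward to the hint never sees rows sorting before it.
    static boolean skipped(byte[] row, byte[] hint) {
        return compareUnsigned(row, hint) < 0;
    }

    public static void main(String[] args) {
        // Made-up raw 8-byte key: its third byte (0x10) is smaller than '?' (0x3F).
        byte[] rawPrefix = {0x17, (byte) 0xB7};
        byte[] rawRow = {0x17, (byte) 0xB7, 0x10, 0x05, 0x7A, 0x01, 0x3E, 0x00};
        System.out.println(skipped(rawRow, hint(rawPrefix, (byte) '?', 8)));  // true
        System.out.println(skipped(rawRow, hint(rawPrefix, (byte) 0, 8)));    // false

        // Base64 literal from this thread: 'x' sorts after '?', so nothing is lost.
        byte[] b64Prefix = "F7dt".getBytes(StandardCharsets.US_ASCII);
        byte[] b64Row = "F7dtxwqVQ_Pw".getBytes(StandardCharsets.US_ASCII);
        System.out.println(skipped(b64Row, hint(b64Prefix, (byte) '?', 12))); // false
    }
}
```

With zero in the unknown positions no row sharing the fixed prefix can sort before the hint, so a seek to it cannot jump over a matching row.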

Thanks to everyone!

P.S. But the question about region scanning order still stands. I do not
understand why with FuzzyRowFilter the scan goes from one region to another
until it stops at the value. I suppose if the scanning process had started
at once on all regions, then I would find in the log files at least one
value per region, but I have found one value per region only for those
regions that reside before the particular one.

On Mon, Jan 21, 2013 at 4:22 AM, Michael Segel <[EMAIL PROTECTED]> wrote:

> If it's the same class and it's not a patch, then the first class loaded
> wins.
>
> So if you have a Class Foo and HBase has a Class Foo, your code will never
> see the light of day.
>
> Perhaps I'm stating the obvious, but it's something to think about when
> working with Hadoop.
>
> On Jan 19, 2013, at 3:36 AM, Eugeny Morozov <[EMAIL PROTECTED]>
> wrote:
>
> > Ted,
> >
> > that is correct.
> > HBase 0.92.x and we use part of the patch 6509.
> >
> > I use the filter as a custom filter; it lives in a separate jar file and
> > goes on HBase's classpath. I did not patch HBase.
> > Moreover, I do not use the protobuf descriptions that come with the
> > filter in the patch. I have only two classes: FuzzyRowFilter itself and
> > its test class.
> >
> > And it works perfectly on a small dataset like 100 rows (1 region). But
> > when my dataset is more than 10 million rows (260 regions), it somehow
> > loses rows. I'm not sure, but it seems to me it is not the fault of the
> > filter.
> >
> >
> > On Sat, Jan 19, 2013 at 3:56 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> >> To my knowledge CDH-4.1.2 is based on HBase 0.92.x
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
[EMAIL PROTECTED]