Custom Filter and SEEK_NEXT_USING_HINT issue
Eugeny Morozov 2013-01-18, 23:28
Hi, folks!

HBase, Hadoop, etc. version is CDH-4.1.2.

I'm using a custom FuzzyRowFilter, which I got from
http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
and suddenly, after quite some time, we found that it starts losing data.

Basically, the idea of FuzzyRowFilter is that it tries to find the key that has been provided, and if there is no such key but more keys exist in the table, it returns SEEK_NEXT_USING_HINT. And in getNextKeyHint(...) it builds the required key. As I understand it, HBase will then fast-forward to that key, which should be similar or identical to running a Scan with setStartRow.

I'm trying to find key F7dt8QWPSIDw. It is definitely in HBase: I'm able to get it using Scan.setStartRow. For the FuzzyFilter I'm using an empty Scan; I didn't specify a start row, stop row or anything related.

This is what happens:

Fzzy: AAAA1Q7iQ9JA
Next fzzy: F7dtxwqVQ_Pw
Fzzy: AQAAnA96rxTg
Next fzzy: F7dtxwqVQ_Pw
Fzzy: AgAADQWPSIDw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: AwAA-Q33Zb9Q
Next fzzy: F7dtxwqVQ_Pw
Fzzy: BAAAOg8oyu7A
Next fzzy: F7dtxwqVQ_Pw
Fzzy: BQAA9gqVQrTw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: BgABZQ7iQ9JA
Next fzzy: F7dtxwqVQ_Pw
Fzzy: BwAAbgrpAojg
Next fzzy: F7dtxwqVQ_Pw
Fzzy: CAAAUQWPSIDw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: CQABVgqVQrTw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: CgAAOQ7iQ9JA
Next fzzy: F7dtxwqVQ_Pw
Fzzy: CwAALwqVQrTw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: DAAAMwWPSIDw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: DQAADgjqzsIQ
Next fzzy: F7dtxwqVQ_Pw
Fzzy: DgAAOgCcWv9g
Next fzzy: F7dtxwqVQ_Pw
Fzzy: DwAAKg7iQ9JA
Next fzzy: F7dtxwqVQ_Pw
Fzzy: EAAAugqVQrTw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: EQAAJAqVQrTw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: EgAABgIOMBgg
Next fzzy: F7dtxwqVQ_Pw
Fzzy: EwAAEwqVQrTw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: FAAACQqVQrTw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: FQAAIAqVQrTw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: FgAAeAWPSIDw
Next fzzy: F7dtxwqVQ_Pw
Fzzy: FwAAAw33Zb9Q
Next fzzy: F7dtxwqVQ_Pw
Fzzy: F7dt8QWPSIDw

It's obvious that my FuzzyRowFilter knows what to search for, and every time it repeats its question. The very first key, I suppose, is just the first key of the region where my key is located. The very last key is the key that is already bigger than what I'm trying to find; that's the reason why FuzzyFilter stopped there.

Do you know of any issue with SEEK_NEXT_USING_HINT? I've searched, but unsuccessfully. Do you have any idea how to explain these many trials?

Thanks in advance.
--
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
[EMAIL PROTECTED]
Re: Custom Filter and SEEK_NEXT_USING_HINT issue
Ted Yu 2013-01-18, 23:56
To my knowledge, CDH-4.1.2 is based on HBase 0.92.x.

Looks like you were using the patch from HBASE-6509, which was integrated into trunk only. Please confirm.

Copying Alex, who wrote the patch.

Cheers
Re: Custom Filter and SEEK_NEXT_USING_HINT issue
Eugeny Morozov 2013-01-19, 09:36
Ted,

that is correct. HBase 0.92.x, and we use part of the patch from HBASE-6509.

I use the filter as a custom filter; it lives in a separate jar file that goes on HBase's classpath. I did not patch HBase. Moreover, I do not use the protobuf descriptions that come with the filter in the patch. I have only two classes: FuzzyRowFilter itself and its test class.

It works perfectly on a small dataset like 100 rows (1 region). But when my dataset is more than 10 million rows (260 regions), it somehow loses rows. I'm not sure, but it seems to me it is not the fault of the filter.
Re: Custom Filter and SEEK_NEXT_USING_HINT issue
Ted 2013-01-19, 13:16
In your original email you said the first key looked like the start key of a region. Can you verify that?

Thanks
Re: Custom Filter and SEEK_NEXT_USING_HINT issue
Eugeny Morozov 2013-01-20, 21:22
Ted, thanks for the question. Here are the results of my investigation.

It seems I was mistaken. I thought that scanners are assigned to each region (and scan in parallel), which would mean each scanner starts from the beginning of its region and then falls through to the required record.

But currently we have 256 splits in the table, by the first byte of the values:

start - end
NA - \x01
\x01 - \x02
\x02 - \x03
...
\xFE - \xFF
\xFF - NA

And it turns out that the values I've seen are values from different regions, except the last two values, which both reside in one region:

AAAA1Q7iQ9JA : [0 <-- that's the value's first byte (meaning a particular region here)
AQAAnA96rxTg : [1
AgAADQWPSIDw : [2
...
EwAAEwqVQrTw : [19
FAAACQqVQrTw : [20
FQAAIAqVQrTw : [21
FgAAeAWPSIDw : [22
FwAAAw33Zb9Q : [23
F7dt8QWPSIDw : [23

1. I still don't get why it skips the required value.
2. The only explanation I've found for such output is that the scan searches regions one by one until it finds the value. Should it be so? Shouldn't it start from the beginning (if there is no setStartRow), in parallel for all regions at once, and in a second step (after the filter's getHint method) know exactly where to go?
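The key-to-region mapping in the message above can be checked by hand. The following is a minimal sketch, assuming the logged keys are URL-safe Base64 (they contain '-' and '_') and that the table is split on the first raw byte, as the message describes. The class and method names are made up for illustration and are not part of HBase.

```java
import java.util.Base64;

final class RegionByteSketch {
    /** Decodes a URL-safe Base64 key and returns its first raw byte as an unsigned value. */
    static int firstByte(String urlSafeBase64Key) {
        byte[] raw = Base64.getUrlDecoder().decode(urlSafeBase64Key);
        return raw[0] & 0xFF;
    }

    public static void main(String[] args) {
        // Keys taken from the log in this thread.
        String[] keys = {"AAAA1Q7iQ9JA", "EwAAEwqVQrTw", "FwAAAw33Zb9Q", "F7dt8QWPSIDw"};
        for (String key : keys) {
            System.out.println(key + " : first byte " + firstByte(key));
        }
    }
}
```

With the four keys above this yields first bytes 0, 19, 23 and 23, matching the list in the message: the last two logged keys fall into the same region.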
Re: Custom Filter and SEEK_NEXT_USING_HINT issue
Michael Segel 2013-01-21, 00:22
If it's the same class and it's not a patch, then the first class loaded wins.

So if you have a class Foo and HBase has a class Foo, your code will never see the light of day.

Perhaps I'm stating the obvious, but it's something to think about when working with Hadoop.
Re: Custom Filter and SEEK_NEXT_USING_HINT issue
Eugeny Morozov 2013-01-21, 08:16
Finally, the mystery has been solved.

A small remark before I explain everything. The situation with only one region is exactly the same:

Fzzy: AAAA1Q7iQ9JA
Next fzzy: F7dtxwqVQ_Pw <-- the value I'm trying to find.
Fzzy: F7dt8QWPSIDw

Somehow FuzzyRowFilter has just omitted my value here.

So, the explanation. In the javadoc for FuzzyRowFilter, a question mark is used as the substitute for an unknown value. Of course it's possible to use anything, including zero, instead of the question mark. For quite some time we used literals to encode our keys, literals like the ones you've seen already: AAAA1Q7iQ9JA or F7dt8QWPSIDw. But that's the Base64 form of just 8 bytes, which requires 1.5 times more space. So we decided to store the raw version, just byte[8]. Unfortunately, the symbol '?' (0x3F) is exactly in the middle of the byte range according to the ASCII table (http://www.asciitable.com/), which means that with FuzzyRowFilter we skip half of the values in some cases. At the same time, the question mark comes right before any letter that could be used in a key.

We do have integration tests; it's just a coincidence that we had no such example in there.

So, my advice: always use zero instead of a question mark for FuzzyRowFilter.

Thanks to everyone!

P.S. But the question about region scanning order still stands. I do not understand why with FuzzyFilter it goes from one region to another until it stops at the value. I suppose that if the scanning process had started at once on all regions, I would find at least one value per region in the log files, but I have found one value per region only for those regions that reside before the particular one.
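To make the placeholder problem from the message above concrete, here is a minimal sketch. It is not the FuzzyRowFilter source. It assumes, following the explanation in this thread, that the seek hint ends up carrying the template's placeholder byte in the unknown positions. The class, the method names and the sample keys are hypothetical. HBase orders row keys as unsigned bytes, so a hint padded with '?' (0x3F) sorts after any raw key whose byte at that position is below 0x3F.

```java
import java.util.Arrays;

final class PlaceholderSketch {
    /** Unsigned lexicographic comparison, the order HBase uses for row keys. */
    static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Hypothetical seek hint: the fixed prefix followed by `wildcards` filler bytes. */
    static byte[] hint(byte[] fixedPrefix, int wildcards, byte filler) {
        byte[] out = Arrays.copyOf(fixedPrefix, fixedPrefix.length + wildcards);
        Arrays.fill(out, fixedPrefix.length, out.length, filler);
        return out;
    }

    /** A seek to `hint` jumps over every row that sorts strictly before it. */
    static boolean isSkipped(byte[] row, byte[] hint) {
        return compareRows(row, hint) < 0;
    }

    public static void main(String[] args) {
        byte[] prefix = {0x17};
        byte[] rawRow = {0x17, 0x05, 0x01, 0x02};   // raw key with small bytes
        byte[] textRow = {0x17, 'A', 'B', 'C'};     // letters all sort after '?'

        System.out.println(isSkipped(rawRow, hint(prefix, 3, (byte) '?')));  // true
        System.out.println(isSkipped(textRow, hint(prefix, 3, (byte) '?'))); // false
        System.out.println(isSkipped(rawRow, hint(prefix, 3, (byte) 0)));    // false
    }
}
```

In this model the letter-based key survives a '?'-padded hint while the raw-byte key is jumped over, and a zero-padded hint skips neither, which is in line with the advice to use zero as the placeholder.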
Re: Custom Filter and SEEK_NEXT_USING_HINT issue
ramkrishna vasudevan 2013-01-21, 08:56
On Mon, Jan 21, 2013 at 1:46 PM, Eugeny Morozov <[EMAIL PROTECTED]> wrote:
> I do not understand why with FuzzyFilter it goes from one region to another until it stops at the value. I suppose if scanning process has started at once on all regions

The scanning process does not start in parallel on all regions. Once a start row is specified with the scan, the corresponding region server is picked, and on that region server the scan starts from the region which holds the start row and proceeds until it finds the stop row. The stop row can be in any of the regions on the same region server, in exact increasing byte order.

Regards
Ram
RE: Custom Filter and SEEK_NEXT_USING_HINT issue
Anoop Sam John 2013-01-21, 08:59
> I suppose if scanning process has started at once on all regions, then I would find in log files at least one value per region, but I have found one value per region only for those regions, that reside before the particular one.

@Eugeny - FuzzyFilter, like any other filter, works on the server side. Scanning from the client side is sequential, starting from the first region (the region with an empty start key, or the region which contains the start key you specified in your scan). From the client, a request goes to the RS to scan a region. Once that region is finished, the next region is contacted for the scan (from the client), and so on. There is no parallel scanning of multiple regions from the client side. [This is when using the HTable scan APIs.]

When MR is used for scanning, we do parallel scans of all the regions; there we have a mapper per region. But a normal scan from the client side is sequential over the regions, not parallel.

-Anoop-
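The region-by-region behaviour described in this reply can be modelled in a few lines. This is a toy simulation under stated assumptions, not HBase code: row keys are plain integers, each region is a sorted set, and the filter always answers with a seek hint equal to the target. All names here are hypothetical. It shows why the filter logs exactly one row for every region that precedes the target's region.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class SequentialScanSketch {
    /** Returns the rows the filter gets to see, in the order it sees them. */
    static List<Integer> rowsSeenByFilter(List<TreeSet<Integer>> regions, int target) {
        List<Integer> seen = new ArrayList<>();
        for (TreeSet<Integer> region : regions) {   // the client visits regions one by one
            if (region.isEmpty()) {
                continue;
            }
            Integer first = region.first();         // a region scan starts at its first row
            seen.add(first);
            if (first >= target) {
                break;                              // already at or past the target
            }
            // The filter answers with a seek hint; the seek stays inside this region.
            Integer next = region.ceiling(target);
            if (next != null) {
                seen.add(next);                     // the hint landed in this region
                break;
            }
            // Hint lies beyond this region: the region is done, move to the next one.
        }
        return seen;
    }

    public static void main(String[] args) {
        List<TreeSet<Integer>> regions = new ArrayList<>();
        for (int r = 0; r < 4; r++) {
            TreeSet<Integer> rows = new TreeSet<>();
            rows.add(r * 100 + 5);
            rows.add(r * 100 + 50);
            regions.add(rows);
        }
        System.out.println(rowsSeenByFilter(regions, 250)); // [5, 105, 205, 250]
    }
}
```

The output has the same shape as the log earlier in the thread: the first row of each earlier region, then the first row of the target's region, then the row the hint seeks to.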
Re: Custom Filter and SEEK_NEXT_USING_HINT issue
Eugeny Morozov 2013-01-21, 11:44
Anoop, Ramkrishna,

Thank you for the explanation! I've got it.