-
setTimeRange and setMaxVersions seem to be inefficient
Jerry Lam, 2012-08-27, 21:40
Hi HBase community:

I tried to use setTimeRange and setMaxVersions to limit the number of KVs returned per column. The behaviour is as I would expect: setTimeRange(0, T + 1) with setMaxVersions(1) gives me ONE version of the KV, with a timestamp less than or equal to T.

However, I noticed that all versions of the KeyValue for a particular column are processed through a custom filter I implemented, even though I specify setMaxVersions(1) and setTimeRange(0, T + 1). I expected that once ONE KV of a particular column gets ReturnCode.INCLUDE, the framework would jump to the next column instead of iterating through all versions of that column.

Can someone confirm whether this is the expected behaviour (iterating through all versions of a column before setMaxVersions takes effect)? If it is, what is your recommendation to speed this up?

Best Regards,
Jerry
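The selection rule described in this message can be sketched in plain Java. This is a toy model, not HBase code: the class `TimeRangeModel` and its `select` method are hypothetical names, and the only HBase facts assumed are that versions of a column are ordered newest first and that the upper bound of a time range is exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: these names are hypothetical and are not HBase classes.
class TimeRangeModel {
    /**
     * Returns up to maxVersions timestamps from [minStamp, maxStamp),
     * newest first. The input must already be sorted newest first.
     */
    static List<Long> select(List<Long> newestFirst, long minStamp, long maxStamp, int maxVersions) {
        List<Long> out = new ArrayList<>();
        for (long ts : newestFirst) {
            if (out.size() == maxVersions) {
                break;                              // version limit reached
            }
            if (ts >= minStamp && ts < maxStamp) {  // upper bound is exclusive
                out.add(ts);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> versions = Arrays.asList(50L, 40L, 30L, 20L, 10L);
        long t = 30L;
        // Models setTimeRange(0, T + 1) combined with setMaxVersions(1).
        System.out.println(select(versions, 0L, t + 1, 1)); // prints [30]
    }
}
```

Passing T + 1 as the upper bound is what makes a version stamped exactly T eligible.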
-
Re: setTimeRange and setMaxVersions seem to be inefficient
lars hofhansl, 2012-08-28, 00:11
Currently filters are evaluated before we do version counting.

Here's a comment from ScanQueryMatcher.java:

    /**
     * Filters should be checked before checking column trackers. If we do
     * otherwise, as was previously being done, ColumnTracker may increment its
     * counter for even that KV which may be discarded later on by Filter. This
     * would lead to incorrect results in certain cases.
     */

So this is by design. (Doesn't mean it's correct or desirable, though.)

-- Lars
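The "incorrect results" that the quoted comment warns about can be illustrated with a toy model. The sketch below is plain Java with hypothetical names (`FilterOrderModel`, `filterThenCount`, `countThenFilter`); it is not the ScanQueryMatcher implementation and only contrasts the two possible orders on one column's values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: hypothetical names, not HBase's ScanQueryMatcher.
class FilterOrderModel {
    /** Filter first, then count versions (the order the comment describes). */
    static List<String> filterThenCount(List<String> newestFirst, Predicate<String> filter, int maxVersions) {
        List<String> out = new ArrayList<>();
        for (String value : newestFirst) {
            if (!filter.test(value)) {
                continue;                 // rejected KVs are never counted
            }
            if (out.size() == maxVersions) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    /** Count versions first, then filter (the order the comment warns against). */
    static List<String> countThenFilter(List<String> newestFirst, Predicate<String> filter, int maxVersions) {
        List<String> out = new ArrayList<>();
        int counted = 0;
        for (String value : newestFirst) {
            if (counted == maxVersions) {
                break;
            }
            counted++;                    // counted even if the filter rejects it below
            if (filter.test(value)) {
                out.add(value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("red", "blue", "blue"); // newest first
        Predicate<String> onlyBlue = "blue"::equals;
        System.out.println(filterThenCount(values, onlyBlue, 1)); // prints [blue]
        System.out.println(countThenFilter(values, onlyBlue, 1)); // prints []
    }
}
```

With the counting done first, the newest version uses up the single allowed slot and is then discarded by the filter, so the column yields nothing even though an older version matches.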
-
Re: setTimeRange and setMaxVersions seem to be inefficient
Jerry Lam, 2012-08-28, 00:59
Hi Lars:

Thanks for confirming the inefficiency of the implementation for this case. In my case a column can have more than 10K versions, so I need a quick way to stop the scan from digging through the column once there is a match (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that tells the framework to stop and go to the next column once the number of versions specified in setMaxVersions is met.

For now, I guess I have to hack it in the custom filter (i.e. keep the count myself)? If you have a better way to achieve this, please share :)

Best Regards,
Jerry
-
Re: setTimeRange and setMaxVersions seem to be inefficient
lars hofhansl, 2012-08-28, 04:54
First off, regarding "inefficiency"... If version counting happened first and the filter were executed afterwards, we'd have folks "complaining" about inefficiencies as well ("Why does the code have to go through the versioning stuff when my filter filters the row/column/version anyway?") ;-)

For your problem, you want to make use of "seek hints"...

In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...). That way the scanning framework will know to skip ahead to the next column, row, or a KV of your choosing (see Filter.filterKeyValue and Filter.getNextKeyHint).

(As an aside, it would probably be nice if Filters also had INCLUDE_AND_NEXT_COL and INCLUDE_AND_NEXT_ROW, the codes used internally by StoreScanner.)

Have a look at ColumnPrefixFilter as an example.
I also wrote a short post here: http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html

Does that help?

-- Lars
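How a scanner can act on the return codes named in this message can be sketched with a toy loop. Everything here is hypothetical (`ReturnCodeModel`, `scan`, `keepColumn`, `grid`) and written in plain Java; it models only the skipping behaviour the message describes, not the real StoreScanner, and it leaves out SEEK_NEXT_USING_HINT.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model only: a hypothetical scanner loop, not HBase's StoreScanner.
class ReturnCodeModel {
    enum Code { INCLUDE, SKIP, NEXT_COL, NEXT_ROW }

    /** Scans {row, qualifier} cells given in sorted order, honoring the filter's codes. */
    static List<String> scan(String[][] cells, Function<String[], Code> filter) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < cells.length) {
            String row = cells[i][0];
            String qual = cells[i][1];
            Code code = filter.apply(cells[i]);
            if (code == Code.INCLUDE) out.add(row + "/" + qual);
            if (code == Code.INCLUDE || code == Code.SKIP) { i++; continue; }
            // NEXT_COL skips the remaining versions of this column,
            // NEXT_ROW skips the remaining cells of this row.
            while (i < cells.length && cells[i][0].equals(row)
                    && (code == Code.NEXT_ROW || cells[i][1].equals(qual))) {
                i++;
            }
        }
        return out;
    }

    /** Keeps only the target qualifier; counts its own invocations in calls[0]. */
    static Function<String[], Code> keepColumn(String target, int[] calls) {
        return cell -> {
            calls[0]++;
            int cmp = cell[1].compareTo(target);
            return cmp < 0 ? Code.NEXT_COL : cmp == 0 ? Code.INCLUDE : Code.NEXT_ROW;
        };
    }

    /** Builds rows x qualifiers x versions cells in sorted order. */
    static String[][] grid(String[] rows, String[] quals, int versions) {
        List<String[]> cells = new ArrayList<>();
        for (String r : rows)
            for (String q : quals)
                for (int v = 0; v < versions; v++) cells.add(new String[] { r, q });
        return cells.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        int[] calls = { 0 };
        String[][] cells = grid(new String[] { "r1", "r2" }, new String[] { "a", "b", "c" }, 3);
        List<String> kept = scan(cells, keepColumn("b", calls));
        // prints "6 cells kept, 10 filter calls for 18 cells"
        System.out.println(kept.size() + " cells kept, " + calls[0]
                + " filter calls for " + cells.length + " cells");
    }
}
```

In this model the filter is consulted once for column a and once for column c in each row, because NEXT_COL and NEXT_ROW let the loop skip the remaining versions without asking again.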
-
Re: setTimeRange and setMaxVersions seem to be inefficient
Jerry Lam, 2012-08-28, 14:17
Hi Lars:

Thanks for the reply. I need to understand whether I have misunderstood the perceived inefficiency, because it seems you don't see it quite the same way.

Let's say, as an example, we have 1 row with 2 columns (col-1 and col-2) in a table, and each column has 1000 versions. Using the following code (it might have errors and not compile):

    /*
     * This is a very simple use case of a ColumnPrefixFilter.
     * In fact, all other filters that make use of filterKeyValue will see
     * the same performance problem I am concerned about when the number of
     * versions per column could be huge.
     */
    Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
    Scan scan = new Scan();
    scan.setFilter(filter);
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
      for (KeyValue kv : result.raw()) {
        System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
      }
    }
    scanner.close();

Implicitly, the number of versions per column that will be returned is 1 (the latest version). A user might expect that only 2 column-prefix comparisons are needed (1 for col-1 and 1 for col-2), but in fact the filterKeyValue method of ColumnPrefixFilter is invoked 1001 times: once for col-1 and 1000 times for col-2 (once per version), because all versions of a column obviously share the same prefix. For col-1, it skips ahead using SEEK_NEXT_USING_HINT, which should skip the remaining 999 versions of col-1.

In summary, the 1000 comparisons (5000 byte comparisons) for the column prefix "col-2" are wasted, because only 1 version is returned to the user. Also, I believe this inefficiency is hidden from user code, but it affects all filters that use filterKeyValue as the main path for filtering KVs. Do we have a case for improving HBase to handle this inefficiency? :) It seems valid unless you prove otherwise.

Best Regards,
Jerry
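The call count in this scenario can be reproduced with a toy model. The sketch below is plain Java with hypothetical names (`PrefixScanModel`, `scan`, `repeat`) and it bakes in the assumption made in the message: every version of a matching column reaches the filter, and the version limit only trims what is returned. It does not establish how the real HBase scanner behaves.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, not HBase's ColumnPrefixFilter or scanner.
class PrefixScanModel {
    /** Returns {filterCalls, cellsReturned} for one row of sorted qualifiers. */
    static int[] scan(List<String> sortedQualifiers, String prefix, int maxVersions) {
        int calls = 0;
        int returned = 0;
        int versionsOfCurrent = 0;
        String current = null;
        int i = 0;
        while (i < sortedQualifiers.size()) {
            String q = sortedQualifiers.get(i);
            calls++;                                  // one filterKeyValue-style call
            if (q.startsWith(prefix)) {               // models INCLUDE
                if (!q.equals(current)) {
                    current = q;
                    versionsOfCurrent = 0;
                }
                if (versionsOfCurrent++ < maxVersions) returned++;
                i++;                                  // assumption: no skip once the limit is hit
            } else if (q.compareTo(prefix) < 0) {     // models SEEK_NEXT_USING_HINT
                while (i < sortedQualifiers.size()
                        && sortedQualifiers.get(i).compareTo(prefix) < 0) {
                    i++;                              // jump to the hint without calling the filter
                }
            } else {
                break;                                // past the prefix: nothing more to find
            }
        }
        return new int[] { calls, returned };
    }

    /** Builds n versions of one qualifier. */
    static List<String> repeat(String qualifier, int n) {
        List<String> out = new ArrayList<>();
        for (int v = 0; v < n; v++) out.add(qualifier);
        return out;
    }

    public static void main(String[] args) {
        List<String> cells = new ArrayList<>(repeat("col-1", 1000));
        cells.addAll(repeat("col-2", 1000));
        int[] r = scan(cells, "col-2", 1);
        System.out.println(r[0] + " filter calls, " + r[1] + " cell returned");
        // prints "1001 filter calls, 1 cell returned"
    }
}
```

Under that assumption the single seek hint takes care of col-1 in one call, and the remaining 1000 calls all go to col-2.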
-
Re: setTimeRange and setMaxVersions seem to be inefficient
lars hofhansl, 2012-08-28, 18:21
What I was saying was: It depends. :)

First off, how do you get to 1000 versions? In 0.94++ older versions are pruned upon flush, so you need 333 flushes (assuming 3 versions on the CF) to get 1000 versions. By that time some compactions will have happened and you're back to close to 3 versions (maybe 9, 12, or 15 or so, depending on how many store files you have).

Now, if you have that many versions because you set VERSIONS=>1000 in your CF... Then imagine you have 100 columns with 1000 versions each. In your scenario you'd do 100000 comparisons if the filter were evaluated after the version counting, but only 1100 with the current code (or at least in that ballpark).

The gist is: one can construct scenarios where one approach is better than the other. Only one order is possible.
If you write a custom filter and you care about these things you should use the seek hints.

-- Lars
-
Re: setTimeRange and setMaxVersions seem to be inefficient
Jerry Lam, 2012-08-28, 20:52
Hi Lars:

I see. Please refer to the inline comments below.

Best Regards,
Jerry

On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> First off, how do you get to 1000 versions? In 0.94++ older versions are
> pruned upon flush, so you need 333 flushes (assuming 3 versions on the CF)
> to get 1000 versions.

I forgot that the default number of versions to keep is 3. If this is what people use most of the time, then yes, you are right for this type of scenario, where the number of versions kept per column is small.

> Now, if you have that many versions because you set VERSIONS=>1000
> in your CF... Then imagine you have 100 columns with 1000 versions each.

Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the versioning myself).

> In your scenario below you'd do 100000 comparisons if the filter would be
> evaluated after the version counting. But only 1100 with the current code.
> (or at least in that ball park)

This is where I don't quite understand what you mean. If the framework counted the number of ReturnCode.INCLUDE results and stopped feeding KeyValues into the filterKeyValue method once it reached the count specified in setMaxVersions (i.e. 1 for the case we discussed), shouldn't it be just 100 comparisons (at most) instead of 1100? Maybe I don't understand how the current code works...
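The three figures in this exchange (100000, roughly 1100, and 100) can be put side by side with a little arithmetic. The sketch below is one plausible reading of where each number comes from, not something either poster spelled out; the class and method names are hypothetical, and it assumes a single matching column out of 100.

```java
// Toy arithmetic only: one plausible reading of the figures in the thread.
class ComparisonCounts {
    /** Every version of every column is handed to the filter. */
    static long everyVersion(int columns, int versions) {
        return (long) columns * versions;
    }

    /** One call for each non-matching column, plus every version of the one matching column. */
    static long hintsPlusOneColumn(int columns, int versions) {
        return (columns - 1) + (long) versions;
    }

    /** The filter is consulted at most once per column. */
    static long onePerColumn(int columns) {
        return columns;
    }

    public static void main(String[] args) {
        int columns = 100;
        int versions = 1000;
        System.out.println(everyVersion(columns, versions));       // prints 100000
        System.out.println(hintsPlusOneColumn(columns, versions)); // prints 1099
        System.out.println(onePerColumn(columns));                 // prints 100
    }
}
```

On this reading, 1099 is the "ballpark of 1100" and 100 is the bound Jerry is asking about.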
-
Re: setTimeRange and setMaxVersions seem to be inefficient
lars hofhansl, 2012-08-29, 06:09
Hi Jerry,

my answer will be the same again: Some folks will want the max versions set by the client to be applied before filters, and some folks will want it to restrict the end result. It's not possible to have it both ways. Your filter needs to do the right thing.

There's a lot of discussion around this in HBASE-5104.

-- Lars
-
Re: setTimeRange and setMaxVersions seem to be inefficient
Jerry Lam, 2012-08-29, 13:59
Hi Lars:

Thanks for spending time discussing this with me. I appreciate it.

I tried to implement setMaxVersions(1) inside the filter as follows:

    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
      // Same qualifier as the one included previously? If yes, jump to the next column.
      if (previousIncludedQualifier != null
          && Bytes.compareTo(previousIncludedQualifier, kv.getQualifier()) == 0) {
        previousIncludedQualifier = null;
        return ReturnCode.NEXT_COL;
      }
      // Another condition that jumps further ahead using a hint.
      if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
        LOG.info("Match found.");
        return ReturnCode.SEEK_NEXT_USING_HINT;
      }
      // Include this KV in the result and remember its qualifier so that the
      // next version of the same qualifier will be excluded.
      previousIncludedQualifier = kv.getQualifier();
      return ReturnCode.INCLUDE;
    }

Does this look reasonable, or is there a better way to achieve this? It would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case, though.

Best Regards,
Jerry
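The effect of the include-once pattern in this message can be checked with a toy model. The sketch below is plain Java with hypothetical names (`IncludeOnceModel`, `scanRow`); it mimics only the INCLUDE and NEXT_COL branches of the filter, leaves out the seek-hint branch, and assumes the scanner skips the remaining versions of a column when it sees NEXT_COL.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names that mimic the shape of the filter in the message.
class IncludeOnceModel {
    enum Code { INCLUDE, NEXT_COL }

    private String previousIncludedQualifier;   // state carried between calls

    Code filterKeyValue(String qualifier) {
        if (qualifier.equals(previousIncludedQualifier)) {
            previousIncludedQualifier = null;
            return Code.NEXT_COL;               // second version: leave this column
        }
        previousIncludedQualifier = qualifier;
        return Code.INCLUDE;                    // first version: keep it
    }

    /** Clears per-row state so the next row starts fresh. */
    void reset() {
        previousIncludedQualifier = null;
    }

    /** Returns {filterCalls, cellsReturned} for one row of sorted qualifiers. */
    static int[] scanRow(List<String> sortedQualifiers, IncludeOnceModel filter) {
        int calls = 0;
        int returned = 0;
        int i = 0;
        while (i < sortedQualifiers.size()) {
            String q = sortedQualifiers.get(i);
            calls++;
            if (filter.filterKeyValue(q) == Code.INCLUDE) {
                returned++;
                i++;
            } else {
                // NEXT_COL: skip the remaining versions of this qualifier.
                while (i < sortedQualifiers.size() && sortedQualifiers.get(i).equals(q)) i++;
            }
        }
        filter.reset();
        return new int[] { calls, returned };
    }

    public static void main(String[] args) {
        List<String> row = new ArrayList<>();
        for (int v = 0; v < 1000; v++) row.add("col-1");
        for (int v = 0; v < 1000; v++) row.add("col-2");
        int[] r = scanRow(row, new IncludeOnceModel());
        System.out.println(r[0] + " filter calls, " + r[1] + " cells returned");
        // prints "4 filter calls, 2 cells returned"
    }
}
```

The second call per column exists only to say NEXT_COL, which is the call a combined INCLUDE_AND_NEXT_COL code would save. The reset between rows matters when a row ends on a single-version column and the next row begins with the same qualifier.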
-
Re: setTimeRange and setMaxVersions seem to be inefficientTed Yu 2012-08-29, 17:47
Jerry:
Remember to also implement:

+  @Override
+  public KeyValue getNextKeyHint(KeyValue currentKV) {

You can log a JIRA for supporting ReturnCode.INCLUDE_AND_NEXT_COL.

Cheers

On Wed, Aug 29, 2012 at 6:59 AM, Jerry Lam <[EMAIL PROTECTED]> wrote:

> Hi Lars:
>
> Thanks for spending time discussing this with me. I appreciate it.
>
> I tried to implement the setMaxVersions(1) inside the filter as follows:
>
>   @Override
>   public ReturnCode filterKeyValue(KeyValue kv) {
>     // check if the same qualifier as the one that has been included
>     // previously. If yes, jump to next column
>     if (previousIncludedQualifier != null &&
>         Bytes.compareTo(previousIncludedQualifier, kv.getQualifier()) == 0) {
>       previousIncludedQualifier = null;
>       return ReturnCode.NEXT_COL;
>     }
>     // another condition that makes the jump further using HINT
>     if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
>       LOG.info("Matched Found.");
>       return ReturnCode.SEEK_NEXT_USING_HINT;
>     }
>     // include this to the result and keep track of the included
>     // qualifier so the next version of the same qualifier will be excluded
>     previousIncludedQualifier = kv.getQualifier();
>     return ReturnCode.INCLUDE;
>   }
>
> Does this look reasonable, or is there a better way to achieve this? It
> would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case though.
>
> Best Regards,
>
> Jerry
>
> On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> > Hi Jerry,
> >
> > my answer will be the same again:
> > Some folks will want the max versions set by the client to be before
> > filters and some folks will want it to restrict the end result.
> > It's not possible to have it both ways. Your filter needs to do the
> > right thing.
> >
> > There's a lot of discussion around this in HBASE-5104.
> >
> > -- Lars
> >
> > ________________________________
> > From: Jerry Lam <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> > Sent: Tuesday, August 28, 2012 1:52 PM
> > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi Lars:
> >
> > I see. Please refer to the inline comment below.
> >
> > Best Regards,
> >
> > Jerry
> >
> > On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> > > What I was saying was: It depends. :)
> > >
> > > First off, how do you get to 1000 versions? In 0.94++ older versions
> > > are pruned upon flush, so you need 333 flushes (assuming 3 versions
> > > on the CF) to get 1000 versions.
> >
> > I forgot that the default number of versions to keep is 3. If this is
> > what people use most of the time, yes, you are right for this type of
> > scenario, where the number of versions per column to keep is small.
> >
> > > By that time some compactions will have happened and you're back to
> > > close to 3 versions (maybe 9, 12, or 15 or so, depending on how many
> > > store files you have).
> > >
> > > Now, if you have that many versions because you set VERSIONS=>1000
> > > in your CF... Then imagine you have 100 columns with 1000 versions
> > > each.
> >
> > Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> > versioning myself)
> >
> > > In your scenario below you'd do 100000 comparisons if the filter would
> > > be evaluated after the version counting. But only 1100 with the
> > > current code. (or at least in that ball park)
> >
> > This is where I don't quite understand what you mean.
> >
> > If the framework counts the number of ReturnCode.INCLUDE and then stops
> > feeding the KeyValue into the filterKeyValue method after it reaches the
> > count specified in setMaxVersions (i.e. 1 for the case we discussed),
> > should it then be just 100 comparisons (at most) instead of 1100
> > comparisons? Maybe I don't understand how the current way works...
> >
> > > The gist is: One can construct scenarios where one approach is better
> > > than
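
The filter quoted in this message can be tried out without a cluster using a small stand-alone model. This is only a sketch and not the HBase API: `Cell`, `scanRow`, `buildRow` and the two-value `ReturnCode` enum are hypothetical stand-ins, and the SEEK_NEXT_USING_HINT branch (and with it getNextKeyHint) is left out. Under those assumptions it illustrates that returning INCLUDE for the newest version and NEXT_COL for the second one hands the filter two cells per column, however many versions are stored.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the "first version per column" filter; NOT the HBase API. */
class FirstVersionFilterSketch {

    /** Hypothetical subset of the return codes discussed in the thread. */
    enum ReturnCode { INCLUDE, NEXT_COL }

    /** Hypothetical cell: one version of one column. */
    static final class Cell {
        final String qualifier;
        final long timestamp;
        Cell(String qualifier, long timestamp) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    private String previousIncludedQualifier; // qualifier of the last included cell
    int filterCalls;                          // number of cells handed to the filter

    ReturnCode filterKeyValue(Cell cell) {
        filterCalls++;
        // Same qualifier as the cell just included: skip the older versions.
        if (cell.qualifier.equals(previousIncludedQualifier)) {
            previousIncludedQualifier = null;
            return ReturnCode.NEXT_COL;
        }
        // First (newest) version of this qualifier: include and remember it.
        previousIncludedQualifier = cell.qualifier;
        return ReturnCode.INCLUDE;
    }

    /** Model scan of one row; cells are sorted by qualifier, newest version first. */
    List<Cell> scanRow(List<Cell> row) {
        List<Cell> result = new ArrayList<>();
        int i = 0;
        while (i < row.size()) {
            Cell cell = row.get(i);
            if (filterKeyValue(cell) == ReturnCode.INCLUDE) {
                result.add(cell);
                i++;
            } else {
                // NEXT_COL: jump past the remaining versions of this qualifier.
                String q = cell.qualifier;
                while (i < row.size() && row.get(i).qualifier.equals(q)) {
                    i++;
                }
            }
        }
        return result;
    }

    /** Builds one row with the given number of columns and versions per column. */
    static List<Cell> buildRow(int columns, int versions) {
        List<Cell> row = new ArrayList<>();
        for (int c = 0; c < columns; c++) {
            String q = String.format("col%03d", c);
            for (int v = versions; v >= 1; v--) {
                row.add(new Cell(q, v)); // higher timestamp = newer version
            }
        }
        return row;
    }

    public static void main(String[] args) {
        FirstVersionFilterSketch filter = new FirstVersionFilterSketch();
        List<Cell> out = filter.scanRow(buildRow(100, 1000));
        // prints: 100 cells returned, 200 filter calls
        System.out.println(out.size() + " cells returned, " + filter.filterCalls + " filter calls");
    }
}
```

In this model, 100 columns with 1000 versions each cost 200 filter calls. The cost of the skip itself (a seek or a series of next calls inside the region server) is not modelled here.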
-
Re: setTimeRange and setMaxVersions seem to be inefficient
Jerry Lam 2012-08-29, 18:36
Hi Ted:
Sure, will do. I also implement the reset method to set previousIncludedQualifier to null for the next row to come.

Best Regards,

Jerry

On Wed, Aug 29, 2012 at 1:47 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Jerry:
> Remember to also implement:
>
> +  @Override
> +  public KeyValue getNextKeyHint(KeyValue currentKV) {
>
> You can log a JIRA for supporting ReturnCode.INCLUDE_AND_NEXT_COL.
>
> Cheers
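
Why the reset matters can be seen with a small stand-alone model. Again this is a sketch rather than HBase code: `includeFirstVersion` and `scan` are hypothetical names, rows are reduced to lists of qualifiers, and the assumption is that the framework calls the filter's reset between rows. Without the reset, a row that ends on an included single-version column leaves its qualifier in previousIncludedQualifier, so the first cell of the next row is skipped if it has the same qualifier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of per-row filter state; NOT the HBase API. */
class ResetPerRowSketch {

    private String previousIncludedQualifier;

    /** Returns true to include the cell, false to skip to the next column. */
    boolean includeFirstVersion(String qualifier) {
        if (qualifier.equals(previousIncludedQualifier)) {
            previousIncludedQualifier = null;
            return false;
        }
        previousIncludedQualifier = qualifier;
        return true;
    }

    /** Counterpart of the filter's reset(): clears the per-row state. */
    void reset() {
        previousIncludedQualifier = null;
    }

    /** Scans rows of qualifiers and returns the number of included cells per row. */
    static List<Integer> scan(List<List<String>> rows, boolean resetBetweenRows) {
        ResetPerRowSketch filter = new ResetPerRowSketch();
        List<Integer> counts = new ArrayList<>();
        for (List<String> row : rows) {
            int included = 0;
            int i = 0;
            while (i < row.size()) {
                String q = row.get(i);
                if (filter.includeFirstVersion(q)) {
                    included++;
                    i++;
                } else {
                    // Skip the remaining versions of this qualifier.
                    while (i < row.size() && row.get(i).equals(q)) {
                        i++;
                    }
                }
            }
            counts.add(included);
            if (resetBetweenRows) {
                filter.reset();
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two rows, each with a single version of column "a".
        List<List<String>> rows = Arrays.asList(Arrays.asList("a"), Arrays.asList("a"));
        System.out.println(scan(rows, false)); // prints [1, 0]: second row loses its cell
        System.out.println(scan(rows, true));  // prints [1, 1]
    }
}
```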