Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> setTimeRange and setMaxVersions seem to be inefficient


Copy link to this message
-
Re: setTimeRange and setMaxVersions seem to be inefficient
Hi Lars:

Thanks for the reply.
I need to understand if I misunderstood the perceived inefficiency because
it seems you don't think quite the same.

Let say, as an example, we have 1 row with 2 columns (col-1 and col-2) in a
table and each column has 1000 versions. Using the following code (the code
might have errors and don't compile):
/**
 * This is very simple use case of a ColumnPrefixFilter.
 * In fact all other filters that make use of filterKeyValue will see
similar
 * performance problems that I have concerned with when the number of
 * versions per column could be huge.

Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
Scan scan = new Scan();
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    for (KeyValue kv : result.raw()) {
        System.out.println("KV: " + kv + ", Value: " +
        Bytes.toString(kv.getValue()));
    }
}
scanner.close();
*/

Implicitly, the number of version per column that is going to return is 1
(the latest version). User might expect that only 2 comparisons for column
prefix are needed (1 for col-1 and 1 for col-2) but in fact, it processes
the filterKeyValue method in ColumnPrefixFilter 1000 times (1 for col-1 and
1000 for col-2) for col-2 (1 per version) because all versions of the
column have the same prefix for obvious reason. For col-1, it will skip
using SEEK_NEXT_USING_HINT which should skip the 99 versions of col-1.

In summary, the 1000 comparisons (5000 byte comparisons) for the column
prefix "col-2" is wasted because only 1 version is returned to user. Also,
I believe this inefficiency is hidden from the user code but it affects all
filters that use filterKeyValue as the main execution for filtering KVs. Do
we have a case to improve HBase to handle this inefficiency? :) It seems
valid unless you prove otherwise.

Best Regards,

Jerry

On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> First off regarding "inefficiency"... If version counting would happen
> first and then filter were executed we'd have folks "complaining" about
> inefficiencies as well:
> ("Why does the code have to go through the versioning stuff when my filter
> filters the row/column/version anyway?")  ;-)
>
>
> For your problem, you want to make use of "seek hints"...
>
> In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
>
> That way the scanning framework will know to skip ahead to the next
> column, row, or a KV of your choosing. (see Filter.filterKeyValue and
> Filter.getNextKeyHint).
>
> (as an aside, it would probably be nice if Filters also had
> INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by StoreScanner)
>
> Have a look at ColumnPrefixFilter as an example.
> I also wrote a short post here:
> http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
>
> Does that help?
>
> -- Lars
>
>
> ----- Original Message -----
> From: Jerry Lam <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Monday, August 27, 2012 5:59 PM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> Thanks for confirming the inefficiency of the implementation for this
> case. For my case, a column can have more than 10K versions, I need a quick
> way to stop the scan from digging the column once there is a match
> (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that can notify
> the framework to stop and go to next column once the number of versions
> specify in setMaxVersions is met.
>
> For now, I guess I have to hack it in the custom filter (I.e. I keep the
> count myself)? If you have a better way to achieve this, please share :)
>
> Best Regards,
>
> Jerry
>
> Sent from my iPad (sorry for spelling mistakes)
>
> On 2012-08-27, at 20:11, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> > Currently filters are evaluated before we do version counting.
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB