Re: setTimeRange and setMaxVersions seem to be inefficient
Hi Lars:

I see. Please refer to the inline comment below.

Best Regards,

Jerry

On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> What I was saying was: It depends. :)
>
> First off, how do you get to 1000 versions? In 0.94++ older versions are
> pruned upon flush, so you need 333 flushes (assuming 3 versions on the CF)
> to get 1000 versions.
>

I forgot that the default number of versions to keep is 3. If this is what
people use most of the time, then yes, you are right for this type of
scenario, where the number of versions to keep per column is small.

> By that time some compactions will have happened and you're back to close
> to 3 versions (maybe 9, 12, or 15 or so, depending on how many store files
> you have).
>
> Now, if you have that many versions because you set VERSIONS=>1000
> in your CF... Then imagine you have 100 columns with 1000 versions each.
>

Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
versioning myself)

> In your scenario below you'd do 100000 comparisons if the filter would be
> evaluated after the version counting. But only 1100 with the current code.
> (or at least in that ball park)
>

This is where I don't quite understand what you mean.

If the framework counted the number of ReturnCode.INCLUDE results and
stopped feeding KeyValues into the filterKeyValue method once it reached
the count specified in setMaxVersions (i.e. 1 for the case we discussed),
shouldn't there be just 100 comparisons (at most) instead of 1100?
Maybe I don't understand how the current code works...
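
To make the counts in this exchange concrete, here is a minimal sketch in
plain Java that only counts filter invocations under the two orderings being
discussed. It is a toy model of the behaviour as described in this thread,
not HBase's scanner code. The names FilterOrderModel, filterFirstCalls and
versionCountFirstCalls are made up for illustration, and the model assumes
that a column rejected by the filter costs exactly one filter call before the
rest of that column is skipped.

```java
import java.util.function.IntPredicate;

/** Toy model only: counts filter invocations; it does not use the HBase API. */
final class FilterOrderModel {

    /**
     * Filter-first order as described in the thread: the filter is asked about
     * every version of a column it accepts, and version counting happens later.
     * Assumption: a rejected column costs one call and is then skipped.
     */
    static long filterFirstCalls(int columns, int versionsPerColumn, IntPredicate accepts) {
        long calls = 0;
        for (int col = 0; col < columns; col++) {
            for (int v = 0; v < versionsPerColumn; v++) {
                calls++;                  // one filter invocation
                if (!accepts.test(col)) {
                    break;                // rejected: skip the rest of this column
                }
            }
        }
        return calls;
    }

    /**
     * Proposed order: stop feeding a column's versions to the filter once
     * maxVersions of them have been included.
     */
    static long versionCountFirstCalls(int columns, int versionsPerColumn,
                                       int maxVersions, IntPredicate accepts) {
        long calls = 0;
        for (int col = 0; col < columns; col++) {
            int included = 0;
            for (int v = 0; v < versionsPerColumn && included < maxVersions; v++) {
                calls++;                  // one filter invocation
                if (!accepts.test(col)) {
                    break;                // rejected: skip the rest of this column
                }
                included++;               // counts towards maxVersions
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        // Two columns, 1000 versions each, filter accepts only the second column.
        IntPredicate secondColumnOnly = col -> col == 1;
        System.out.println(filterFirstCalls(2, 1000, secondColumnOnly));          // 1001
        System.out.println(versionCountFirstCalls(2, 1000, 1, secondColumnOnly)); // 2

        // 100 columns, 1000 versions each, filter accepts every column.
        IntPredicate everyColumn = col -> true;
        System.out.println(filterFirstCalls(100, 1000, everyColumn));             // 100000
        System.out.println(versionCountFirstCalls(100, 1000, 1, everyColumn));    // 100
    }
}
```

Under these assumptions the second ordering calls the filter at most
maxVersions times per column, which is where the "100 comparisons at most"
figure comes from. Whether the real scanner can be reordered this way is the
open question in this thread.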

>
> The gist is: One can construct scenarios where one approach is better than
> the other. Only one order is possible.
> If you write a custom filter and you care about these things you should
> use the seek hints.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Jerry Lam <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> Cc:
> Sent: Tuesday, August 28, 2012 7:17 AM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> Thanks for the reply.
> I need to understand whether I have misjudged the perceived inefficiency,
> because it seems you don't see it quite the same way.
>
> Let's say, as an example, we have 1 row with 2 columns (col-1 and col-2) in
> a table and each column has 1000 versions. Using the following code (the
> code might have errors and might not compile):
> /*
>  * This is a very simple use case of a ColumnPrefixFilter.
>  * In fact, all other filters that make use of filterKeyValue will see
>  * similar performance problems, which concern me when the number of
>  * versions per column could be huge.
>  */
> // 'table' is assumed to be an already-open HTable for the table above.
> Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> Scan scan = new Scan();
> scan.setFilter(filter);
> ResultScanner scanner = table.getScanner(scan);
> try {
>     for (Result result : scanner) {
>         for (KeyValue kv : result.raw()) {
>             System.out.println("KV: " + kv + ", Value: "
>                     + Bytes.toString(kv.getValue()));
>         }
>     }
> } finally {
>     // Release the scanner even if iteration fails.
>     scanner.close();
> }
>
> Implicitly, the number of versions per column that will be returned is 1
> (the latest version). A user might expect that only 2 column-prefix
> comparisons are needed (1 for col-1 and 1 for col-2), but in fact the
> filterKeyValue method in ColumnPrefixFilter is called 1001 times: once for
> col-1 and 1000 times for col-2 (once per version), because all versions of
> a column obviously share the same prefix. For col-1, it will skip ahead
> using SEEK_NEXT_USING_HINT, which should skip the other 999 versions of
> col-1.
>
> In summary, the 1000 comparisons (5000 byte comparisons) for the column
> prefix "col-2" are wasted because only 1 version is returned to the user.
> Also, I believe this inefficiency is hidden from the user code, but it
> affects all filters that use filterKeyValue as their main mechanism for
> filtering KVs. Do we have a case for improving HBase to handle this
> inefficiency? :) It seems valid unless you prove otherwise.
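
The effect of a seek hint on col-1, as described above, can be pictured with
a small model over sorted keys. This is only a sketch in plain Java: it does
not use or reproduce the HBase API, and the names SeekHintModel, sampleRow and
keysExamined, as well as the "column/version" key layout, are assumptions made
for illustration. The "seek" is modelled with TreeSet.ceiling, which jumps to
the first key at or after the prefix instead of stepping through every key.

```java
import java.util.TreeSet;

/** Toy model of a seek hint over sorted keys; not HBase code. */
final class SeekHintModel {

    /** Builds one row with columns col-1 and col-2, each with the given number of versions. */
    static TreeSet<String> sampleRow(int versionsPerColumn) {
        TreeSet<String> keys = new TreeSet<>();
        for (int v = 0; v < versionsPerColumn; v++) {
            keys.add(String.format("col-1/%04d", v));
            keys.add(String.format("col-2/%04d", v));
        }
        return keys;
    }

    /** Returns how many keys the scan looks at while collecting all keys with the prefix. */
    static int keysExamined(TreeSet<String> sortedKeys, String prefix, boolean useSeekHint) {
        int examined = 0;
        String current = sortedKeys.isEmpty() ? null : sortedKeys.first();
        while (current != null) {
            examined++;
            boolean matches = current.startsWith(prefix);
            if (!matches && current.compareTo(prefix) > 0) {
                break;                                    // past the prefix range
            }
            if (!matches && useSeekHint) {
                current = sortedKeys.ceiling(prefix);     // jump to first key >= prefix
            } else {
                current = sortedKeys.higher(current);     // step to the next key
            }
        }
        return examined;
    }

    public static void main(String[] args) {
        TreeSet<String> row = sampleRow(1000);
        System.out.println(keysExamined(row, "col-2", true));   // 1001
        System.out.println(keysExamined(row, "col-2", false));  // 2000
    }
}
```

In this model the hint removes the work on col-1, but every version of col-2
is still examined, which is the part of the cost this message is asking about.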