Tom Brown 2012-09-12, 21:52
Xiang Hua 2012-09-13, 04:45
n keywal 2012-09-12, 22:08
Tom Brown 2012-09-12, 22:42
RE: Performance of scan setTimeRange VS manually doing it
Anoop Sam John 2012-09-13, 03:56
I think your guess is correct. When an HFile cannot be skipped because its min and max timestamps overlap the given time range, that file is scanned fully and the rows outside the range are filtered out. Those rows are still read from HDFS.
When you do the reseeks, many such reads can be avoided. Remember that HFiles are split into blocks and that we read from HDFS one block after the other, so these reseeks might skip many blocks.
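To put rough numbers on the block argument, here is a small arithmetic sketch. It is not HBase code: the class name, the fixed number of rows per block, and the row indices are all assumptions made for illustration, and it assumes the file's block index lets a seek land on the right block without reading the earlier ones.

```java
// Hypothetical illustration only; none of these names exist in HBase.
final class BlockSkipSketch {
    /** Blocks read when every row in the file is visited and filtered. */
    static int blocksReadByFullScan(int totalRows, int rowsPerBlock) {
        // Ceiling division: a partially filled last block still has to be read.
        return (totalRows + rowsPerBlock - 1) / rowsPerBlock;
    }

    /** Blocks read when the scan seeks straight to rows firstRow..lastRow (0-based, inclusive). */
    static int blocksReadAfterSeek(int firstRow, int lastRow, int rowsPerBlock) {
        // Only the blocks holding the first through the last wanted row are touched.
        return lastRow / rowsPerBlock - firstRow / rowsPerBlock + 1;
    }

    public static void main(String[] args) {
        int rowsPerBlock = 100; // assumed; real blocks are sized in bytes, not rows
        System.out.println("full scan: " + blocksReadByFullScan(10000, rowsPerBlock) + " blocks");
        System.out.println("after seek: " + blocksReadAfterSeek(9950, 9999, rowsPerBlock) + " blocks");
    }
}
```

With these made-up numbers the full scan touches 100 blocks and the seek touches 1. The real ratio depends on block size and on where the wanted rows sit in the file.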
From: Tom Brown [[EMAIL PROTECTED]]
Sent: Thursday, September 13, 2012 4:12 AM
To: [EMAIL PROTECTED]
Subject: Re: Performance of scan setTimeRange VS manually doing it
It seems like the internal logic for handling a time range has two
parts. First, as you said, each file records the minimum and maximum
timestamps it contains. This provides a very rough filter for the
data, but if your data is right, the effect can be huge. Second, a
time range acts as a simple filter during a scan: while looking for the
next row to return, it checks whether the timestamp for the row is
within the time range, returns that row if it is, and continues to the
next row if it isn't.
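The two parts described above can be sketched as follows. This is a simplified model, not the HBase implementation: FileInfo, mayContain and scan are hypothetical names, and a "file" here is just an array of timestamps. The range is treated as half-open, [from, to), which is how Scan.setTimeRange interprets its arguments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not HBase classes.
final class TimeRangeSketch {
    /** A store file reduced to its cell timestamps plus their min and max. */
    static final class FileInfo {
        final long minTs;
        final long maxTs;
        final long[] cellTimestamps;

        FileInfo(long... cellTimestamps) {
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (long ts : cellTimestamps) {
                min = Math.min(min, ts);
                max = Math.max(max, ts);
            }
            this.minTs = min;
            this.maxTs = max;
            this.cellTimestamps = cellTimestamps;
        }
    }

    /** Part one: the file can be skipped unless [minTs, maxTs] overlaps [from, to). */
    static boolean mayContain(FileInfo f, long from, long to) {
        return f.minTs < to && f.maxTs >= from;
    }

    /** Part two: every cell of every non-skipped file is tested individually. */
    static List<Long> scan(List<FileInfo> files, long from, long to) {
        List<Long> out = new ArrayList<>();
        for (FileInfo f : files) {
            if (!mayContain(f, from, to)) {
                continue; // whole file skipped, nothing is read from it
            }
            for (long ts : f.cellTimestamps) {
                if (ts >= from && ts < to) {
                    out.add(ts);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<FileInfo> files = Arrays.asList(
                new FileInfo(1, 2, 3),      // entirely older than the query: skipped
                new FileInfo(2, 50, 100));  // overlaps the query: read in full, then filtered
        System.out.println(scan(files, 40, 60)); // prints [50]
    }
}
```

Note that the second file is read in full even though only one of its cells matches, which is the cost discussed in this thread.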
What it *doesn't* appear to do, however, is reseek to the row with the
minimum timestamp. Since my row key also contains a copy of the
timestamp, a reseek is able to bypass a lot of rows that the generic
logic would test individually. Perhaps HBase itself could be made to
work this way, but I'm unsure enough of its internal workings that I
can't say for sure.
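To illustrate the difference, here is a rough simulation using a sorted in-memory map in place of an HBase table. Everything in it is an assumption for the sake of the example: the row key is just the timestamp (a real key would be bytes and would likely carry other fields too), and filterScan, seekScan and examined are hypothetical names. How the seek maps onto the HBase client API is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical simulation; a TreeMap stands in for rows sorted by key.
final class ReseekSketch {
    /** Number of rows touched by the most recent scan call. */
    static int examined;

    /** Generic approach: visit every row and test its timestamp. */
    static List<String> filterScan(NavigableMap<Long, String> rows, long from, long to) {
        examined = 0;
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : rows.entrySet()) {
            examined++;
            if (e.getKey() >= from && e.getKey() < to) {
                out.add(e.getValue());
            }
        }
        return out;
    }

    /** Key-based approach: jump to the first candidate key and stop at the range end. */
    static List<String> seekScan(NavigableMap<Long, String> rows, long from, long to) {
        examined = 0;
        List<String> out = new ArrayList<>();
        // subMap locates the start through the sorted structure, skipping earlier rows.
        for (Map.Entry<Long, String> e : rows.subMap(from, true, to, false).entrySet()) {
            examined++;
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> rows = new TreeMap<>();
        for (long ts = 0; ts < 1000; ts++) {
            rows.put(ts, "row-" + ts);
        }
        filterScan(rows, 990, 1000);
        System.out.println("filter examined " + examined + " rows");
        seekScan(rows, 990, 1000);
        System.out.println("seek examined " + examined + " rows");
    }
}
```

Both methods return the same ten rows; the filter touches all 1000 rows while the seek touches only the ten it returns. This only works because the key order follows the timestamp, which is the property of my row keys that the generic time range logic cannot assume.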
(The above is my best guess; let me know if something about that
explanation doesn't smell right.)
On Wed, Sep 12, 2012 at 4:08 PM, n keywal <[EMAIL PROTECTED]> wrote:
> For each file, there is a time range. When you scan/search, the file is
> skipped if there is no overlap between the file timerange and the timerange
> of the query. As there are other parameters as well (row distribution,
> compaction effects, cache, bloom filters, ...) it's difficult to know in
> advance what's going to happen exactly. But specifying a timerange does no
> harm for sure, if it matches your functional needs...
> That said, if you already have the rowkey, the time range is less
> interesting, as you will already skip a lot of files.
> On Wed, Sep 12, 2012 at 11:52 PM, Tom Brown <[EMAIL PROTECTED]> wrote:
>> When I query HBase, I always include a time range. This has not been a
>> problem when querying recent data, but it seems to be an issue when I
>> query older data (a few hours old). All of my row keys include the
>> timestamp as part of the key (this value is the same as the HBase
>> timestamp for the row). I recently tried an experiment where I
>> manually re-seek to the possible row (based on the timestamp as part
>> of the row key) instead of using "setTimeRange" on my scan object and
>> was amazed to see that there was no degradation for older data.
>> Can someone postulate a theory as to why this might be happening? I'm
>> happy to provide extra data if it will help you theorize...
>> Is there a downside to stopping using "setTimeRange"?