HBase, mail # user - querying data on the basis of timestamp


Re: querying data on the basis of timestamp
Ted Yu 2013-03-14, 23:03
What you are asking looks similar to this:
HBASE-5010 Filter HFiles based on TTL

It went into 0.94.0

Cheers

On Thu, Mar 14, 2013 at 3:53 PM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have a question regarding query performance for rows greater than a
> timestamp. The use case is this:
> I want to find all the rows in a key range that have changed after a
> certain timestamp and up to a certain timestamp, i.e. exactly using this
> Scan API:
>
>   Scan setTimeRange(long minStamp, long maxStamp)
>     Get versions of columns only within the specified timestamp
>     range, [minStamp, maxStamp)
>
> Would this query go through all the rows in the key range, or is there an
> optimization that makes it faster?
>
> I ask because I read about such an optimization in the following paper:
>
> http://oss.csie.fju.edu.tw/~tzu98/Apache%20Hadoop%20Goes%20Realtime%20at%20Facebook.pdf
>
> Here is the excerpt:
> "For data stored in HBase that is time-series or contains a specific,
> known timestamp, a special timestamp file selection algorithm
> was added. Since time moves forward and data is rarely inserted
> at a significantly later time than its timestamp, each HFile will
> generally contain values for a fixed range of time. This
> information is stored as metadata in each HFile and queries that
> ask for a specific timestamp or range of timestamps will check if
> the request intersects with the ranges of each file, skipping those
> which do not overlap. "
>
>
> This would work perfectly for my use case, but I don't know if this
> optimization, or any other for this use case, exists in Apache HBase.
> We are currently using Apache HBase 0.92.1 but are considering
> moving to 0.94.
>
> Thanks,
> Pankaj
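
For readers of this thread, the file selection idea described in the quoted excerpt can be sketched roughly as below. This is only an illustration in plain Java, not HBase code: the class FileTimeRange, its fields, and selectFiles are hypothetical names made up for the example, and the file names and timestamps are invented. It assumes each file records an inclusive [minTimestamp, maxTimestamp] and that the query range is [minStamp, maxStamp), as in the setTimeRange description quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TimeRangeFileSelection {

    /** Hypothetical stand-in for per-file timestamp metadata (not an HBase class). */
    static final class FileTimeRange {
        final String name;
        final long minTimestamp; // oldest cell timestamp in the file, inclusive
        final long maxTimestamp; // newest cell timestamp in the file, inclusive

        FileTimeRange(String name, long minTimestamp, long maxTimestamp) {
            this.name = name;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
        }
    }

    /** True if the file's range intersects the query range [minStamp, maxStamp). */
    static boolean overlaps(FileTimeRange file, long minStamp, long maxStamp) {
        return file.minTimestamp < maxStamp && file.maxTimestamp >= minStamp;
    }

    /** Returns the names of the files that still have to be read; the rest are skipped. */
    static List<String> selectFiles(List<FileTimeRange> files, long minStamp, long maxStamp) {
        List<String> selected = new ArrayList<String>();
        for (FileTimeRange file : files) {
            if (overlaps(file, minStamp, maxStamp)) {
                selected.add(file.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileTimeRange> files = Arrays.asList(
                new FileTimeRange("hfile-a", 100L, 199L),
                new FileTimeRange("hfile-b", 200L, 299L),
                new FileTimeRange("hfile-c", 300L, 399L));
        // Query range [250, 300): only hfile-b overlaps it.
        System.out.println(selectFiles(files, 250L, 300L)); // prints [hfile-b]
    }
}
```

The check only decides which files can be skipped entirely. Cells inside a selected file would still need to be filtered individually against the requested range.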