HBase >> mail # user >> querying data on the basis of timestamp
Re: querying data on the basis of timestamp
Thanks for looking at the code.

A recent improvement in this area was HBASE-8063, "Filter HFiles based on
first/last key".

Cheers

On Fri, Mar 15, 2013 at 7:05 AM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:

> Hi Ted,
>
> Thanks for the response, it does look very relevant. Here's my
> understanding (from looking at the relevant code in the patch and around it):
> Each StoreFile knows the range of timestamps of the values it contains, and
> this range is kept in its metadata. When the store file is loaded, the range
> is available in the TimeRangeTracker object. When a query with a time range
> is made to a StoreFile, the file is filtered based on the time range it is
> known to contain. Thus, if the time range in the query doesn't overlap with
> the time range of the store file, the file returns nothing quickly, without
> going through its entire contents. This would mean that on a rowKey +
> timeRange query, all StoreFiles corresponding to the rowKey range will be
> hit, but the ones that don't have an overlapping time range will only result
> in a metadata lookup.
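
The overlap check described above can be sketched roughly as follows. This is a simplified, self-contained model and not HBase's actual code: `FileMeta`, `mayContain`, and `filesToRead` are hypothetical names, and the query range is assumed to be half-open, [minStamp, maxStamp), matching the `setTimeRange` description quoted further down in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of time-range pruning; these are not HBase classes.
class TimeRangeFileSelection {

    // Stand-in for the timestamp range a store file keeps in its metadata.
    static final class FileMeta {
        final String name;
        final long minTimestamp; // oldest cell timestamp in the file (inclusive)
        final long maxTimestamp; // newest cell timestamp in the file (inclusive)

        FileMeta(String name, long minTimestamp, long maxTimestamp) {
            this.name = name;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
        }

        // True if the file could hold a cell inside the query range [minStamp, maxStamp).
        boolean mayContain(long minStamp, long maxStamp) {
            return minTimestamp < maxStamp && maxTimestamp >= minStamp;
        }
    }

    // Returns the names of the files that must actually be read;
    // every other file costs only this metadata comparison.
    static List<String> filesToRead(List<FileMeta> files, long minStamp, long maxStamp) {
        List<String> selected = new ArrayList<>();
        for (FileMeta file : files) {
            if (file.mayContain(minStamp, maxStamp)) {
                selected.add(file.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileMeta> files = Arrays.asList(
                new FileMeta("f1", 100, 200),
                new FileMeta("f2", 201, 300),
                new FileMeta("f3", 301, 400));
        // Query range [250, 400): f1 ends before 250, so it is skipped.
        System.out.println(filesToRead(files, 250, 400)); // prints [f2, f3]
    }
}
```

In this model the benefit depends on files covering mostly disjoint time ranges; if every file spans the whole query range, nothing is skipped.
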
>
> Please correct me if I am wrong.
>
> Thanks Again,
> Pankaj
>
>
> On Thu, Mar 14, 2013 at 4:03 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > What you are asking looks similar to this:
> > HBASE-5010 Filter HFiles based on TTL
> >
> > It went into 0.94.0
> >
> > Cheers
> >
> > On Thu, Mar 14, 2013 at 3:53 PM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > I have a question regarding query performance for rows newer than a
> > > timestamp. The use case is this:
> > > I want to find all the rows in a key range that have changed after a
> > > certain timestamp and up to a certain timestamp, i.e. exactly using this
> > > Scan API:
> > > Scan    setTimeRange(long minStamp, long maxStamp)
> > >           Get versions of columns only within the specified timestamp
> > > range, [minStamp, maxStamp)
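
As a point of comparison, the semantics of this query can be modeled without HBase at all. The sketch below is a hypothetical in-memory stand-in (a sorted map from row key to version timestamps; `rowsChangedIn` is an invented name) that assumes the stop row is exclusive and the time range is half-open, [minStamp, maxStamp). It visits every row in the key range, which is the unoptimized behavior the question is asking about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a key-range + time-range scan; not HBase code.
class TimeRangeScanSketch {

    // Returns the row keys in [startRow, stopRow) that have at least one
    // version whose timestamp falls in the half-open range [minStamp, maxStamp).
    static List<String> rowsChangedIn(TreeMap<String, List<Long>> table,
                                      String startRow, String stopRow,
                                      long minStamp, long maxStamp) {
        List<String> changed = new ArrayList<>();
        // Brute force: every row in the key range is examined.
        for (Map.Entry<String, List<Long>> row
                : table.subMap(startRow, true, stopRow, false).entrySet()) {
            for (long timestamp : row.getValue()) {
                if (timestamp >= minStamp && timestamp < maxStamp) {
                    changed.add(row.getKey());
                    break; // one matching version is enough for this row
                }
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        TreeMap<String, List<Long>> table = new TreeMap<>();
        table.put("row1", Arrays.asList(100L, 250L));
        table.put("row2", Arrays.asList(50L));
        table.put("row3", Arrays.asList(300L)); // 300 equals maxStamp, so it is excluded
        table.put("row4", Arrays.asList(260L)); // outside the key range below

        System.out.println(rowsChangedIn(table, "row1", "row4", 200L, 300L)); // prints [row1]
    }
}
```
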
> > >
> > > Would this query go through all the rows in the key range, or is there
> > > an optimization that makes it faster?
> > >
> > > I ask because I read about such an optimization in the following paper:
> > >
> > > http://oss.csie.fju.edu.tw/~tzu98/Apache%20Hadoop%20Goes%20Realtime%20at%20Facebook.pdf
> > >
> > > Here is the excerpt:
> > > "For data stored in HBase that is time-series or contains a specific,
> > > known timestamp, a special timestamp file selection algorithm
> > > was added. Since time moves forward and data is rarely inserted
> > > at a significantly later time than its timestamp, each HFile will
> > > generally contain values for a fixed range of time. This
> > > information is stored as metadata in each HFile and queries that
> > > ask for a specific timestamp or range of timestamps will check if
> > > the request intersects with the ranges of each file, skipping those
> > > which do not overlap. "
> > >
> > >
> > > This would work perfectly for my use case, but I don't know if this
> > > optimization, or any other for this use case, exists in Apache HBase.
> > > The version of Apache HBase we are currently using is 0.92.1, but we are
> > > considering moving to 0.94.
> > >
> > > Thanks,
> > > Pankaj
> >