HBase >> mail # user >> querying data on the basis of timestamp


Re: querying data on the basis of timestamp
Thanks for looking at the code.

A recent improvement in this area was HBASE-8063 (Filter HFiles based on
first/last key).

Cheers

On Fri, Mar 15, 2013 at 7:05 AM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:

> Hi Ted,
>
> Thanks for the response, it does look very relevant. Here's my
> understanding (from looking at the relevant code in the patch and around it):
> Each StoreFile knows the range of value timestamps that it contains, and it
> is kept in its metadata. When the store file is loaded this is available in
> the TimeRangeTracker object. When queries with timerange are made to a
> StoreFile, it filters them based on the knowledge of timerange values it
> contains. Thus if the timerange in query doesn't overlap with timerange of
> the store file then it will quickly return none without having to go
> through the entire contents of the file. This would mean that on a rowKey +
> timeRange query all StoreFiles corresponding to rowKey range will be hit
> but the ones that don't have overlapping time range will only result in a
> metadata lookup.
>
> Please correct me if I am wrong.
>
> Thanks Again,
> Pankaj
>
>
> On Thu, Mar 14, 2013 at 4:03 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
> > What you are asking looks similar to this:
> > HBASE-5010 Filter HFiles based on TTL
> >
> > It went into 0.94.0
> >
> > Cheers
> >
> > On Thu, Mar 14, 2013 at 3:53 PM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > I have a question regarding query performance for rows greater than a
> > > timestamp. The use case is this:
> > > I want to find all the rows in a key range that have changed after a
> > > certain timestamp and up to a certain timestamp, i.e. exactly using this
> > > Scan API:
> > > Scan    setTimeRange(long minStamp, long maxStamp)
> > >           Get versions of columns only within the specified timestamp
> > >           range, [minStamp, maxStamp)
> > >
> > > Would this query go through all the rows in the key range, or is there an
> > > optimization that makes it faster?
> > >
> > > I ask because I read about such an optimization in the following paper:
> > >
> > > http://oss.csie.fju.edu.tw/~tzu98/Apache%20Hadoop%20Goes%20Realtime%20at%20Facebook.pdf
> > >
> > > Here is the excerpt:
> > > "For data stored in HBase that is time-series or contains a specific,
> > > known timestamp, a special timestamp file selection algorithm
> > > was added. Since time moves forward and data is rarely inserted
> > > at a significantly later time than its timestamp, each HFile will
> > > generally contain values for a fixed range of time. This
> > > information is stored as metadata in each HFile and queries that
> > > ask for a specific timestamp or range of timestamps will check if
> > > the request intersects with the ranges of each file, skipping those
> > > which do not overlap. "
> > >
> > >
> > > This will work perfectly for my use case, but I don't know if this
> > > optimization, or any other for this use case, exists in Apache HBase.
> > > The version of Apache HBase we are currently using is 0.92.1, but we are
> > > considering moving to 0.94.
> > >
> > > Thanks,
> > > Pankaj
> >
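
The file-skipping idea discussed in this thread (each store file records the range of timestamps it holds, and a time-range query skips files whose range does not intersect the request) can be illustrated with a small, self-contained sketch. This is not HBase code: FileTimeRange, mayContain, selectFiles and the file names are hypothetical names made up for the illustration, and the only thing taken from the thread is the half-open query range [minStamp, maxStamp) and the per-file min/max timestamp metadata.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the time-range metadata of one store file.
// Not an HBase class; it only models "smallest and largest timestamp seen".
final class FileTimeRange {
    final String name;
    final long minTs; // smallest cell timestamp in the file (inclusive)
    final long maxTs; // largest cell timestamp in the file (inclusive)

    FileTimeRange(String name, long minTs, long maxTs) {
        this.name = name;
        this.minTs = minTs;
        this.maxTs = maxTs;
    }

    // True if the file may hold cells inside the half-open query range
    // [minStamp, maxStamp). Only the metadata is consulted, never the data.
    boolean mayContain(long minStamp, long maxStamp) {
        return minTs < maxStamp && maxTs >= minStamp;
    }
}

final class TimeRangeSketch {
    // Keep only the files whose recorded range intersects the query range.
    static List<String> selectFiles(List<FileTimeRange> files,
                                    long minStamp, long maxStamp) {
        List<String> selected = new ArrayList<>();
        for (FileTimeRange f : files) {
            if (f.mayContain(minStamp, maxStamp)) {
                selected.add(f.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileTimeRange> files = new ArrayList<>();
        files.add(new FileTimeRange("hfile-a", 100L, 199L));
        files.add(new FileTimeRange("hfile-b", 200L, 299L));
        files.add(new FileTimeRange("hfile-c", 300L, 399L));

        // Query range [250, 300): only hfile-b overlaps it.
        System.out.println(selectFiles(files, 250L, 300L)); // prints [hfile-b]
    }
}
```

In this sketch hfile-c is skipped even though its smallest timestamp equals maxStamp, because the upper bound of the query range is exclusive. Files that pass the check still have to be read; the metadata can only rule files out, not confirm that matching cells exist.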