Re: Write TimeSeries Data and Do Time Based Range Scans

On Mon, Sep 23, 2013 at 6:15 PM, Shahab Yunus <[EMAIL PROTECTED]> wrote:

> Yeah, I saw that. In fact, that is why I recommended it to you, as I
> couldn't infer from your email whether you had already gone through
> that source or not.

Yes, I was aware of that article. But my read pattern is slightly different
from the one in that article. We are using HBase as the data source for a
RESTful service. Even if my range scan finds 400 rows within a specified
time range, I only return the top 20 for one REST request. So if I do
bucketing (let's say bucket=10), I will need to fetch 20 results from each
bucket, then do a merge sort on the client side and return the final 20.
You can assume that I need to return the 20 rows sorted by timestamp.
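
A minimal sketch of the read path described above, not code from this thread or from HBaseWD. It assumes a hypothetical key layout (a 1-byte bucket followed by an 8-byte big-endian epoch), ascending timestamp order, and an in-memory sorted list standing in for the index table. The names `salted_key`, `bucket_ranges`, `scan` and `top_n` are illustrative only.

```python
import bisect
import heapq
from itertools import islice

NUM_BUCKETS = 10   # bucket count from the example above
PAGE_SIZE = 20     # rows returned per REST request


def salted_key(ts: int) -> bytes:
    """Assumed layout: 1-byte bucket (ts % NUM_BUCKETS) + 8-byte big-endian epoch."""
    return bytes([ts % NUM_BUCKETS]) + ts.to_bytes(8, "big")


def bucket_ranges(start_ts: int, stop_ts: int):
    """One (start_key, stop_key) pair per bucket covering [start_ts, stop_ts)."""
    lo, hi = start_ts.to_bytes(8, "big"), stop_ts.to_bytes(8, "big")
    return [(bytes([b]) + lo, bytes([b]) + hi) for b in range(NUM_BUCKETS)]


def scan(table, start_key, stop_key, limit):
    """In-memory stand-in for a limited range scan over sorted (key, value) rows."""
    keys = [k for k, _ in table]
    first = bisect.bisect_left(keys, start_key)
    last = bisect.bisect_left(keys, stop_key)
    return table[first:last][:limit]


def top_n(per_bucket_rows, n=PAGE_SIZE):
    """Merge per-bucket results, each already in timestamp order, and keep the first n."""
    # Compare on the key without its bucket byte, i.e. on the timestamp bytes.
    merged = heapq.merge(*per_bucket_rows, key=lambda row: row[0][1:])
    return list(islice(merged, n))


if __name__ == "__main__":
    # 400 events in the time range, as in the example above.
    table = sorted((salted_key(ts), ts) for ts in range(1000, 1400))
    per_bucket = [scan(table, lo, hi, PAGE_SIZE)
                  for lo, hi in bucket_ranges(1000, 1400)]
    print([ts for _, ts in top_n(per_bucket)])
```

With 10 buckets and a page size of 20, this reads up to 200 rows to return 20, which is the extra read cost in question. The merge itself is cheap because each bucket's rows already arrive sorted.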

> A source that did the exact same thing and discusses it in much more
> detail, with concerns aligning with yours (in fact, I think some of the
> authors/creators of that link/group are members of this community as
> well).

Do you know what the outcome of their experiment was? Do you have a link
for that? Thanks for your time and help.
>
> Regards,
> Shahab
>
>
> On Mon, Sep 23, 2013 at 8:41 PM, anil gupta <[EMAIL PROTECTED]> wrote:
>
> > Hi Shahab,
> >
> > If you read my solution carefully, I am already doing that.
> >
> > Thanks,
> > Anil Gupta
> >
> >
> > On Mon, Sep 23, 2013 at 3:51 PM, Shahab Yunus <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > >
> > > Here you can find the discussion, trade-offs and working code/API (even
> > > for M/R) about this and the approach you are trying out.
> > >
> > > Regards,
> > > Shahab
> > >
> > >
> > > On Mon, Sep 23, 2013 at 5:41 PM, anil gupta <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I have a secondary index (inverted index) table with a rowkey based on
> > > > the timestamp of an event. Assume the rowkey is <TimeStamp in Epoch>.
> > > > I also store some extra columns (apart from the main_table rowkey) in
> > > > that table for filtering.
> > > >
> > > > The requirement is to do range-based scans on the time of the event,
> > > > hence the index with this rowkey.
> > > > I cannot use a hashing or MD5 digest solution because then I cannot do
> > > > range-based scans. And I already have an index like OpenTSDB in another
> > > > table for the same dataset. (I have many secondary indexes for the same
> > > > dataset.)
> > > >
> > > > Problem: When we increase the write workload during a stress test, the
> > > > time secondary index becomes a bottleneck due to the well-known region
> > > > hotspotting problem.
> > > > Solution: I am thinking of adding a prefix of {(<TimeStamp in Epoch> % 10)
> > > > bucket} to the rowkey. Then my rowkey will become:
> > > >  <Bucket><TimeStamp in Epoch>
> > > > By using the above rowkey I can at least alleviate the *WRITE* problem. (I
> > > > don't think the problem can be fixed permanently because of the use case
> > > > requirement. I would love to be proven wrong.)
> > > > However, with the above rowkey, when I want to *READ* data, every
> > > > single range scan has to read from 10 different regions. This
> > > > extra read load is scaring me a bit.
> > > >
> > > > I am wondering if anyone has a better suggestion/approach to solve this
> > > > problem given the constraints I have. Looking for feedback from the
> > > > community.
> > > >
> > > > --
> > > > Thanks & Regards,
> > > > Anil Gupta
> > > >
> > >
> >
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
> >
>

--
Thanks & Regards,
Anil Gupta