HBase, mail # user - Questions on Table design for time series data


Re: Questions on Table design for time series data
Karthikeyan Muthukumarasa... 2012-10-05, 04:52
Jacques: I think you misunderstood my statement. I was only asking you to
reconsider my questions on the assumption that I have seen the Jive video,
since our case differs from Jive's in some respects. I completely
understand that all this is voluntary effort, and my sincere thanks for
your suggestions. I will work through them and get back with updates.
Thanks again!
On Thu, Oct 4, 2012 at 12:29 AM, Jacques <[EMAIL PROTECTED]> wrote:

> We're all volunteers here so we don't always have the time to fully
> understand and plan others' schemas.
>
> In general, your questions seem concerned with a lot of things that
> may or may not matter depending on the specifics of your implementation.
> Without knowing those specifics it is hard to be definitive.  You
> seem to be very worried about the cost of compactions and retention.  Is
> that because you're having issues now?
>
> Short answers:
>
> q1: Unless you have a good reason for splitting up into two tables, I'd
> keep as one.  Pros: Easier to understand/better matches intellectual
> understanding/allows checkAndPuts across both families/data is colocated
> (server, not disk) on retrieval if you want to work with both groups
> simultaneously using get, MR, etc.  Con: There will be some extra
> merge/flush activity if the two columns grow at substantially different
> rates.
>
> q2: 365*1000 regions is problematic (if that is what you're suggesting).
> Even with HFile v2 and partially loaded multi-level indexes, there is still
> quite a bit of overhead per region.  I pointed you at the Jive talk in
> part because hashing that value into a bucket seems a lot more reasonable.
> Additional random idea: if you know the retention policy on insert and your
> data is immutable post insertion, consider shifting the insert timestamp
> and maintaining a single TTL.  It would require more client-side code but
> would allow configurable TTLs while utilizing existing HBase infrastructure.
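
A minimal sketch of the timestamp-shifting idea in q2, in Python for brevity. It assumes a single table-wide TTL of 365 days and relies on the fact that HBase expires a cell once its age (current time minus cell timestamp) exceeds the column family TTL. The names TABLE_TTL and shifted_timestamp_ms are hypothetical and not part of any HBase API; rejecting retentions longer than the table TTL is also an assumption, made here to avoid writing future timestamps.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MS = timedelta(milliseconds=1)

# Assumed: the one TTL configured on the column family.
TABLE_TTL = timedelta(days=365)


def shifted_timestamp_ms(insert_time, retention):
    """Cell timestamp (ms since epoch) to write so that a cell governed by
    TABLE_TTL expires `retention` after `insert_time`."""
    if retention > TABLE_TTL:
        # Assumption: longer retentions would need a future timestamp,
        # which this sketch refuses to produce.
        raise ValueError("retention cannot exceed the table TTL")
    # Back-date the cell by the part of the TTL we do not want to use.
    effective = insert_time - (TABLE_TTL - retention)
    return (effective - EPOCH) // ONE_MS


if __name__ == "__main__":
    now = datetime(2012, 10, 3, tzinfo=timezone.utc)
    print(shifted_timestamp_ms(now, timedelta(days=90)))
```

Because the cell timestamp no longer reflects the real insert time, the real time would have to be carried elsewhere (for example in the row key) if readers need it.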
>
> q3: Sounds like you're prematurely optimizing here.  Maybe others would
> disagree.  I'd use TTL until you find that it isn't performant enough.  The
> tension between flexibility and speed is clear here.  I'd say you either
> need to pick specific TTLs and optimize for that scenario via region
> pruning (e.g. separate tables for each TTL type) or you need to use a more
> general approach that leverages the per-value TTL and compaction
> methodology.  There is enough operational work managing an HBase/HDFS
> cluster without having to worry about specialized region management.
>
> Jacques
>
> On Wed, Oct 3, 2012 at 11:31 AM, Karthikeyan Muthukumarasamy <
> [EMAIL PROTECTED]> wrote:
>
> > Hi Jacques,
> > Thanks for the response!
> > Yes, I have seen the video before. It suggests a TTL-based retention
> > implementation. In their use case, Jive has a fixed retention, say 3 months,
> > so they can pre-create regions for that many buckets; their bucket id is
> > DAY_OF_YEAR%retention_in_days. But in our use case the retention period is
> > configurable, so pre-creating regions based on retention will not work.
> > That's why we went for MMDD-based buckets, which are immune to retention
> > period changes.
> > Now that you know that I've gone through that video from Jive, I would
> > request you to re-read my specific questions and share your suggestions.
> > Thanks & Regards
> > MK
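
The two bucketing schemes compared in the message above can be sketched as follows, in Python for brevity. The function names are hypothetical; the formulas are the ones quoted (DAY_OF_YEAR%retention_in_days for the Jive-style bucket, MMDD for the calendar bucket). The point of the sketch is that the first bucket id depends on the retention setting while the second does not.

```python
from datetime import date


def jive_style_bucket(day, retention_in_days):
    """Bucket id as DAY_OF_YEAR % retention_in_days; changes whenever the
    retention period changes."""
    return day.timetuple().tm_yday % retention_in_days


def mmdd_bucket(day):
    """Bucket id as zero-padded month and day; independent of retention."""
    return day.strftime("%m%d")


if __name__ == "__main__":
    d = date(2012, 10, 3)
    print(jive_style_bucket(d, 90), jive_style_bucket(d, 60), mmdd_bucket(d))
```

With a 90-day retention the example date falls into a different bucket than with a 60-day retention, whereas its MMDD bucket stays the same.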
> >
> >
> >
> > On Wed, Oct 3, 2012 at 11:51 PM, Jacques <[EMAIL PROTECTED]> wrote:
> >
> > > I would suggest you watch this video:
> > >
> > >
> > > http://www.cloudera.com/resource/video-hbasecon-2012-real-performance-gains-with-real-time-data/
> > >
> > > The Jive guys solved a lot of the problems you're talking about and
> > > discuss it in that case study.
> > >
> > >
> > >
> > > On Wed, Oct 3, 2012 at 6:27 AM, Karthikeyan Muthukumarasamy <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > > Our usecase is as follows:
> > > > We have time series data continuously flowing into the system and has