HBase >> mail # user >> Questions on Table design for time series data


Karthikeyan Muthukumarasa... 2012-10-03, 13:27
Jacques 2012-10-03, 18:21
Karthikeyan Muthukumarasa... 2012-10-03, 18:31
Jacques 2012-10-03, 18:59
Eugeny Morozov 2012-10-03, 20:50
Karthikeyan Muthukumarasa... 2012-10-05, 04:53

Re: Questions on Table design for time series data
Jacques: I think you misunderstood my statement. I was only asking
you to look at my questions again, given that I have already seen the Jive
video, since there are some differences between our case and Jive's. I
completely understand that all this is voluntary effort, and my sincere
thanks for your suggestions. I will work through them and get back with
updates. Thanks again!
On Thu, Oct 4, 2012 at 12:29 AM, Jacques <[EMAIL PROTECTED]> wrote:

> We're all volunteers here so we don't always have the time to fully
> understand and plan others' schemas.
>
> In general, your questions seem concerned with a lot of things that
> may or may not matter depending on the specifics of your implementation.
> Without knowing those specifics it is hard to be super definitive.  You
> seem to be very worried about the cost of compactions and retention.  Is
> that because you're having issues now?
>
> Short answers:
>
> q1: Unless you have a good reason for splitting up into two tables, I'd
> keep as one.  Pros: Easier to understand/better matches intellectual
> understanding/allows checkAndPuts across both families/data is colocated
> (server, not disk) on retrieval if you want to work with both groups
> simultaneously using get, MR, etc.  Con: There will be some extra
> merge/flush activity if the two columns grow at substantially different
> rates.
>
> q2: 365*1000 regions is problematic (if that is what you're suggesting).
> Even with HFile v2 and partially loaded multi-level indexes, there is still
> quite a bit of overhead per region.  I pointed you at the Jive thing in
> part since hashing that value as a bucket seems a lot more reasonable.
> Additional random idea: if you know the retention policy on insert and your
> data is immutable post insertion, consider shifting the insert timestamp
> and maintaining a single ttl.  This would require more client-side code but
> would allow configurable ttls while utilizing existing HBase infrastructure.
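
[Editorial sketch] The timestamp-shifting idea in q2 can be illustrated with plain arithmetic. This is a minimal sketch, not Jacques' implementation: it assumes the column family carries one fixed TTL (here called table_ttl_days, a hypothetical name) and that HBase drops a cell once its timestamp plus the TTL is in the past. The function names and the millisecond constants are assumptions made for illustration; no HBase client calls are used.

```python
DAY_MS = 24 * 60 * 60 * 1000  # HBase cell timestamps are in milliseconds


def shifted_timestamp_ms(now_ms, retention_days, table_ttl_days):
    """Cell timestamp to write so the cell expires retention_days after now_ms.

    Assumes the column family TTL is fixed at table_ttl_days and that a cell
    expires when cell_timestamp + TTL < current time.
    """
    if not 0 < retention_days <= table_ttl_days:
        raise ValueError("retention must be in (0, table_ttl_days]")
    # Backdate the cell by the part of the TTL we do not want it to live.
    return now_ms - (table_ttl_days - retention_days) * DAY_MS


def expiry_ms(cell_ts_ms, table_ttl_days):
    """Moment at which a cell with this timestamp falls out of the TTL window."""
    return cell_ts_ms + table_ttl_days * DAY_MS


now = 1_349_300_000_000  # an arbitrary instant in October 2012
ts = shifted_timestamp_ms(now, retention_days=30, table_ttl_days=365)
print(expiry_ms(ts, 365) - now == 30 * DAY_MS)  # True
```

Because the cell timestamp no longer records the real insert time, the event time would have to live in the row key or the value, and the approach only holds under the conditions stated in q2: retention known at insert and data never rewritten afterwards.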
>
> q3: Sounds like you're prematurely optimizing here.  Maybe others would
> disagree.  I'd use ttl until you find that isn't performant enough.  The
> tension between flexibility and speed is clear here.  I'd say you either
> need to pick specific ttls and optimize for that scenario via region
> pruning (e.g. separate tables for each ttl type) or you need to use a more
> general approach that leverages the per value ttl and compaction
> methodology.  There is enough operational work managing an HBase/HDFS
> cluster without having to worry about specialized region management.
>
> Jacques
>
> On Wed, Oct 3, 2012 at 11:31 AM, Karthikeyan Muthukumarasamy <
> [EMAIL PROTECTED]> wrote:
>
> > Hi Jacques,
> > Thanks for the response!
> > Yes, I have seen the video before. It suggests a TTL-based retention
> > implementation. In their use case, Jive has a fixed retention, say 3 months,
> > so they can pre-create regions for that many buckets; their bucket id is
> > DAY_OF_YEAR%retention_in_days. But in our use case the retention period is
> > configurable, so pre-creating regions based on retention will not work.
> > That's why we went for MMDD-based buckets, which are immune to retention
> > period changes.
> > Now that you know that I've gone through that video from Jive, I would
> > request you to re-read my specific questions and share your suggestions.
> > Thanks & Regards
> > MK
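
[Editorial sketch] The two bucketing schemes contrasted in MK's message above can be compared side by side. This is a minimal sketch under assumptions: the function names are hypothetical, and only the formulas DAY_OF_YEAR%retention_in_days and MMDD come from the thread. How the bucket id is combined with the rest of the row key is not shown because the thread does not specify it.

```python
from datetime import date


def jive_bucket(d, retention_in_days):
    """Bucket id as described for Jive: DAY_OF_YEAR % retention_in_days."""
    return d.timetuple().tm_yday % retention_in_days


def mmdd_bucket(d):
    """Bucket id as described by MK: month and day, e.g. '1003' for Oct 3."""
    return d.strftime("%m%d")


d = date(2012, 10, 3)        # day 277 of a leap year
print(jive_bucket(d, 90))    # 7
print(jive_bucket(d, 60))    # 37, the bucket moves when retention changes
print(mmdd_bucket(d))        # 1003, independent of retention
```

The modulo scheme keeps the number of buckets equal to the retention period but remaps days when that period changes, while MMDD always yields up to 366 buckets whatever the retention is.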
> >
> >
> >
> > On Wed, Oct 3, 2012 at 11:51 PM, Jacques <[EMAIL PROTECTED]> wrote:
> >
> > > I would suggest you watch this video:
> > >
> > > http://www.cloudera.com/resource/video-hbasecon-2012-real-performance-gains-with-real-time-data/
> > >
> > > The Jive guys solved a lot of the problems you're talking about and
> > > discuss it in that case study.
> > >
> > >
> > >
> > > On Wed, Oct 3, 2012 at 6:27 AM, Karthikeyan Muthukumarasamy <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > > Our usecase is as follows:
> > > > We have time series data continuously flowing into the system and has