Re: Questions on Table design for time series data
Karthikeyan Muthukumarasa... 2012-10-05, 04:52
Jacques: I think you got me wrong on my statement. I was only requesting
you to think again about my questions assuming that I have seen the Jive
video, since there are some differences in our case compared to Jive.

I completely understand that all this is voluntary effort and my sincere
thanks for your suggestions. I will work through them and get back with
updates. Thanks again!

On Thu, Oct 4, 2012 at 12:29 AM, Jacques <[EMAIL PROTECTED]> wrote:

> We're all volunteers here so we don't always have the time to fully
> understand and plan others' schemas.
>
> In general your questions seemed to be worried about a lot of things
> that may or may not matter depending on the specifics of your
> implementation. Without knowing those specifics it is hard to be super
> definitive. You seem to be very worried about the cost of compactions
> and retention. Is that because you're having issues now?
>
> Short answers:
>
> q1: Unless you have a good reason for splitting up into two tables, I'd
> keep as one. Pros: Easier to understand/better matches intellectual
> understanding/allows checkAndPuts across both families/data is
> colocated (server, not disk) on retrieval if you want to work with both
> groups simultaneously using get, MR, etc. Con: There will be some extra
> merge/flush activity if the two columns grow at substantially different
> rates.
>
> q2: 365*1000 regions is problematic (if that is what you're
> suggesting). Even with HFilev2 and partially loaded multi-level
> indexes, there is still quite a bit of overhead per region. I pointed
> you at the Jive thing in part since hashing that value as a bucket
> seems a lot more reasonable. Additional random idea: if you know the
> retention policy on insert and your data is immutable post insertion,
> consider shifting the insert timestamp and maintaining a single ttl.
> Would require more client side code but would allow configurable ttls
> while utilizing existing HBase infrastructure.
>
> q3: Sounds like you're prematurely optimizing here. Maybe others would
> disagree. I'd use ttl until you find that isn't performant enough. The
> tension between flexibility and speed is clear here. I'd say you either
> need to pick specific ttls and optimize for that scenario via region
> pruning (e.g. separate tables for each ttl type) or you need to use a
> more general approach that leverages the per value ttl and compaction
> methodology. There is enough operational work managing an HBase/HDFS
> cluster without having to worry about specialized region management.
>
> Jacques
>
> On Wed, Oct 3, 2012 at 11:31 AM, Karthikeyan Muthukumarasamy
> <[EMAIL PROTECTED]> wrote:
>
> > Hi Jacques,
> > Thanks for the response!
> > Yes, I have seen the video before. It suggests usage of a TTL based
> > retention implementation. In their use case, Jive has a fixed
> > retention, say 3 months, and so they can pre-create regions for so
> > many buckets; their bucket id is DAY_OF_YEAR%retention_in_days. But
> > in our use case, the retention period is configurable, so
> > pre-creating regions based on retention will not work. That's why we
> > went for MMDD based buckets, which are immune to retention period
> > changes.
> > Now that you know that I've gone through that video from Jive, I
> > would request you to re-read my specific questions and share your
> > suggestions.
> > Thanks & Regards
> > MK
> >
> > On Wed, Oct 3, 2012 at 11:51 PM, Jacques <[EMAIL PROTECTED]> wrote:
> >
> > > I would suggest you watch this video:
> > >
> > > http://www.cloudera.com/resource/video-hbasecon-2012-real-performance-gains-with-real-time-data/
> > >
> > > The Jive guys solved a lot of the problems you're talking about and
> > > discuss it in that case study.
> > >
> > > On Wed, Oct 3, 2012 at 6:27 AM, Karthikeyan Muthukumarasamy
> > > <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > > Our use case is as follows:
> > > > We have time series data continuously flowing into the system and has