Re: Questions on Table design for time series data
Karthikeyan Muthukumarasa... 2012-10-05, 04:53
Thanks Eugeny. We are currently running some experiments based on your
On Thu, Oct 4, 2012 at 2:20 AM, Eugeny Morozov <[EMAIL PROTECTED]> wrote:
> I'd suggest thinking about manual major compactions and splits. Combining
> manual compactions with bulk loads lets you control how data is divided
> into HFiles. For example, if you read the last 3 months more often than all
> other data, you could keep one HFile for each of those three months and one
> HFile for everything older. Restricting the scan with Scan.setTimeRange
> would then let it read only those three HFiles, so the scan would be faster.
> Moreover, if you have a TTL of about one month, there is no need to run a
> major compaction every day (as in automatic mode). In particular, when using
> bulk loads you basically control the size of the output HFiles through the
> size of the input. Say you give the input for the last two weeks and get one
> HFile per region covering those two weeks.
> Using the new Coprocessor feature, you could hook into the
> compactSelection process and alter the compaction by choosing the HFiles you
> would like to process. That allows you to combine particular HFiles.
> All of that lets you run a major compaction just once or twice a month. A
> major compaction over a huge amount of data is a heavy operation: the rarer,
> the better.
> Though without monitoring and measurement it looks like premature optimization.
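The file-pruning idea in the message above can be illustrated with a small model. This is only a sketch in plain Python, not HBase client code: `HFileMeta` and `files_to_read` are hypothetical names, the timestamps are arbitrary integers, and the half-open range is an assumption chosen to mirror how `Scan.setTimeRange(min, max)` treats its bounds. It shows why keeping one HFile per recent month lets a time-restricted scan skip the large file of older data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HFileMeta:
    """Hypothetical stand-in for the time range recorded for one HFile."""
    name: str
    min_ts: int  # smallest cell timestamp in the file (inclusive)
    max_ts: int  # largest cell timestamp in the file (inclusive)


def files_to_read(files: List[HFileMeta], scan_min: int, scan_max: int) -> List[str]:
    """Return names of files whose time range overlaps [scan_min, scan_max).

    A file can be skipped when every cell in it is older than scan_min
    or at/after scan_max.
    """
    return [
        f.name
        for f in files
        if f.max_ts >= scan_min and f.min_ts < scan_max
    ]


# One file per recent "month" plus one file for everything older.
store = [
    HFileMeta("older", 0, 99),
    HFileMeta("month-1", 100, 129),
    HFileMeta("month-2", 130, 159),
    HFileMeta("month-3", 160, 189),
]
recent = files_to_read(store, 100, 190)   # touches only the three monthly files
single = files_to_read(store, 130, 160)   # touches only "month-2"
```

If the same data had been compacted into one file spanning 0 to 189, every time-restricted scan would have to open it, which is the cost the manual layout is meant to avoid.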
> On Wed, Oct 3, 2012 at 10:59 PM, Jacques <[EMAIL PROTECTED]> wrote:
> > We're all volunteers here so we don't always have the time to fully
> > understand and plan others' schemas.
> > In general, your questions seem concerned with a lot of things that
> > may or may not matter depending on the specifics of your implementation.
> > Without knowing those specifics it is hard to be super definitive. You
> > seem to be very worried about the cost of compactions and retention. Is
> > that because you're having issues now?
> > Short answers:
> > q1: Unless you have a good reason for splitting up into two tables, I'd
> > keep as one. Pros: Easier to understand/better matches intellectual
> > understanding/allows checkAndPuts across both families/data is colocated
> > (server, not disk) on retrieval if you want to work with both groups
> > simultaneously using get, MR, etc. Con: There will be some extra
> > merge/flush activity if the two columns grow at substantially different
> > rates.
> > q2: 365*1000 regions is problematic (if that is what you're suggesting).
> > Even with HFile v2 and partially loaded multi-level indexes, there is
> > quite a bit of overhead per region. I pointed you at the Jive thing in
> > part since hashing that value as a bucket seems a lot more reasonable.
> > Additional random idea: if you know the retention policy at insert time
> > and your data is immutable after insertion, consider shifting the insert
> > timestamp and maintaining a single TTL. This would require more client-side
> > code but would allow configurable TTLs while using existing HBase
> > infrastructure.
> > q3: Sounds like you're prematurely optimizing here. Maybe others would
> > disagree. I'd use ttl until you find that isn't performant enough. The
> > tension between flexibility and speed is clear here. I'd say you either
> > need to pick specific ttls and optimize for that scenario via region
> > pruning (e.g. separate tables for each ttl type) or you need to use a
> > general approach that leverages the per value ttl and compaction
> > methodology. There is enough operational work managing an HBase/HDFS
> > cluster without having to worry about specialized region management.
> > Jacques
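The timestamp-shifting idea from q2 above comes down to a little arithmetic, sketched here in plain Python. This is one possible reading of the suggestion, not something spelled out in the thread: it assumes the table's single TTL is set to the longest retention you need, and that each cell is written with a timestamp moved into the past by the difference between that TTL and the cell's own retention. The function names are hypothetical.

```python
DAY_MS = 24 * 60 * 60 * 1000


def shifted_timestamp_ms(now_ms: int, retention_ms: int, table_ttl_ms: int) -> int:
    """Timestamp to write so a cell expires retention_ms after now_ms.

    Assumes the column family has one fixed TTL (table_ttl_ms) and that a cell
    expires once its timestamp is older than that TTL.
    """
    if not 0 < retention_ms <= table_ttl_ms:
        # A retention longer than the TTL would need a future timestamp,
        # which this sketch deliberately refuses.
        raise ValueError("retention must be in (0, table_ttl_ms]")
    return now_ms - (table_ttl_ms - retention_ms)


def expires_at_ms(cell_ts_ms: int, table_ttl_ms: int) -> int:
    """Moment at which a cell with this timestamp falls outside the TTL."""
    return cell_ts_ms + table_ttl_ms


now = 1_349_400_000_000          # arbitrary example instant in epoch millis
ttl = 365 * DAY_MS               # single TTL configured on the table
ts_90d = shifted_timestamp_ms(now, 90 * DAY_MS, ttl)
ts_full = shifted_timestamp_ms(now, ttl, ttl)
```

Because the stored timestamp no longer reflects the real insert time, the real time would have to be kept elsewhere (for example in the row key or a column), and the trick only holds if cells are never rewritten, as the original suggestion notes.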
> > On Wed, Oct 3, 2012 at 11:31 AM, Karthikeyan Muthukumarasamy <
> > [EMAIL PROTECTED]> wrote:
> > > Hi Jacques,
> > > Thanks for the response!
> > > Yes, I have seen the video before. It suggests usage of TTL based retention
> > > implementation. In their usecase, Jive has a fixed retention say 3
> > > and so they can pre-create regions for so many buckets, their bucket id