HBase >> mail # user >> Questions on Table design for time series data


Re: Questions on Table design for time series data
I would suggest you watch this video:
http://www.cloudera.com/resource/video-hbasecon-2012-real-performance-gains-with-real-time-data/

The Jive guys solved a lot of the problems you're talking about and discuss
them in that case study.

On Wed, Oct 3, 2012 at 6:27 AM, Karthikeyan Muthukumarasamy <
[EMAIL PROTECTED]> wrote:

> Hi,
> Our use case is as follows:
> We have time series data continuously flowing into the system, and it has
> to be stored in HBase.
> Subscriber Mobile Number (a.k.a. MSISDN) is the primary identifier based on
> which data is stored and later retrieved.
> There are two sets of parameters that get stored in every record in HBase,
> let's call them group1 and group2. The number of records that would have
> group1 parameters would be approx. 6 per day and the same for group2
> parameters is approx. 1 per 3 days (their cardinality is different).
>
> Typically, the retention policy for group1 parameters is 3 months and for
> group2 parameters is 1 year. The read-pattern is as follows: An online
> query would ask for records matching an MSISDN for a given date range, and
> the system needs to respond with all available data (both from group1 and
> group2) satisfying the MSISDN and date range filters.
>
> Question1:
> Alternative1: Create a single table with G1 and G2 as two column families.
> Alternative2: Create two tables, one for each group.
> Which is the better alternative and what are the pros and cons?
>
>
> Question2:
> To achieve max. distribution during write and reasonable complexity during
> read, we decided on the following row key design:
> <last 3 digits of MSISDN>,<MMDD>,<full MSISDN>
> We will manually pre-split regions for the table based on the <last 3
> digits of MSISDN>,<MMDD> part of the row key.
> So there are 1000 (from 3 digits of MSISDN) * 365 (from MMDD) buckets that
> would translate to as many regions.
> In this case, when retention is configured as < 1 year, the design looks
> optimal.
> When retention is configured > 1 year, one region might store data for more
> than 1 day (Feb 1 of 2012 and also Feb 1 of 2013), which means more data
> has to be handled by HBase during compactions and reads.
> An alternative key design, which does not have the above disadvantage, is:
> <last 3 digits of MSISDN>,<YYYYMMDD>,<full MSISDN>
> This way, one region will hold only one day's data at any point,
> regardless of retention.
> What are other pros & cons of the two key designs?
>
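
To make the second key layout in Question2 concrete, here is a minimal sketch in plain Java (no HBase dependency). The class and method names (RowKeys, salt, rowKey, scanRange, splitPoints) and the sample MSISDN are made up for illustration; only the key layout and the 1000 * 365 arithmetic come from the message above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: builds keys of the form
// <last 3 digits of MSISDN>,<YYYYMMDD>,<full MSISDN>
final class RowKeys {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("uuuuMMdd");

    // Prefix used to spread writes: the last 3 digits of the MSISDN.
    static String salt(String msisdn) {
        return msisdn.substring(msisdn.length() - 3);
    }

    static String rowKey(String msisdn, LocalDate day) {
        return salt(msisdn) + "," + day.format(DAY) + "," + msisdn;
    }

    // Start key (inclusive) and stop key (exclusive) covering the days
    // from..to for one MSISDN. Because the date sorts before the full
    // MSISDN, this range also contains other subscribers that share the
    // same last 3 digits, so the reader still has to filter on the MSISDN.
    static String[] scanRange(String msisdn, LocalDate from, LocalDate to) {
        String start = salt(msisdn) + "," + from.format(DAY) + ",";
        String stop = salt(msisdn) + "," + to.plusDays(1).format(DAY) + ",";
        return new String[] {start, stop};
    }

    // One candidate split point per (prefix, day) bucket, in sorted order.
    static List<String> splitPoints(LocalDate firstDay, int days) {
        List<String> points = new ArrayList<>();
        for (int s = 0; s < 1000; s++) {
            for (int d = 0; d < days; d++) {
                points.add(String.format("%03d,%s", s, firstDay.plusDays(d).format(DAY)));
            }
        }
        return points;
    }
}
```

For example, rowKey("919812345678", LocalDate.of(2012, 10, 3)) yields "678,20121003,919812345678", and splitPoints(LocalDate.of(2012, 1, 1), 365) produces 365,000 entries, which matches the 1000 * 365 bucket count quoted above.
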
> Question3:
> In our use case, delete happens only based on retention policy, where one
> full day's data has to be deleted when the retention period is crossed
> (e.g., if retention is 30 days, on Apr 1 all the data for Mar 1 is deleted).
> What is the most optimal way to implement this retention policy?
> Alternative 1: TTL for the column family is configured and we leave it to
> HBase to delete data during major compaction, but we are not sure of the
> cost of this major compaction happening in all regions at the same time.
> Alternative 2: Through the key design logic mentioned before, if we ensure
> data for one day goes into one set of regions, can we use HBase APIs like
> HFileArchiver to programmatically archive and drop regions?
>
> Thanks & Regards
> MK
>
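
On the retention arithmetic in Question3: HBase takes a column family's TTL in seconds, so the two retention periods translate into plain numbers. The sketch below is only an illustration in plain Java. The class and method names are hypothetical, and treating "3 months" as 90 days is my assumption, not something stated in the message.

```java
import java.time.Duration;
import java.time.LocalDate;

// Hypothetical helper, not part of HBase.
final class Retention {
    // Column-family TTL value in seconds for a retention given in days.
    static long ttlSeconds(int retentionDays) {
        return Duration.ofDays(retentionDays).getSeconds();
    }

    // Newest calendar day whose data is entirely older than the retention
    // period at the start of 'today'. Data written late on day D only
    // ages out retentionDays later, hence the extra day.
    static LocalDate newestFullyExpiredDay(LocalDate today, int retentionDays) {
        return today.minusDays(retentionDays + 1L);
    }

    public static void main(String[] args) {
        System.out.println(ttlSeconds(90));   // group1, assuming 3 months = 90 days
        System.out.println(ttlSeconds(365));  // group2, 1 year
        System.out.println(newestFullyExpiredDay(LocalDate.of(2013, 4, 1), 30));
    }
}
```

With a 30-day retention and today being Apr 1, newestFullyExpiredDay returns Mar 1, which agrees with the example in the question. The same date, formatted as YYYYMMDD, is the day component of the row keys that would be eligible for removal under Alternative 2.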