Our use case is as follows:
We have time-series data continuously flowing into the system, and it has to be
stored in HBase.
Subscriber Mobile Number (a.k.a. MSISDN) is the primary identifier based on
which data is stored and later retrieved.
There are two sets of parameters that get stored in every record in HBase;
let's call them group1 and group2. The number of records that would have
group1 parameters is approx. 6 per day, and for group2 parameters it is
approx. 1 per 3 days (their cardinality is different).
Typically, the retention policy for group1 parameters is 3 months and for
group2 parameters is 1 year. The read pattern is as follows: an online
query asks for records matching an MSISDN for a given date range, and
the system needs to respond with all available data (both from group1 and
group2) satisfying the MSISDN and date range filters.
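To put rough numbers on the rates and retention periods above, here is a minimal sketch of the steady-state record count. It assumes the stated rates are per MSISDN, and that "3 months" means 90 days and "1 year" means 365 days; neither assumption is stated explicitly above.

```python
# Approximate live record counts implied by the stated rates and retention.
# Assumptions (not stated in the post): rates are per MSISDN,
# "3 months" = 90 days, "1 year" = 365 days.
G1_RECORDS_PER_DAY = 6
G2_RECORDS_PER_DAY = 1 / 3      # roughly one record every three days
G1_RETENTION_DAYS = 90
G2_RETENTION_DAYS = 365

def retained_records(records_per_day, retention_days):
    """Approximate number of records still within retention at steady state."""
    return round(records_per_day * retention_days)

g1_live = retained_records(G1_RECORDS_PER_DAY, G1_RETENTION_DAYS)
g2_live = retained_records(G2_RECORDS_PER_DAY, G2_RETENTION_DAYS)
print(g1_live, g2_live)
```

Under these assumptions group1 dominates the stored volume even though its retention is shorter.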
Alternative 1: Create a single table with G1 and G2 as two column families.
Alternative 2: Create two tables, one for each group.
Which is the better alternative, and what are the pros and cons?
To achieve max. distribution during write and reasonable complexity during
read, we decided on the following row key design:
<last 3 digits of MSISDN>,<MMDD>,<full MSISDN>
We will manually pre-split regions for the table based on the <last 3
digits of MSISDN>,<MMDD> part of the row key.
So there are 1000 (from 3 digits of MSISDN) * 365 (from MMDD) = 365,000 buckets
that would translate to as many regions.
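The key layout and bucket count can be sketched as follows. This is only an illustration: the function name and the sample MSISDN are made up, and the comma separator is taken literally from the key template above.

```python
from datetime import date

def row_key_mmdd(msisdn: str, day: date) -> str:
    """Build <last 3 digits of MSISDN>,<MMDD>,<full MSISDN>."""
    salt = msisdn[-3:]                      # last 3 digits spread the writes
    return f"{salt},{day:%m%d},{msisdn}"

SALT_BUCKETS = 10 ** 3                      # 000..999
DAYS_PER_YEAR = 365                         # as in the post; Feb 29 would add one more
total_buckets = SALT_BUCKETS * DAYS_PER_YEAR

# Hypothetical MSISDN, for illustration only.
example_key = row_key_mmdd("919812345678", date(2012, 10, 3))
print(example_key, total_buckets)
```

Note that MMDD has 366 possible values once leap days are counted, so the pre-split list would need an entry for 0229 as well.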
In this case, when retention is configured as < 1 year, the design looks fine.
When retention is configured > 1 year, one region might store data for more
than 1 day (Feb 1 of 2012 and also Feb 1 of 2013), which means more data is
to be handled by HBase during compactions and reads.
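A small sketch of the overlap, using a made-up MSISDN: with MMDD in the key, the same calendar day in two different years maps to the same prefix, and therefore to the same pre-split region.

```python
from datetime import date

def mmdd_prefix(msisdn: str, day: date) -> str:
    """The <last 3 digits of MSISDN>,<MMDD> part that decides the region."""
    return f"{msisdn[-3:]},{day:%m%d}"

msisdn = "919812345678"                     # hypothetical, for illustration
prefix_2012 = mmdd_prefix(msisdn, date(2012, 2, 1))
prefix_2013 = mmdd_prefix(msisdn, date(2013, 2, 1))

# Both days land in the same bucket because the year is not part of the key.
same_region = prefix_2012 == prefix_2013
print(prefix_2012, prefix_2013, same_region)
```

Since the year is absent from the whole key, not just the prefix, the full row key for a given MSISDN is also identical across years under this design.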
An alternative key design, which does not have the above disadvantage, is:
<last 3 digits of MSISDN>,<YYYYMMDD>,<full MSISDN>
This way, one region will hold only one day's data at any point,
regardless of retention.
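On the read side, one consequence of putting the date before the full MSISDN is that a date-range query for a single MSISDN maps to one key per day rather than one contiguous key range, because rows of other subscribers with the same last 3 digits sit in between. A minimal sketch, with illustrative function names and a made-up MSISDN:

```python
from datetime import date, timedelta

def row_key_yyyymmdd(msisdn: str, day: date) -> str:
    """Build <last 3 digits of MSISDN>,<YYYYMMDD>,<full MSISDN>."""
    return f"{msisdn[-3:]},{day:%Y%m%d},{msisdn}"

def row_keys_for_range(msisdn: str, start: date, end: date) -> list:
    """One row key per calendar day in [start, end], both ends inclusive."""
    days = (end - start).days + 1
    return [row_key_yyyymmdd(msisdn, start + timedelta(days=i))
            for i in range(days)]

# Hypothetical query spanning a year boundary.
keys = row_keys_for_range("919812345678", date(2012, 12, 30), date(2013, 1, 2))
for key in keys:
    print(key)
```

The same per-day fan-out applies to the MMDD design; the YYYYMMDD variant just keeps the keys in true chronological order across year boundaries.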
What are other pros & cons of the two key designs?
In our use case, deletes happen only based on the retention policy, where one
day's full data has to be deleted when the retention period is crossed (e.g., if
retention is 30 days, on Apr 1 all the data for Mar 1 is deleted).
What is the most optimal way to implement this retention policy?
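The example above can be written down as a small date calculation. This sketch assumes "crossed" means the data is strictly older than the retention period, which is the reading that makes Mar 1 the day purged on Apr 1 with a 30-day retention (Mar 1 is 31 days old on Apr 1).

```python
from datetime import date, timedelta

def day_to_purge(today: date, retention_days: int) -> date:
    """Day whose data has just exceeded the retention period.

    Assumption: a day is purged once it is strictly older than
    retention_days, i.e. when its age reaches retention_days + 1.
    """
    return today - timedelta(days=retention_days + 1)

purge_day = day_to_purge(date(2012, 4, 1), 30)
print(purge_day)
```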
Alternative 1: TTL for the column family is configured and we leave it to HBase
to delete data during major compaction, but we are not sure of the cost of
this major compaction happening in all regions at the same time.
Alternative 2: Through the key design logic mentioned before, if we ensure data
for one day goes into one set of regions, can we use HBase APIs like
HFileArchiver to programmatically archive and drop regions?
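To make the two alternatives a bit more concrete, here is a sketch of the values each would work with. HBase column family TTL is expressed in seconds, so the first helper converts the retention periods (taking 3 months as 90 days and 1 year as 365 days, which is an assumption). The second helper lists the region start keys that would hold one day's data under the <last 3 digits of MSISDN>,<YYYYMMDD> pre-split; it does not call any HBase API, and how those regions would actually be archived or dropped is not shown.

```python
from datetime import date

SECONDS_PER_DAY = 24 * 60 * 60

def ttl_seconds(retention_days: int) -> int:
    """Retention period expressed in seconds, as used for a column family TTL."""
    return retention_days * SECONDS_PER_DAY

def region_start_keys(day: date, salt_buckets: int = 1000) -> list:
    """Start keys of the pre-split regions holding `day` (YYYYMMDD key design)."""
    return [f"{salt:03d},{day:%Y%m%d}" for salt in range(salt_buckets)]

g1_ttl = ttl_seconds(90)     # group1: 3 months, assumed to be 90 days
g2_ttl = ttl_seconds(365)    # group2: 1 year, assumed to be 365 days
starts = region_start_keys(date(2012, 3, 1))
print(g1_ttl, g2_ttl, len(starts), starts[0], starts[-1])
```

Since TTL is set per column family, a single table with G1 and G2 as separate families could still carry the two different retention periods.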
Thanks & Regards