The HDFS NameNode will have to deal with lots of small files (currently HBase cannot flush column families independently, so if one is flushed all of them are).
The other reason is that scanning will be slow (if your scan involves many column families, due to the merge sort HBase needs to perform).
Option #1 should be better. HBase is smart enough to scan only the HFiles needed for the key range you provide (Category + Timestamp).
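A minimal sketch of the [category + timestamp] row key from option #1, to illustrate why a key-range scan stays within one category. It assumes a numeric category id packed as a fixed-width 4-byte field and a millisecond timestamp packed as 8 bytes, both big-endian; the names `row_key` and `scan_range` are hypothetical helpers for illustration, not HBase API calls. A real client would pass the resulting start and stop keys to its scan.

```python
import struct

def row_key(category_id: int, ts_millis: int) -> bytes:
    """Build a 12-byte key: 4-byte category id followed by an 8-byte timestamp.

    Big-endian unsigned integers compare bytewise in numeric order, which
    matches the lexicographic byte ordering HBase uses for row keys.
    The fixed widths are an assumption made for this sketch.
    """
    return struct.pack(">IQ", category_id, ts_millis)

def scan_range(category_id: int, start_ms: int, stop_ms: int):
    """Return (start_row, stop_row) covering [start_ms, stop_ms) in one category."""
    return row_key(category_id, start_ms), row_key(category_id, stop_ms)

# Simulate the sorted key space with a plain sorted list of keys.
keys = sorted(row_key(c, t) for c in (7, 3) for t in (2000, 1000, 3000))

# Select one category and one time window; the stop row is exclusive.
start, stop = scan_range(3, 1000, 3000)
selected = [k for k in keys if start <= k < stop]
```

Because all rows of a category are contiguous in the sorted key space, the selected window never touches rows from other categories.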
From: Kamal Bahadur <[EMAIL PROTECTED]>
To: user <[EMAIL PROTECTED]>; Dhaval Shah <[EMAIL PROTECTED]>
Sent: Monday, December 23, 2013 3:47 PM
Subject: Re: Schema Design Newbie Question
Thanks for the quick response!
Why do you think having more files is not a good idea? Is it because of OS
I get around 50 million records a day and each record contains ~25
columns. Values for each column are ~30 characters.
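As a rough back-of-the-envelope check of the figures above, the sketch below multiplies them out. It assumes one byte per character and counts only the raw values; it leaves out the row key, column name, and timestamp that HBase stores alongside every cell, as well as compression and replication, so treat the result as an order-of-magnitude estimate only.

```python
RECORDS_PER_DAY = 50_000_000   # "around 50 million records a day"
COLUMNS_PER_RECORD = 25        # "~25 columns"
BYTES_PER_VALUE = 30           # "~30 characters", assuming 1 byte per character

def cells_per_day(records=RECORDS_PER_DAY, columns=COLUMNS_PER_RECORD):
    """Number of individual cells written per day."""
    return records * columns

def raw_value_bytes_per_day(records=RECORDS_PER_DAY,
                            columns=COLUMNS_PER_RECORD,
                            value_bytes=BYTES_PER_VALUE):
    """Bytes of raw column values per day, excluding all per-cell overhead."""
    return cells_per_day(records, columns) * value_bytes

daily = raw_value_bytes_per_day()
print(f"{cells_per_day():,} cells per day")
print(f"{daily / 1e9:.1f} GB of raw values per day")
```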
On Mon, Dec 23, 2013 at 3:35 PM, Dhaval Shah <[EMAIL PROTECTED]> wrote:
> 1000 CFs with HBase does not sound like a good idea.
> category + timestamp sounds like the better of the 2 options you have
> thought of.
> Can you tell us a little more about your data?
> From: Kamal Bahadur <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Monday, 23 December 2013 6:01 PM
> Subject: Schema Design Newbie Question
> I am just starting to use HBase and I am coming from the Cassandra world. Here
> is a quick background regarding my data:
> My system will be storing data that belongs to a certain category.
> Currently I have around 1000 categories. Also note that some categories
> produce a lot more data than others. To be precise, 10% of the categories
> provide more than 65% of the total data in the system.
> Data access queries always contain this category in the query. I have
> listed 2 options to design the schema:
> 1. Add category as first component of the row key [category + timestamp] so
> that my data is sorted based on category for fast retrieval.
> 2. Add category as column family so that I can just use timestamp as
> rowkey. This option will however create more hfiles since I have more
> I am leaning towards option 2. I like the idea that HBase separates data for
> each CF into its own HFiles. However, I am still worried about the number of
> hfiles that will be created on the server. Will it cause any other side
> effects? I would like to hear from the user community as to which option
> will be the best option in my case.