HBase >> mail # user >> Schema Design Newbie Question


Re: Schema Design Newbie Question
1000 CFs in HBase does not sound like a good idea.

category + timestamp sounds like the better of the 2 options you have thought of. 

Can you tell us a little more about your data? 
 
Regards,

Dhaval
________________________________
 From: Kamal Bahadur <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, 23 December 2013 6:01 PM
Subject: Schema Design Newbie Question
 

Hello,

I am just starting to use HBase, coming from the Cassandra world. Here
is some quick background on my data:

My system will be storing data that belongs to a certain category.
Currently I have around 1000 categories. Also note that some categories
produce a lot more data than others. To be precise, 10% of the categories
provide more than 65% of the total data in the system.

Data access queries always include the category. I have
listed 2 options for designing the schema:

1. Add category as first component of the row key [category + timestamp] so
that my data is sorted based on category for fast retrieval.
2. Add category as column family so that I can just use timestamp as
rowkey. This option will however create more hfiles since I have so many
categories.
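
To make option 1 concrete, here is a minimal sketch of how a
[category + timestamp] row key could be laid out as bytes. It is plain
Python, not HBase client code. The helper names (make_row_key,
scan_bounds), the zero-byte separator and the sample categories are
assumptions made for illustration, and it assumes category names never
contain a zero byte. HBase keeps rows sorted by the raw bytes of the row
key, so sorting the keys shows how the rows would be grouped.

```python
import struct
from typing import Tuple


def make_row_key(category: str, timestamp_ms: int) -> bytes:
    """Build a [category + timestamp] key that sorts by category, then time."""
    # The zero byte ends the variable-length category, so "news" is never
    # mistaken for a prefix of "newsletter".
    # The timestamp is packed as an 8-byte big-endian unsigned integer so
    # that byte-wise order matches numeric order.
    return category.encode("utf-8") + b"\x00" + struct.pack(">Q", timestamp_ms)


def scan_bounds(category: str) -> Tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering every key of one category."""
    prefix = category.encode("utf-8")
    # Start is inclusive, stop is exclusive.
    return prefix + b"\x00", prefix + b"\x01"


# Hypothetical sample rows, inserted out of order.
keys = [
    make_row_key("sports", 1387843260000),
    make_row_key("news", 1387843200000),
    make_row_key("sports", 1387843200000),
    make_row_key("newsletter", 1387843200000),
]

# Byte-wise sort: rows of one category end up adjacent, oldest first.
ordered = sorted(keys)

# Selecting one category is a single contiguous range over the sorted keys.
start, stop = scan_bounds("news")
news_rows = [k for k in ordered if start <= k < stop]
```

With this layout, a query for one category reads one contiguous range of
rows, and within that range the rows are already in time order.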

I am leaning towards option 2. I like the idea that HBase separates data for
each CF into its own HFiles. However, I am still worried about the number of
hfiles that will be created on the server. Will it cause any other side
effects? I would like to hear from the user community which option
would be best in my case.

Kamal