It all depends on how closely your key design overlaps with the use cases you want to address. If all of your use cases map very closely onto your key design, you are in good hands; otherwise some tricks are warranted, such as more tables with duplicated data, pre-computation through M/R jobs, etc.
Very Long answer:
In my experience, schema design [actually index-key design] is one of the trickiest parts of HBase. It is unfortunate that one needs to understand the internal architecture to extract optimal utilization and performance from HBase for all but the most siloed use cases. It is ironic too, as schema flexibility is one of the pillars on which the NoSQL movement stands, and HBase provides it only partially: the schema is pretty much dynamically extensible, but the index key is a wedlock from the start.
Now don't get me wrong. There are pros and cons with every technology. Some such insight and tricks are required on the SQL side too. For example, de-normalization and dropping FK references in SQL schemas go against best practices but work out much better in practice at scale. It is just that there is a better knowledge base now, because SQL stores have been in deployment for a long time. That's why schema design for SQL stores seems like a "mostly solved" problem. I bet that in the 70s, when the technology was coming up, schema design was not as commonly understood.
Anyway, here is how I understand HBase:
- Sorted Key value pairs storage [sorted on key]
- Data retrieval by specifying key [pattern].
- Composite key design.
- Storage is hierarchically grouped based on what elements comprise the key. Thus optimization is naturally possible on those lines.
* The hierarchy is limited to 3-4 levels depending on how you count.
- Only one way to sort: the key that you define. Thus effectively only one index per table.
- Distributed storage - scales horizontally with data volume
- Multiversioned cells: same row+column combination can store many versions of data [ mostly versioned by timestamp]
Useful for:
- GET calls based on a specific key work great for real-time lookups.
- Less contention between PUTs and GETs on the same "row". I think(?) the contention is at the cell level.
- Sparse-matrix storage patterns where you are interested in only a group of columns at a time per row.
- Exploit Hadoop's strength of M/R jobs on the same data: so no data duplication.*
- Other Hadoop benefits like redundancy, replication etc.
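To make the "sorted key-value pairs" and "retrieval by key [pattern]" points concrete, here is a minimal sketch in plain Python. It is a toy model and not the HBase client API; the SortedTable class, the scan_prefix helper and the "user|date|event" key layout are all assumptions made up for illustration. It shows why a lookup on the leading part of a composite key turns into one contiguous slice of the sorted keys.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        # Point lookup on the full key.
        return self._rows.get(key)

    def scan_prefix(self, prefix):
        """Return (key, value) pairs whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so nothing later can match
            out.append((key, self._rows[key]))
        return out


# Hypothetical composite key layout: "<user>|<date>|<event>"
table = SortedTable()
table.put("alice|2012-11-07|login", "ok")
table.put("alice|2012-11-08|purchase", "book")
table.put("bob|2012-11-08|login", "ok")
table.put("alice|2012-11-08|login", "ok")

# Everything alice did on Nov 8: seek to the prefix, read until it stops matching.
alice_nov8 = table.scan_prefix("alice|2012-11-08|")
```

In this model, any query that fixes the leading key elements stays cheap. A query on the event type alone has no prefix to seek to, which is the "only one index per table" limitation in miniature.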
Not useful for:
- Range queries in real time across lots of rows [esp. when range filter criteria don't go well with the index design]
- GETs requiring all columns of that sparse table all the time.
- Group By/Top-K/count(*) kind of real-time queries
- Sorting/counting on value for real time queries [ esp. across rows]
- Sorting in a different combination of key-elements than how they are laid out in the key.
- Joins across tables.
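The items about filtering or sorting on a different combination of key elements can be sketched in the same toy fashion. This is plain Python, not HBase code; the key layouts ("user|date" and "date|user"), the helper names and the data are hypothetical. It compares a filter over keys laid out as "user|date" with a prefix scan over a second, duplicated copy laid out as "date|user", counting how many keys each approach has to examine.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return (matching keys, number of keys examined) for a sorted key list."""
    i = bisect.bisect_left(sorted_keys, prefix)
    matches, examined = [], 0
    while i < len(sorted_keys):
        examined += 1
        if not sorted_keys[i].startswith(prefix):
            break  # left the contiguous range that shares the prefix
        matches.append(sorted_keys[i])
        i += 1
    return matches, examined


def filter_scan(sorted_keys, predicate):
    """Full scan: every key is examined and only matching ones are kept."""
    return [k for k in sorted_keys if predicate(k)], len(sorted_keys)


# Made-up data: 20 users x 10 dates = 200 rows.
users = [f"user{n:02d}" for n in range(20)]
dates = [f"2012-11-{d:02d}" for d in range(1, 11)]

# Same data stored twice under two key layouts (the "duplicated table" trick).
by_user = sorted(f"{u}|{d}" for u in users for d in dates)
by_date = sorted(f"{d}|{u}" for u in users for d in dates)

# "All rows for 2012-11-08" against each layout.
slow, slow_examined = filter_scan(by_user, lambda k: k.endswith("|2012-11-08"))
fast, fast_examined = prefix_scan(by_date, "2012-11-08|")
```

Both calls return the same set of rows, but the filter has to walk the whole "user|date" table while the prefix scan touches only the matching slice of the "date|user" copy. The price is writing every row twice and keeping the two copies consistent.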
So HBase is very good where its strengths lie, but I certainly would not say that all SQL loads can be transferred to HBase with the same or better performance expectations. From your previous mail, it seems your queries are more SQL-like. At the risk of being considered an outcast here, my honest advice would be to also look into more document-oriented data stores like Mongo, which can scale to the volume you mentioned and may be able to support the range queries on multiple indexes that you are looking for.
From: Jerry Lam [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 08, 2012 7:32 AM
Subject: Re: Nosqls schema design
Your question is a good and tough one. I haven't found anything that helps guide schema design in the NoSQL world. There are general concepts, but none of them is close to SQL schema design, where you can apply some rules to guide your decisions.
The best presentation I have found about the general concepts in HBase schema design is at http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-hbasecon-2012.html
Search for "Schema Design" on that page. From this presentation, you can learn why it is so difficult to come up with a suggestion for your problem, and pick up some best practices to start your own design.
On Thu, Nov 8, 2012 at 10:17 AM, Nick maillard <[EMAIL PROTECTED]> wrote: