Mike and I get into good discussions about ERD modeling and HBase a lot ... :)
Mike's right that you should avoid a design that relies heavily on relationships when modeling data in HBase, because relationships are tricky (they're the first thing that gets thrown out the window in a database that can scale to huge data sets, because enforcing them is more trouble than it's worth; the same goes for supporting normalization, joins, etc). If you start with a traditional ERD, you're more likely to fall into this trap, because you're "used to" normalizing the crap out of your entities.
But, something just occurred to me: just because your physical implementation (HBase) doesn't support normalized entities and relationships doesn't mean your *problem* doesn't have entities and relationships. :) An Author is one entity, a Title is another, and a Genre is a third. Understanding how they interact is a prerequisite for translating them into a physical model that works well in HBase. (ERD modeling is not categorically the only way to understand that, but I've yet to hear a credible alternative that doesn't boil down to either ERD or "do it in your head").
Once you understand what your entities really are, and how they relate to each other, you have pretty limited choices for how to represent multiple independent entities in HBase:
1) In unrelated tables. You just put authors in one table, titles in another, and genres in a third. You do all the work of joining and maintaining cross-entity integrity yourself (if needed). This is the default mode in HBase: "you worry about it". And that works great in many simple cases. This is appropriate if your "hard problem" is scaling a small set of simple entities to massive size, and you can take the hit for the application complexity that follows.
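To make option 1 concrete, here's a minimal sketch in Python that simulates the two HBase tables as plain dicts (all table, row-key, and column names here are invented for illustration, not a real HBase client API). The point is that the "join" lives entirely in your application code:

```python
# Option 1 sketch: unrelated tables, application-side join.
# Each "table" is modeled as a dict of row_key -> {column: value}.
authors = {"author:1": {"info:name": "Ursula K. Le Guin"}}
titles = {
    "title:100": {"info:name": "The Dispossessed", "info:author_id": "author:1"},
    "title:101": {"info:name": "The Left Hand of Darkness", "info:author_id": "author:1"},
}

def titles_by_author(author_id):
    """Application-side 'join': scan the titles table and filter.

    In real HBase this would be a full table scan (or a secondary
    index you build and maintain yourself) -- the database does
    none of this work for you. That's the 'you worry about it' part.
    """
    return [row["info:name"]
            for row in titles.values()
            if row["info:author_id"] == author_id]
```

At small entity counts this is perfectly livable; the pain shows up when every new access path needs another hand-rolled scan or index.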
2) Scrunched into one table. You figure out the most important entity, and make that *the* table, with all other data stuffed into it. In simple cases, this could be columns that hold JSON; in advanced cases, you could use many columns to "nest" other entities in an intra-row version of denormalization. For example, have the row key of the HBase table be something like "Author ID", and then have a repeating column series for their titles, with column names like "title:1234", "title:5678", etc. This isn't a very common model, because you have to jump through some hoops in HBase (e.g. in this model, the way you would scan over authors differs from how you'd "scan over" titles for an author or across authors). The only real advantage to this over other forms of denormalization is that HBase guarantees intra-row ACID properties, so you're guaranteed to get all or none of the updates to the row (i.e. you don't have to reason about the failure cases). This can (but does *not* have to) use different column families for the different "entities" inside the row.
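The repeating-column-series idea in option 2 can be sketched the same way (again simulating a table as a dict, with invented names). Note how "scanning" an author's titles becomes a prefix filter over one row's columns, which is a different operation from scanning rows:

```python
# Option 2 sketch: one "author" table nesting titles as a repeating
# column series ("title:<id>"), as described above.
author_table = {
    "author:1": {
        "info:name": "Ursula K. Le Guin",
        "title:100": "The Dispossessed",
        "title:101": "The Left Hand of Darkness",
    }
}

def titles_in_row(row):
    # "Scanning over" titles for an author means filtering that row's
    # columns by prefix -- a different access pattern than scanning
    # over authors, which is the hoop-jumping mentioned above.
    return {col: val for col, val in row.items() if col.startswith("title:")}
```

The payoff is that in real HBase all of these columns live in one row, so updates to an author and their titles are atomic together.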
3) Denormalized across many tables. When you write to HBase, you write in multiple layouts: the Author table also contains a list of their titles, the Title table has author name & other info, etc. This basically equates to doing extra work at write time so you don't have to write code that does arbitrary joins and index usage at read time; in exchange, you get slower and more complex writes, but faster and simpler reads from different access paths. (It's still quite tricky, because you have to handle failure cases--what if one table gets written but the other doesn't?)
4) Normalized, with help from custom coprocessors. You could write your own suite of coprocessors to automatically do database-like things for you, such as joins and secondary indexing. I wouldn't recommend this route unless you're doing them in a general enough way to share. For example, Phoenix has an aggregation component that's built as a coprocessor and works really well; and it's applicable to anyone who wants to use Phoenix. You could build more stuff on this SQL framework, like indexes and joins and cascaded relationships and stuff. But that's a pretty massive undertaking for a single use case. :)
Maybe there are others I'm not thinking of, but I think these are basically your only choices. Mike, can you think of other basic approaches to representing more than one entity in HBase (where entity is defined as some repeating element in your data storage where individual instances are uniquely identifiable, possibly with one or more additional attributes)?
On Jul 5, 2013, at 12:48 PM, Michael Segel wrote:
Sorry, but you missed the point.
(Note: This is why I keep trying to get a talk on schema design into Strata and the other conferences, yet for some reason... it just doesn't seem important enough or sexy enough... maybe if I worked for Cloudera/Intel/etc ... ;-)
The issue is what column families are and how to use them.
Since each column family is stored as a separate HFile that shares the same row key, the question is why you need one and when you want to use one.
The answer unfortunately is a bit more complicated than the questions.
You have to ask yourself: when do you have a series of tables that share the same key value?
How do you access this data?
It gets more involved, but just looking at the answers to those two questions is a start.
Like I said, think about the order entry example and how the data is used in those column families.
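To make the order-entry example concrete, here's a hedged sketch (in Python, simulating the table as a dict; the family and column names are invented, not from the email) of two column families sharing one row key but serving different read patterns:

```python
# Sketch: an "orders" table where two column families share the same row
# key. Each family would be a separate HFile in real HBase.
orders = {
    "order:2013-07-05:42": {
        # 'summary' family: small data, read on every order lookup
        "summary": {"customer": "cust:9", "total": "149.90"},
        # 'lines' family: bulky line items, read only on detail views
        "lines": {"item:1": "widget x2", "item:2": "gadget x1"},
    }
}

row = orders["order:2013-07-05:42"]
# Because each family is a separate HFile, a read that touches only
# 'summary' never pays the I/O cost of the bulky 'lines' data --
# that access-pattern split is the point of the two questions above.
total = row["summary"]["total"]
```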
Please also remember that you are NOT WORKING IN A RELATIONAL MODEL. Sorry to shout that last part, but it's a very important concept. You need to stop thinking in terms of ERD when there is no relationship. Column families tend to create a weak relationship... which makes them a bit more confusing....
On Jul 5, 2013, at 11:16 AM, Aji Janis <[EMAIL PROTECTED]> wrote:
I understand that there shouldn't be an unlimited number of column families. I am using this example on purpose to see how that comes into play.
On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
Why do you have so many column families?