Re: new to HBase/NoSQL, need some help with database design
Bryan Beaudreault 2011-12-16, 19:17
Disclaimer: I'm not a master at HBase schema design, so someone more
knowledgeable feel free to refute the below.

I'd say the first thing you should do is take a step back; jumping into
HBase is not as easy as jumping into MySQL. If you are going to be
working at scale, the efficiency of your design matters greatly: row key
size, column qualifier size, number of columns, separation of columns by
column family, etc. A lot of what MySQL abstracts away is put right in
your face with HBase. This is great in a lot of ways, but it makes for a
steep learning curve for someone looking to get in on the scene.

Before contemplating such a project, I'd recommend taking a few days to
study the various documentation out there:
http://hbase.apache.org/book.html and
http://ofps.oreilly.com/titles/9781449396107/ are good places to start.
Learning the inner architecture of how KeyValues work and how data is
retrieved by HBase is very important.

One of the first things you'll hopefully learn is that HBase/NoSQL is
not relational. It doesn't make sense to have a category_id "foreign
key" column like you have in your current approach. There is no "third
normal form" and such for NoSQL like there is for relational databases.
To that point, compiling a list of SQL queries is probably not the best
starting point for a NoSQL project. If your data is highly relational,
while it's certainly possible to make it work in a non-relational
system, it may not be recommended without some real expertise or time to
learn the new paradigms.

A couple of tips:

1) You mention sorting on columns. You're right, this is not provided by
HBase. In HBase there is a single primary key, and that is the row key.
Rows are sorted lexicographically by that key, as you have already found
out. Keep in mind that when you sort in MySQL on a column that has no
usable index, it takes your result set and loads it into memory to be
sorted. HBase doesn't do this for you, but you could easily do it from
your client, once you have the data you want.
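
To make the client-side sort concrete, here is a minimal sketch in plain
Python. It does not talk to HBase at all: the rows are hypothetical
stand-ins for whatever a scan returned, and the "score" column, its
4-byte big-endian encoding, and the helper names are all assumptions
made for illustration.

```python
from typing import Dict, List, Tuple

# A row is modelled as (row_key, {qualifier: value}). HBase stores
# everything as raw bytes, so keys and values are bytes here too.
Row = Tuple[bytes, Dict[bytes, bytes]]


def decode_score(row: Row) -> int:
    """Decode the assumed 4-byte big-endian 'score' column; missing -> 0."""
    raw = row[1].get(b"score")
    if raw is None:
        return 0
    return int.from_bytes(raw, byteorder="big", signed=True)


def sort_by_score(rows: List[Row], descending: bool = True) -> List[Row]:
    """Return a new list ordered by the decoded score."""
    return sorted(rows, key=decode_score, reverse=descending)


# Hypothetical scan result: rows arrive in lexicographic row-key order.
scanned = [
    (b"apples", {b"score": (12).to_bytes(4, "big", signed=True)}),
    (b"bananas", {b"score": (97).to_bytes(4, "big", signed=True)}),
    (b"cherries", {b"score": (40).to_bytes(4, "big", signed=True)}),
]

for key, _ in sort_by_score(scanned):
    print(key.decode())
```

This only works comfortably when the rows you pulled back fit in client
memory, which is the same constraint the database has when it sorts
without an index.
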
You will see this pattern a lot: MySQL does things for you that you now
need to handle yourself, and you realize that there isn't much magic in
how MySQL is doing it.

2) There are filters that can do some of what you want, such as
returning only rows where a column is empty. There are also
coprocessors, depending on the release you are using, which can do some
extra work on the region server before sending data over the wire (such
as more complex filtering or data manipulation that might be expensive
to do locally).

--

I'll take a very quick and naive stab at your specific example. There
should be no id columns. If you want to have a list of categories and
categories can have multiple children, maybe each child category would
be a column on the parent category's row, e.g.:

rowkey  = category name
columns = one column (qualifier = 0x00 byte array) for the main category
          data, plus one extra column per child category, where the
          qualifier is the child's name

The value could be a protobuf or avro message with whatever fields are
important for a category. If all keywords are linked to a category, you
might have that be part of the protobuf/avro message. So for protobuf
your message would be:

message Category {
  repeated Keyword keyword = 1;
  optional string some_other_per_category_field = 2;
}

message Keyword {
  optional string name = 1;
  optional int32 score = 2;
}

Like I said, this was quick and naive. Some of the queries you mentioned
would be expensive with this approach. You could always keep another
table (or even another rowkey schema within the same table, or another
column family) to hold separately incremented counters for the
particular statistics you are interested in. Just an untested idea.

Hope this helps,

Bryan

On Fri, Dec 16, 2011 at 11:26 AM, Alwin Roosen <[EMAIL PROTECTED]> wrote:
> Hello,
>
>
> I have been suggested to use HBase for a project, but after reading
> some manuals/guidelines, I am still not sure how to design the
> database and getting more confused by the minute. I am new to any form