Re: new to HBase/NoSQL, need some help with database design
Disclaimer: I'm not a master at HBase schema design, so someone more
knowledgeable should feel free to refute the below.

I'd say the first thing you should do is take a step back; jumping into
HBase is not as easy as jumping into MySQL.  If you are going to be working
at scale, the efficiency of your design matters greatly, such as row key
size, column qualifier size, number of columns, separation of columns by
column family, etc.  A lot of what MySQL abstracts away is put right in
your face with HBase.  This is great in a lot of ways, but it makes for a
steep learning curve for someone just getting started.  Before
contemplating such a project, I'd recommend spending a few days studying the
various documentation out there: http://hbase.apache.org/book.html and
http://ofps.oreilly.com/titles/9781449396107/ are good places to start.
Learning the inner architecture of how KeyValues work and how data is
retrieved by HBase is very important.

One of the first things you'll hopefully learn is that HBase/NoSQL is not
relational.  It doesn't make sense to have a category_id "foreign key"
column like you have in your current approach.  There is no "third normal
form" and such for NoSQL like there is for relational databases.  To that
point, a list of SQL queries is probably not the best starting point for a
NoSQL project.  If your data is highly relational, while it's certainly
possible to make it work in a non-relational system, it may not be
recommended without some real expertise or time to learn the new
paradigms.

A couple of tips:

1) You mention sorting on columns.  You're right, this is not provided by
HBase.  In HBase there is a single primary key, and that is the row key.
Rows are sorted lexicographically by row key, as you have already found out.
Keep in mind that when MySQL sorts on a column that has no usable index, it
gathers the matching rows and sorts them itself.  HBase doesn't do this for
you, but you can easily do it in your client once you have the data you
want.  You will see this pattern a lot: MySQL does things for you that you
now need to handle yourself, and you come to realize that there isn't much
magic in how MySQL does them.

2) There are filters that can do some of what you want, such as returning
only rows where a column is empty.  There are also coprocessors, depending
on the release you are using, which can do some extra work on the region
server before sending over the wire (such as more complex filtering or data
manipulation that might be expensive to do locally).
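
To make tip 1 a bit more concrete, here is a minimal sketch in plain Python
(no HBase client involved) of what lexicographic row key order means and of
sorting by a column on the client.  The row keys, the "score" field, and the
list of fetched rows are all made up for illustration; in practice the rows
would come back from a scan.

```python
# Plain-Python illustration; no HBase client is used here.
# The row keys and "score" values below are invented for the example.

# HBase keeps rows ordered by the raw bytes of the row key.
row_keys = [b"item-9", b"item-10", b"item-2"]
print(sorted(row_keys))
# Byte order puts "item-10" ahead of "item-2" and "item-9".

# Zero-padding the numeric part makes byte order match numeric order.
padded = [b"item-%04d" % n for n in (9, 10, 2)]
print(sorted(padded))

# Sorting by a column value is the client's job: fetch the rows, then sort.
fetched_rows = [
    {"row": b"apple", "score": 7},
    {"row": b"banana", "score": 42},
    {"row": b"cherry", "score": 19},
]
by_score = sorted(fetched_rows, key=lambda r: r["score"], reverse=True)
print([r["row"] for r in by_score])
```

The padding trick only matters if you want numeric order out of the row key
itself; the client-side sort works on whatever the scan returns, so it is
only practical when that result set fits comfortably in memory.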

--

I'll take a very quick and naive stab at your specific example.  There
should be no id columns.  If you want to have a list of categories and
categories can have multiple children, maybe each child category would be a
column on the parent category's row.  For example: rowkey = category name;
columns = one column (qualifier = 0x00 byte array) for the main category
data, plus one extra column per child category, with the child's name as
the qualifier.  The value could be a protobuf or avro message with whatever
fields are important for a category.  If all keywords are linked to a
category, you might have them be part of the protobuf/avro message.  So for
protobuf your messages would be:

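// Stored as the value of the category's main column.
message Category {
    repeated Keyword keyword = 1;  // all keywords linked to this category
    optional string some_other_per_category_field = 2;
}

// One keyword entry inside a Category.
message Keyword {
    optional string name = 1;
    optional int32 score = 2;
}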
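
The row layout described above can be mimicked with nested dictionaries to
see how the pieces fit together.  This is only an in-memory stand-in, not
HBase client code: the helper names (put_category, child_names), the sample
categories, and the choice to store an empty value under each child column
are assumptions made for the sketch, and plain dicts take the place of
serialized Category messages.

```python
# In-memory stand-in for one HBase table: row key -> {qualifier -> value}.
# Values would be serialized Category messages in practice; plain dicts
# are used here so the sketch runs without protobuf or an HBase client.
MAIN = b"\x00"  # qualifier holding the category's own data

table = {}

def put_category(name, keywords, children=()):
    """Store a category row: main data under MAIN, one column per child."""
    row = table.setdefault(name.encode(), {})
    row[MAIN] = {"keyword": [{"name": k, "score": s} for k, s in keywords]}
    for child in children:
        # Assumption: the column's presence alone marks the child.
        row[child.encode()] = b""

def child_names(name):
    """List child categories by reading the qualifiers on the parent's row."""
    row = table.get(name.encode(), {})
    return sorted(q.decode() for q in row if q != MAIN)

put_category("sports", [("football", 10)], children=["tennis", "golf"])
put_category("tennis", [("racket", 3)])
print(child_names("sports"))
```

Looking up a category and its children is then a single-row read by
category name, which is the access pattern this layout is built around.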

Like I said, this was quick and naive.  Some of the queries you mentioned
above would be expensive with this approach.  You could always keep another
table (or a separate rowkey scheme within the same table, or another column
family) holding incremented counters for the particular statistics you
are interested in.  Just an untested idea.
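
As a rough illustration of the counter idea, the sketch below bumps a few
counters every time a keyword is written, so a statistic becomes a single
lookup instead of a scan.  It is plain Python with an in-memory Counter
standing in for the counter table; the add_keyword helper and the
"keywords:<category>" naming scheme are invented for the example.  HBase
does support atomic increments, which is what a real version would rely on.

```python
from collections import Counter

# Stand-in for a separate counter table or column family.
# Counter names such as "keywords:<category>" are invented for this sketch.
counters = Counter()
writes = []  # stand-in for the actual puts to the category rows

def add_keyword(category, keyword, score):
    """Record a keyword write and bump the statistics it affects."""
    writes.append((category, keyword, score))
    counters["keywords:" + category] += 1
    counters["score_total:" + category] += score

add_keyword("sports", "football", 10)
add_keyword("sports", "tennis", 5)
add_keyword("music", "jazz", 8)

# Reading a statistic is now a single lookup instead of a scan.
print(counters["keywords:sports"])
```

The trade-off is extra work on every write and the need to keep the
counters consistent with the data they summarize.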

Hope this helps,

Bryan

On Fri, Dec 16, 2011 at 11:26 AM, Alwin Roosen <[EMAIL PROTECTED]> wrote:

> Hello,
>
>
> I have been suggested to use HBase for a project, but after reading
> some manuals/guidelines, I am still not sure how to design the
> database and getting more confused by the minute. I am new to any form