HBase, mail # user - When to expand vertically vs. horizontally in Hbase


Re: When to expand vertically vs. horizontally in Hbase
Michael Segel 2013-07-05, 17:48
Sorry, but you missed the point.

(Note: This is why I keep trying to put a talk at Strata and the other conferences on Schema design yet for some reason... it just doesn't seem important enough or sexy enough... maybe if I worked for Cloudera/Intel/etc ...  ;-)

Look,

The issue is what column families are and how to use them.

Since each one is stored in its own set of HFiles that share the same row key, the question is why you need one and when you want to use it.

The answer unfortunately is a bit more complicated than the questions.

You have to ask yourself: when do you have a series of tables that all share the same key value?
How do you access this data?

It gets more involved, but just looking at the answers to those two questions is a start.

Like I said, think about the order entry example and how the data is used in those column families.
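
To make the order entry example concrete, here is a minimal sketch in Python. It is a toy in-memory model, not HBase client code: the `ToyTable` class, the family names (`order`, `pick`, `ship`, `invoice`), the row key, and the values are all hypothetical and exist only for illustration. It tries to show the two questions above: several "tables" share one key, and each step of the process reads only its own family.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory stand-in for an HBase table; not the HBase client API."""

    def __init__(self, families):
        # One independent store per column family, loosely analogous to
        # each family being persisted in its own HFiles.
        self.stores = {cf: defaultdict(dict) for cf in families}

    def put(self, row_key, family, qualifier, value):
        # Families are fixed at construction; an unknown family raises KeyError.
        self.stores[family][row_key][qualifier] = value

    def get(self, row_key, families=None):
        # Only the requested families' stores are consulted.
        wanted = families if families is not None else list(self.stores)
        return {cf: dict(self.stores[cf].get(row_key, {})) for cf in wanted}


# Hypothetical order entry table: one row key, four families.
orders = ToyTable(["order", "pick", "ship", "invoice"])
orders.put("order-0001", "order", "customer", "ACME")
orders.put("order-0001", "order", "item", "widget x 10")
orders.put("order-0001", "ship", "carrier", "UPS")
orders.put("order-0001", "invoice", "total", "125.00")

# The shipping step reads only the "ship" family for that key.
shipping_view = orders.get("order-0001", families=["ship"])
print(shipping_view)
```

The point of the sketch is the access pattern: if every read needed all four families together, keeping them separate would buy little.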

Please also remember that you are NOT WORKING IN A RELATIONAL MODEL. Sorry to shout that last part, but it's a very important concept. You need to stop thinking in terms of ERD when there is no relationship. Column families tend to create a weak relationship... which makes them a bit more confusing....

On Jul 5, 2013, at 11:16 AM, Aji Janis <[EMAIL PROTECTED]> wrote:

> I understand that there shouldn't be an unlimited number of column families. I
> am using this example on purpose to see how it comes into play.
>
>
> On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
>> Why do you have so many column families (CF) ?
>>
>> It's not a question of the physical limitations, but more of the issue of
>> data design.
>>
>> There aren't that many really good examples of where you would have
>> multiple column families that would require more than a handful of CFs.
>>
>> When I teach or lecture, the example I use is an order entry system,
>> where you would have the same key on order entry, pick slips, shipping,
>> and invoice.
>>
>> That's probably the best example of where CFs come into play.
>>
>> I'd suggest that you go back and rethink the design if you're having more
>> than a handful.
>>
>>
>>
>> On Jul 5, 2013, at 8:53 AM, Aji Janis <[EMAIL PROTECTED]> wrote:
>>
>>> Asaf,
>>>
>>> I am using the Genre/Author stuff as an example, but yes, at the moment I
>>> only have 5 column families. However, over time I may have more (no upper
>>> limit decided at this point). See below for more responses.
>>>
>>>
>>> On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <[EMAIL PROTECTED]> wrote:
>>>
>>>> Do you have only 5 static author names?
>>>> Keep in mind the column family name is defined when creating the table.
>>>>
>>>> Regarding tall vs wide debate:
>>>> HBase is first and foremost a key-value database, so reads and writes
>>>> happen at the column-value level. So it doesn't really care about rows.
>>>> But that's not entirely true. Rows come into play in the following
>>>> situations:
>>>> Splitting a region is per row and not per column, so a row will be saved
>>>> as a whole on a region. If you have a really large row, the region size
>>>> granularity is dependent on it. That doesn't seem to be the case here.
>>>> Put/Delete takes a lock on the row until finished. If you do intensive
>>>> inserts to the same row at the same time, this might be bad for you.
>>>> Keeping your rows slimmer can reduce contention, but again, only if you
>>>> make a lot of concurrent modifications to the same row.
>>>>
>>>
>>> I expect batches of Put/Delete to the same row to happen by at most one
>>> thread at a time based on users' current behavior. So locking shouldn't be
>>> an issue. However, I'm not sure whether fitting a whole row into a region
>>> is really an issue I need to worry about (probably because I just
>>> don't know much about it yet).
>>>
>>>
>>>> Filtering - if you need a filter which needs the whole row (there is a
>>>> method you override in Filter to mark that), then a fat row will be more
>>>> memory intensive. If you needed only 1/5 of your row, then maybe splitting it