HBase >> mail # user >> When to expand vertically vs. horizontally in Hbase


Re: When to expand vertically vs. horizontally in Hbase
Sorry, but you missed the point.

(Note: This is why I keep trying to put a talk at Strata and the other conferences on Schema design yet for some reason... it just doesn't seem important enough or sexy enough... maybe if I worked for Cloudera/Intel/etc ...  ;-)

Look,

The issue is what column families are and how to use them.

Since each column family is stored in its own HFiles but uses the same row key, the question is why you need one and when you want to use it.

The answer unfortunately is a bit more complicated than the questions.

You have to ask yourself: when do you have a series of tables that share the same key value?
How do you access this data?

It gets more involved, but just looking at the answers to those two questions is a start.

Like I said, think about the order entry example and how the data is used in those column families.

Please also remember that you are NOT WORKING IN A RELATIONAL MODEL. Sorry to shout that last part, but it's a very important concept. You need to stop thinking in terms of ERD when there is no relationship. Column families tend to create a weak relationship... which makes them a bit more confusing....
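
To make the order entry example concrete, here is a minimal sketch in plain Python. It is a toy model, not the HBase client API: `ToyTable`, the family names (`order`, `pick`, `ship`, `invoice`), and the row key `order-0001` are all hypothetical, chosen only to illustrate the idea of several stores that share one row key and are read independently.

```python
from collections import defaultdict


class ToyTable:
    """Toy model (hypothetical, not HBase): one store per column family,
    every store indexed by the same row key."""

    def __init__(self, families):
        # The set of families is fixed when the table is created.
        self.stores = {name: defaultdict(dict) for name in families}

    def put(self, row_key, family, qualifier, value):
        if family not in self.stores:
            raise KeyError(f"unknown column family: {family}")
        self.stores[family][row_key][qualifier] = value

    def get(self, row_key, family):
        # Only the store of the requested family is consulted.
        return dict(self.stores[family].get(row_key, {}))


# Hypothetical order entry table: four stages of one order, one key.
orders = ToyTable(["order", "pick", "ship", "invoice"])
orders.put("order-0001", "order", "customer", "ACME")
orders.put("order-0001", "order", "total", "129.50")
orders.put("order-0001", "ship", "carrier", "UPS")

# The shipping step reads its own family without touching order entry data.
shipping_view = orders.get("order-0001", "ship")
```

The two questions above map directly onto this sketch: the families exist because the stages share a key, and they are separate because each stage is usually read on its own.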

On Jul 5, 2013, at 11:16 AM, Aji Janis <[EMAIL PROTECTED]> wrote:

> I understand that there shouldn't be an unlimited number of column families. I
> am using this example on purpose to see how it comes into play.
>
>
> On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
>> Why do you have so many column families (CF) ?
>>
>> It's not a question of physical limitations, but more an issue of
>> data design.
>>
>> There aren't that many really good examples where you would need
>> more than a handful of CFs.
>>
>> When I teach or lecture, the example I use is an order entry system,
>> where you would have the same key on order entry, pick slips, shipping,
>> and invoice.
>>
>> That's probably the best example of where CFs come into play.
>>
>> I'd suggest that you go back and rethink the design if you're having more
>> than a handful.
>>
>>
>>
>> On Jul 5, 2013, at 8:53 AM, Aji Janis <[EMAIL PROTECTED]> wrote:
>>
>>> Asaf,
>>>
>>> I am using the Genre/Author stuff as an example but yes at the moment I
>>> only have 5 column families. However, over time I may have more (no upper
>>> limit decided at this point). See below for more responses.
>>>
>>>
>>> On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <[EMAIL PROTECTED]> wrote:
>>>
>>>> Do you have only 5 static author names?
>>>> Keep in mind the column family name is defined when creating the table.
>>>>
>>>> Regarding tall vs wide debate:
>>>> HBase is first and foremost a key-value database, so it reads and writes
>>>> at the column-value level and doesn't really care about rows.
>>>> But that's not entirely true. Rows come into play in the following
>>>> situations:
>>>> Splitting - a region is split per row and not per column, so a row is
>>>> always saved as a whole in one region. If you have a really large row,
>>>> the region size granularity depends on it. That doesn't seem to be the
>>>> case here.
>>>> Put/Delete - each takes a lock on the row until it finishes. If you do
>>>> intensive inserts to the same row at the same time, this might be bad
>>>> for you. Keeping your rows slimmer can reduce contention, but again,
>>>> only if you make a lot of concurrent modifications to the same row.
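
The point about splitting per row can be illustrated with a small sketch in plain Python. This is a toy model under stated assumptions, not HBase's actual split logic: `pick_split_row` and `split` are hypothetical helpers, and choosing the midpoint by cell count is a simplification made for the example.

```python
def pick_split_row(cells):
    """Return the row key at which a toy region would split.

    `cells` is a list of (row_key, family, qualifier) tuples sorted by
    row key. The midpoint is taken by cell count, but only its row key is
    used, so a split can never land in the middle of a row.
    """
    if not cells:
        return None
    return cells[len(cells) // 2][0]


def split(cells):
    """Split cells into two toy regions on a row boundary."""
    split_row = pick_split_row(cells)
    left = [c for c in cells if c[0] < split_row]
    right = [c for c in cells if c[0] >= split_row]
    return left, right


# Hypothetical data: row "b" is wide (6 cells), rows "a" and "c" are slim.
cells = (
    [("a", "cf", "q0")]
    + [("b", "cf", f"q{i}") for i in range(6)]
    + [("c", "cf", "q0")]
)
left, right = split(cells)
```

In this toy run the wide row forces a lopsided split, which is the granularity concern described above: the wider the row, the coarser the unit that has to stay together.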
>>>>
>>>
>>> I expect batches of Put/Delete to the same row to happen by at most one
>>> thread at a time based on user's current behavior, so locking shouldn't
>>> be an issue. However, I'm not sure whether saving a row to a region with
>>> enough space is really an issue I need to worry about (probably because I
>>> just don't know much about it yet).
>>>
>>>
>>>> Filtering - if you need a filter that needs the whole row (there is a
>>>> method you override in Filter to mark that), then a fat row will be more
>>>> memory intensive. If you needed only 1/5 of your row, then maybe splitting it