Re: Advices for HTable schema
Jean-Marc,

These are great questions! Find my answers (and some questions for you) inline.

-ak
On Monday, July 2, 2012 at 12:04 PM, Jean-Marc Spaggiari wrote:

> Hi,
>
> I have a question regarding the best way to design a table.
>
> Let's imagine I want to store all the people in the world on a database.
>
> Everyone has a name, last name, phone number, lots of flags (sex, age, etc.).
>
> Now, people can have one address, but they can also have 2, or 3, or
> even more... But they will never have thousands of addresses. Let's
> say, usually, they have between 1 and 10.
>
>

The point to think about here is: what will your read access pattern be? Will you always want only the latest address, or all addresses every time? You should also define the maximum number of addresses to be stored.
>
> My table is designed like this.
>
> create 'person', {NAME => 'a', VERSIONS => 1}, {NAME => 'b', VERSIONS
> => 1, COMPRESSION => 'gz'}
>
>

You could easily bump up VERSIONS to a number that caps the maximum number of addresses you will store.
Having two separate column families is not the way to solve this problem, in my opinion. The reason is that column families exist to isolate data with different access patterns. If that's what you want here, multiple families make sense. But again, this goes back to defining your read patterns: are you going to access all the data together, or are the addresses going to be accessed independently of the rest of the information?
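
To make the VERSIONS idea concrete, here is a minimal sketch in plain Python (a model, not HBase client code). It assumes each address is written as a new version of a single cell and that the family keeps at most MAX_VERSIONS of them; the cap of 3 and the sample addresses are made up for illustration.

```python
# Plain-Python model of one HBase cell that keeps a bounded number of versions.
# MAX_VERSIONS stands in for the VERSIONS setting on the column family.
MAX_VERSIONS = 3  # hypothetical cap on addresses per person


def put_version(cell, timestamp, value, max_versions=MAX_VERSIONS):
    """Return the cell's versions after a write: newest first, trimmed to the cap."""
    versions = sorted(cell + [(timestamp, value)], reverse=True)
    return versions[:max_versions]


def get_versions(cell, n=1):
    """Return the n most recent values, like a read that asks for n versions."""
    return [value for _, value in cell[:n]]


cell = []
for ts, address in enumerate(["1 Elm St", "2 Oak Ave", "3 Pine Rd", "4 Ash Ln"], start=1):
    cell = put_version(cell, ts, address)

print(get_versions(cell))        # latest address only
print(get_versions(cell, n=10))  # every address still retained
```

Note that in this model the oldest address drops out once the cap is exceeded, so the cap has to match the real maximum you expect.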
>
> The 'a' CF will contain all the information except the address.
> The 'b' CF will contain only the address.
>
> I have a few options to store the addresses.
> I can:
> - Store in CF 'a' a flag to tell how many addresses there are and store
> "add1" to "addx" in the 'b' CF with each cell containing the address.
>
>

This becomes a case where you'll need to build transaction-like logic in your client code. When you want to store an additional address, you'll need to do the following:
1. read the counter from 'a'. Let's say that is n.
2. store the next address with CQ add[n+1]
3. store n+1 as the counter

That complicates the client code and is undesirable. Moreover, you touch both column families on every access to the address info. It is probably better to store the counter in 'b' instead of 'a' in this approach, but you still have the complication of the transaction-like logic.
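
As a rough illustration of why those steps need transaction-like care, here is a plain-Python model of one row (not HBase client code). The counter lives in 'b' as suggested above; the helper names and the "b:count" qualifier are hypothetical.

```python
# Plain-Python model of one row; keys are "family:qualifier" strings.
# This is not HBase client code, and the names are hypothetical.

def read_counter(row):
    """Step 1: read the current address count (0 if none stored yet)."""
    return row.get("b:count", 0)


def write_address(row, n, address):
    """Steps 2 and 3: store the address as add<n+1>, then store n+1 as the counter."""
    row[f"b:add{n + 1}"] = address
    row["b:count"] = n + 1


# One client, steps run back to back: works as intended.
row = {}
write_address(row, read_counter(row), "1 Elm St")

# Two clients interleave: both read the counter before either writes.
n_a = read_counter(row)   # client A sees 1
n_b = read_counter(row)   # client B also sees 1
write_address(row, n_a, "2 Oak Ave")
write_address(row, n_b, "3 Pine Rd")  # overwrites add2; A's address is lost

print(row)
```

HBase does offer atomic single-row operations such as increment and check-and-put that can narrow this gap, but the client still has to sequence them correctly.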
> - Store in CF 'b' the addresses using a hash as the column identifier.
>
>

The hash doesn't buy you anything. How do you ensure that you are reading the latest address? Again, this goes back to defining the read patterns.
> - Store in CF 'b' the addresses as the column identifier and simply
> put '1' in the cell, or a hash.
>
Same problem as the last approach.
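
Here is a small plain-Python model (not HBase client code) of the last two options, to show what I mean. It assumes columns within a family come back sorted by qualifier, and it tracks a write timestamp per cell; the helper names and sample addresses are hypothetical.

```python
import zlib

# Plain-Python model of column family 'b' for one person.
# Keys are column qualifiers; values are (write_timestamp, cell_value).

def add_by_hash(family, address, timestamp):
    """Option 2: CRC32 of the address as the qualifier, address in the cell."""
    qualifier = format(zlib.crc32(address.encode("utf-8")), "08x")
    family[qualifier] = (timestamp, address)


def add_by_address(family, address, timestamp):
    """Option 3: the address itself as the qualifier, '1' as a placeholder value."""
    family[address] = (timestamp, "1")


def scan(family):
    """Columns come back sorted by qualifier, not by when they were written."""
    return [(q, family[q]) for q in sorted(family)]


family = {}
add_by_address(family, "9 Pine Rd", timestamp=1)
add_by_address(family, "9 Pine Rd", timestamp=2)  # repeat add: still one column
add_by_address(family, "1 Elm St", timestamp=3)   # written last

columns = scan(family)
newest = max(family, key=lambda q: family[q][0])  # needs every column's timestamp

hashed = {}
add_by_hash(hashed, "9 Pine Rd", timestamp=1)
add_by_hash(hashed, "9 Pine Rd", timestamp=2)     # same hash, same column

print([q for q, _ in columns])  # sort order, unrelated to write order
print(newest)
```

In both layouts, adds are idempotent, but neither the qualifier nor the column order says which address is the latest. You would have to read every column and compare timestamps, and the same goes for counting.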
>
> The first option gives me very quick information about the number of
> addresses, but if I need to add one address, I have to update the 2
> CFs. Same if I have to remove one.
> The second option will allow me to add any address even without
> checking if it's already there. I can remove one very quickly and add
> one very quickly. If I want to know the number of addresses, I have to
> retrieve all the columns in the CF and count them. However, I'm
> storing almost the same information twice. One time with the address,
> one time with the hash (CRC32).
> The 3rd option has all the advantages of the second one but also, it's
> not storing the information twice. However, that might result in VERY
> long column names. And I'm not sure it's good. Like, if I just want to
> know how many addresses this person has, I will still need to download
> them all to the client side to count them.
>
>

Long column qualifiers are perfectly fine: they take about the same amount of disk space as storing the same data in the cell values. I don't believe that should be a concern.
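
A back-of-the-envelope check, sketched in Python. It assumes the basic KeyValue layout (length prefixes, row, family, qualifier, timestamp, type, value) and ignores compression, block encoding, and any extra fields newer HBase versions may add. The row key and address are made up.

```python
# Rough size of one stored cell, following HBase's KeyValue layout:
# 4-byte key length + 4-byte value length + key + value, where the key holds
# 2-byte row length, row, 1-byte family length, family, qualifier,
# 8-byte timestamp and 1-byte type.
def cell_size(row, family, qualifier, value):
    key = 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1
    return 4 + 4 + key + len(value)


row = b"person-000042"            # hypothetical row key
address = b"221B Baker Street, London NW1 6XE"

in_value = cell_size(row, b"b", b"add1", address)   # short qualifier, address in the cell
in_qualifier = cell_size(row, b"b", address, b"1")  # address as qualifier, '1' in the cell

print(in_value, in_qualifier, in_value - in_qualifier)
```

The address bytes are paid for once either way. The two layouts differ only by the length of "add1" versus the length of the "1" placeholder.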
>
> I'm not able to find which solution I should use. All of them have
> some pros and cons. And I'm not advanced enough in HBase to foresee