HBase, mail # user - Advices for HTable schema

Jean-Marc Spaggiari 2012-07-02, 19:04
Amandeep Khurana 2012-07-02, 19:21
Jean-Marc Spaggiari 2012-07-02, 19:53
Amandeep Khurana 2012-07-02, 20:08
Jean-Marc Spaggiari 2012-07-02, 23:48
Amandeep Khurana 2012-07-02, 23:52
Jean-Marc Spaggiari 2012-07-03, 01:56
Re: Advices for HTable schema
Michael Segel 2012-07-03, 11:57

You're overthinking this.

Take a step back and remember that you can store anything you want as a byte stream in a column.

So you have a record that could be a text blob. Store it in one column. Use JSON to define its structure and fields.

The only thing that makes it difficult is that you will need to pull out the whole blob just to insert or update a single field.
So then maybe segment your data into logical blocks. Like a column that stores the physical attributes of the person.
Another column that stores the list of addresses for the person.
Another column that stores the list of aliases used by the person.
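
As a rough sketch of that segmentation (not real HBase client code), the Python below uses a plain dict as a stand-in for one row, mapping a column qualifier to raw bytes. The qualifier names ("physical", "addresses", "aliases"), the helper functions, and the sample values are all made up for illustration.

```python
import json

# A plain dict stands in for one HBase row: column qualifier -> raw bytes.
# (Hypothetical stand-in; a real client would issue Put/Get calls instead.)
row = {}

def put_block(row, qualifier, value):
    """Serialize one logical block to JSON bytes and store it in its own column."""
    row[qualifier] = json.dumps(value, sort_keys=True).encode("utf-8")

def get_block(row, qualifier):
    """Read one column back and decode it, without touching the other columns."""
    return json.loads(row[qualifier].decode("utf-8"))

# Illustrative values only; one column per logical block of the person record.
put_block(row, "physical", {"height_cm": 180, "eye_color": "brown"})
put_block(row, "addresses", [{"city": "City A"}])
put_block(row, "aliases", ["JM"])

# Adding an address reads and rewrites only the "addresses" column.
physical_before = row["physical"]
addresses = get_block(row, "addresses")
addresses.append({"city": "City B"})
put_block(row, "addresses", addresses)
```

The point of the split is that an update touches one block rather than the whole record; the "physical" and "aliases" bytes are never read or rewritten when an address changes.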

Don't think in relational terms. HBase isn't relational and ER is not the best way to model in a NoSQL database.
Think IMS/COBOL (mainframe) or Dick Pick's Revelation's OS.

The only relationships in HBase are weak relationships between tables.
Column families currently have some nasty side effects, so you may want to consider carefully how you apply them.

Think in terms of records.

Look at storing data using Avro.

On Jul 2, 2012, at 8:56 PM, Jean-Marc Spaggiari wrote:

> 2012/7/2, Amandeep Khurana <[EMAIL PROTECTED]>:
>>> Here are the 2 options now. Both with a new table.
>>> 1) I store the key "personID" and a:a1 to a:an for the addresses.
>>> 2) I store the key "personID" + "address
>>> In both I will have the same amount of data. In #1 total size will be
>>> smaller since the key will be stored only once.
>> The size will be the same. The underlying HFile will store 1 row per cell
>> and the number of cells in both cases is the same.
>> However, the first approach with multiple columns for addresses needs you to
>> keep track of the number and makes updates, deletes, additions complicated
>> as I highlighted earlier. The second option with putting both things in the
>> key makes life much easier.
>> If the data is primarily being accessed independently, I'd go with option 2.
> Oh! I see! My misunderstanding comes from my lack of HBase
> knowledge/reflex. I forgot it was storing the data that way. So I
> think I will most probably give a try to this 2nd option! Thanks for
> sharing your ideas all over the day.
> JM
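
To make the two options in the quoted exchange concrete, here is a minimal Python sketch that models each layout as a list of (row key, column, value) cells. The "|" key separator, the single "a:a" column in option 2, and the function names are assumptions for illustration; the thread does not specify them.

```python
def layout_wide(person_id, addresses):
    """Option 1: one row per person, columns a:a1 .. a:an hold the addresses."""
    return [(person_id, f"a:a{i}", addr)
            for i, addr in enumerate(addresses, start=1)]

def layout_tall(person_id, addresses):
    """Option 2: one row per (person, address); the address is part of the key.
    The '|' separator and the 'a:a' column are assumed, not from the thread."""
    return [(f"{person_id}|{addr}", "a:a", b"") for addr in addresses]

def delete_wide(cells, person_id, address):
    """Option 1 delete: scan the person's cells to find which column holds it."""
    return [c for c in cells if not (c[0] == person_id and c[2] == address)]

def delete_tall(cells, person_id, address):
    """Option 2 delete: the row key is computable directly from the address."""
    key = f"{person_id}|{address}"
    return [c for c in cells if c[0] != key]

addresses = ["addr-1", "addr-2", "addr-3"]   # illustrative values
wide = layout_wide("p42", addresses)
tall = layout_tall("p42", addresses)

wide_after = delete_wide(wide, "p42", "addr-2")
tall_after = delete_tall(tall, "p42", "addr-2")
```

Both layouts produce the same number of cells, which matches the point that the stored size does not differ much. The difference shows up in maintenance: after the delete, option 1 is left with columns a:a1 and a:a3 and a gap to keep track of, while option 2 simply has one row fewer.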
Jean-Marc Spaggiari 2012-07-03, 12:31
Michael Segel 2012-07-03, 14:03