HBase >> mail # user >> Advices for HTable schema

Jean-Marc Spaggiari 2012-07-02, 19:04
Amandeep Khurana 2012-07-02, 19:21
Jean-Marc Spaggiari 2012-07-02, 19:53
Amandeep Khurana 2012-07-02, 20:08
Jean-Marc Spaggiari 2012-07-02, 23:48
Amandeep Khurana 2012-07-02, 23:52
Jean-Marc Spaggiari 2012-07-03, 01:56
Michael Segel 2012-07-03, 11:57
Jean-Marc Spaggiari 2012-07-03, 12:31
Re: Advices for HTable schema
Comparisons are fine.

Try not to think of this in terms of rows and columns, but in terms of records.
Think of each record as being atomic.
Create a list of all of the components that make up that record.
Then combine like components into structures.

Take the street address, for example. Add a couple of fields to indicate when the person lived there. If there is no end date, it must be a current address.
You could put the addresses in an array; however, arrays imply a finite size. An ordered set or list would be more appropriate.
Each of these structures then becomes a column.
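
As a rough illustration of the record-oriented approach described above, here is a minimal Python sketch. It is not HBase client code: the field names, sample values, and the helper functions `to_columns` and `current_addresses` are all hypothetical, and JSON stands in for whatever serialization (Avro, for instance) is actually used. It only shows like components grouped into structures, each structure serialized into one column value, and an address with no end date treated as current.

```python
import json

# Hypothetical person record: field names and values are illustrative only.
person = {
    "physical": {"height_cm": 180, "eye_color": "brown"},
    "addresses": [  # a list, so it can grow without a fixed size
        {"street": "12 Oak St", "start": "2005-03-01", "end": "2010-06-30"},
        {"street": "7 Elm Ave", "start": "2010-07-01", "end": None},
    ],
    "aliases": ["JM", "Johnny"],
}


def to_columns(record):
    """Serialize each structure to bytes; one structure becomes one column value."""
    return {
        name: json.dumps(value, sort_keys=True).encode("utf-8")
        for name, value in record.items()
    }


def current_addresses(columns):
    """Decode the addresses column and keep the entries that have no end date."""
    addresses = json.loads(columns["addresses"].decode("utf-8"))
    return [a for a in addresses if a["end"] is None]


columns = to_columns(person)
print(sorted(columns))             # the column names
print(current_addresses(columns))  # addresses still in use
```

Updating one structure (say, appending an address) only requires reading and rewriting that one column value, not the whole record.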
On Jul 3, 2012, at 7:31 AM, Jean-Marc Spaggiari wrote:

> Hi Michael,
> I'm trying to dive deeply into HBase and forget all my RDBMS knowledge,
> but sometimes it's difficult not to compare, and I don't yet have
> the right way of thinking. The more Amandeep replied
> yesterday, the clearer it became, but it seems I still have a LOT to
> learn.
> I will never update one single value from the data I have. I will
> update all the columns for one row, or none. When I need to read
> them, I usually need to read all of them, or almost all. Not just one.
> I moved to a multiple-column architecture because I built the
> application with MySQL first, but the more I read, the more I see that
> it's not the right way.
> I can have 2 tables.
> One with a key made from the person ID, and only one single CF and one
> C with everything in a single cell, stored as JSON output
> serialized using Avro, like you are suggesting.
> And a second table with rows like PERSONID_PERSONADDRESS with a dummy
> CF and C just to keep one cell.
> In the end, that will meet all my needs, but it will require a bit more
> thinking. And it's so far from the initial design! But I think that's
> definitely a good solution.
> Thanks!
> JM
> 2012/7/3, Michael Segel <[EMAIL PROTECTED]>:
>> Hi,
>> You're overthinking this.
>> Take a step back and remember that you can store anything you want as a byte
>> stream in a column.
>> Literally.
>> So you have a record that could be a text blob. Store it in one column. Use
>> JSON to define its structure and fields.
>> The only thing that makes it difficult is that you will need to pull out
>> everything just to insert or update something.
>> So then maybe segment your data into logical blocks, like a column that
>> stores the physical attributes of the person.
>> Another column that stores the list of addresses for the person.
>> Another column that stores the list of aliases used by the person.
>> Don't think in relational terms. HBase isn't relational and ER is not the
>> best way to model in a NoSQL database.
>> Think IMS/COBOL (mainframe) or Dick Pick's Revelation's OS.
>> The only relationships in HBase are weak relationships between tables.
>> Column Families currently have some nasty side effects, so you may want to
>> consider how you apply them.
>> Think in terms of records.
>> Look at storing data using Avro.
>> On Jul 2, 2012, at 8:56 PM, Jean-Marc Spaggiari wrote:
>>> 2012/7/2, Amandeep Khurana <[EMAIL PROTECTED]>:
>>>>> Here are the 2 options now. Both with a new table.
>>>>> 1) I store the key "personID" and a:a1 to a:an for the addresses.
>>>>> 2) I store the key "personID" + "address
>>>>> In both I will have the same amount of data. In #1 total size will be
>>>>> smaller since the key will be stored only once.
>>>> The size will be the same. The underlying HFile will store 1 row per
>>>> cell and the number of cells in both cases is the same.
>>>> However, the first approach with multiple columns for addresses needs you
>>>> to keep track of the number and makes updates, deletes, additions
>>>> complicated as I highlighted earlier. The second option with putting
>>>> both things in the key makes life much easier.
>>>> If the data is primarily being accessed independently, I'd go with
>>>> option 2.
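
The two options compared in the quoted exchange above can be sketched with a toy model. This is a hedged illustration only, not HBase client code: each layout is a plain Python dict keyed by `(row_key, column)`, the function names are hypothetical, the `a:a` dummy qualifier is an assumption, and the `_` separator follows the PERSONID_PERSONADDRESS naming mentioned elsewhere in the thread.

```python
# Each layout is modeled as a dict keyed by (row_key, column) -> value,
# mirroring "one stored entry per cell". This is NOT a real HBase client.


def option1_put(cells, person_id, address):
    """Option 1: row key is the person ID; addresses go in a:a1 .. a:an."""
    # The writer first has to find out how many address columns already exist.
    n = sum(1 for (row, _col) in cells if row == person_id)
    cells[(person_id, f"a:a{n + 1}")] = address


def option2_put(cells, person_id, address):
    """Option 2: row key is PERSONID_ADDRESS with a single dummy cell."""
    cells[(f"{person_id}_{address}", "a:a")] = b""


def option2_scan(cells, person_id):
    """Prefix scan: addresses of every row whose key starts with the person ID."""
    prefix = f"{person_id}_"
    return sorted(
        row[len(prefix):] for (row, _col) in cells if row.startswith(prefix)
    )


opt1, opt2 = {}, {}
for addr in ["12 Oak St", "7 Elm Ave"]:
    option1_put(opt1, "p42", addr)
    option2_put(opt2, "p42", addr)

print(len(opt1), len(opt2))        # same number of cells in both layouts
print(option2_scan(opt2, "p42"))   # addresses recovered from the row keys
```

In this model, option 1 needs a count before every write (and the numbering gets harder to maintain once addresses are deleted), while option 2 is a blind put and a prefix scan.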