HBase, mail # user - Advices for HTable schema


Re: Advices for HTable schema
Michael Segel 2012-07-03, 14:03
Comparisons are fine.

Try not to think of this in terms of rows and columns, but in terms of records.
Think of each record as being atomic.
Create a list of all of the components that make up that record.
Then combine like components into structures.

Take the street address, for example. Add a couple of fields to record when the person lived there. If there is no end date, it must be a current address.
You could put the addresses in an array; however, arrays imply a finite size. An ordered set or list would be more appropriate.
Each of these structures then becomes a column.
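
As a rough illustration of the record-oriented layout described above, here is a minimal Python sketch. It assumes JSON as the encoding (the thread also suggests Avro), and every field name and value in it is hypothetical. It does not use the HBase client; it only shows how one record could split into one serialized value per structure.

```python
import json

# Hypothetical record for one person; all field names and values are
# illustrative assumptions, not taken from the thread.
person = {
    "physical": {"height_cm": 180, "eye_color": "brown"},
    "addresses": [  # ordered list, oldest first
        {"street": "12 Oak St", "city": "Springfield",
         "from": "2005-03-01", "to": "2010-06-30"},
        {"street": "7 Elm Ave", "city": "Shelbyville",
         "from": "2010-07-01", "to": None},  # no end date: current address
    ],
    "aliases": ["alias-1", "alias-2"],
}


def to_columns(record):
    """Serialize each structure to bytes: one column value per structure."""
    return {name: json.dumps(value, sort_keys=True).encode("utf-8")
            for name, value in record.items()}


def current_addresses(address_bytes):
    """Decode the address column and keep entries that have no end date."""
    addresses = json.loads(address_bytes.decode("utf-8"))
    return [a for a in addresses if a["to"] is None]


columns = to_columns(person)
current = current_addresses(columns["addresses"])
```

Each key of `columns` would map to a column qualifier and each value to the bytes stored in that cell, so updating the address list means rewriting only the `addresses` cell rather than the whole record.
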
On Jul 3, 2012, at 7:31 AM, Jean-Marc Spaggiari wrote:

> Hi Michael,
>
> I'm trying to dive deeply into HBase and forget all my RDBMS knowledge,
> but sometimes it's difficult not to compare, and I don't yet have
> the right way of thinking. The more Amandeep replied
> yesterday, the clearer it became, but it seems I still have a LOT to
> learn.
>
> I will never update a single value in the data I have. I will
> update all the columns for a row, or none. When I need to read
> them, I usually need to read all of them, or almost all. Not just one.
> I moved to a multiple columns architecture because I did the
> application with MySQL first but the more I read, the more I see that
> it's not the right way.
>
> I can have 2 tables.
> One with a key made with the person ID, and only one single CF and one
> C with everything into a single cell stored as a JSON output
> serialized using AVRO like you are suggesting.
> And a second table with rows like PERSONID_PERSONADDRESS with a dummy
> CF and C just to keep one cell.
>
> In the end, that will meet all my needs, but it will require a bit more
> thought. And it's so far from the initial design! But I think that's
> definitely a good solution.
>
> Thanks!
>
> JM
>
> 2012/7/3, Michael Segel <[EMAIL PROTECTED]>:
>> Hi,
>>
>> You're overthinking this.
>>
>> Take a step back and remember that you can store anything you want as a byte
>> stream in a column.
>> Literally.
>>
>> So you have a record that could be a text blob. Store it in one column. Use
>> JSON to define its structure and fields.
>>
>> The only thing that makes it difficult is that you will need to pull out
>> everything just to insert or update something.
>> So then maybe segment your data into logical blocks. Like a column that
>> stores the physical attributes of the person.
>> Another column that stores the list of addresses for the person.
>> Another column that stores the list of aliases used by the person.
>>
>> Don't think in relational terms. HBase isn't relational and ER is not the
>> best way to model in a NoSQL database.
>> Think IMS/COBOL (mainframe) or Dick Pick's Revelation's OS.
>>
>> The only relationships in HBase are weak relationships between tables.
>> Column families currently have some nasty side effects, so you may want to
>> consider how you apply them.
>>
>> Think in terms of records.
>>
>> Look at storing data using Avro.
>>
>> On Jul 2, 2012, at 8:56 PM, Jean-Marc Spaggiari wrote:
>>
>>> 2012/7/2, Amandeep Khurana <[EMAIL PROTECTED]>:
>>>>> Here are the 2 options now. Both with a new table.
>>>>>
>>>>> 1) I store the key "personID" and a:a1 to a:an for the addresses.
>>>>> 2) I store the key "personID" + "address
>>>>>
>>>>> In both I will have the same amount of data. In #1 total size will be
>>>>> smaller since the key will be stored only once.
>>>>>
>>>>>
>>>>
>>>> The size will be the same. The underlying HFile will store 1 row per cell
>>>> and the number of cells in both cases is the same.
>>>>
>>>> However, the first approach with multiple columns for addresses needs you to
>>>> keep track of the number and makes updates, deletes, additions complicated
>>>> as I highlighted earlier. The second option with putting both things in the
>>>> key makes life much easier.
>>>>
>>>> If the data is primarily being accessed independently, I'd go with option 2.
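
To make option 2 a little more concrete, the following Python sketch models a table as an in-memory dictionary keyed by a composite row key. This is only a stand-in for a sorted key-value store, not the HBase client API, and the key format (a zero-padded person ID, an underscore, then an address ID) is an assumption chosen for illustration.

```python
# In-memory stand-in for a sorted table: row key (bytes) -> cell value (bytes).
# This is NOT the HBase client API; it only illustrates the key layout.
table = {}


def address_key(person_id, address_id):
    """Composite row key: fixed-width person ID, separator, address ID."""
    return f"{person_id:010d}_{address_id}".encode("utf-8")


def put_address(person_id, address_id, value):
    """Add or overwrite one address row without touching any other row."""
    table[address_key(person_id, address_id)] = value


def delete_address(person_id, address_id):
    """Remove one address row; no counter of addresses has to be maintained."""
    table.pop(address_key(person_id, address_id), None)


def scan_person(person_id):
    """Prefix scan: every row whose key starts with the person ID, in key order."""
    prefix = f"{person_id:010d}_".encode("utf-8")
    return [(k, v) for k, v in sorted(table.items()) if k.startswith(prefix)]


put_address(42, "home", b"7 Elm Ave")
put_address(42, "work", b"1 Main St")
put_address(43, "home", b"9 Pine Rd")
delete_address(42, "work")
rows = scan_person(42)
```

The fixed-width ID is a design choice in this sketch: it keeps the prefix for person 4 from also matching person 42, and it keeps all rows for one person adjacent in sort order.
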