Jean-Marc Spaggiari 2012-07-02, 19:04
Amandeep Khurana 2012-07-02, 19:21
Jean-Marc Spaggiari 2012-07-02, 19:53
Amandeep Khurana 2012-07-02, 20:08
Jean-Marc Spaggiari 2012-07-02, 23:48
Amandeep Khurana 2012-07-02, 23:52
Jean-Marc Spaggiari 2012-07-03, 01:56
Michael Segel 2012-07-03, 11:57
Jean-Marc Spaggiari 2012-07-03, 12:31
Re: Advices for HTable schema
Michael Segel 2012-07-03, 14:03
Comparisons are fine.
Try not to think of this in terms of rows and columns, but in terms of records.
Think of each record as being atomic.
Create a list of all of the components that make up that record.
Then combine like components into structures.
Take the street address, for example. Add a couple of fields to indicate when the person lived there. If there is no end date, it must be a current address.
You could put them in an array; however, arrays imply a finite size. An ordered set or list would be more appropriate.
Each of these structures then becomes a column.
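
A minimal sketch of the record-oriented layout described above, in plain Python with no HBase client involved. The field names, the `info` column family, and the use of JSON for serialization are illustrative assumptions, not anything prescribed in this thread; Avro could fill the same role.

```python
import json

# Hypothetical person record; every field name here is an illustrative assumption.
person = {
    "physical": {"height_cm": 180, "eye_color": "brown"},
    "addresses": [
        # Ordered list, oldest first. "end": None marks the current address.
        {"street": "12 Old Road", "start": "2005-03-01", "end": "2010-06-30"},
        {"street": "7 New Street", "start": "2010-07-01", "end": None},
    ],
    "aliases": ["JM", "Jean M."],
}


def to_cells(record, family="info"):
    """Turn each structure into one cell: 'family:qualifier' -> serialized bytes."""
    return {
        f"{family}:{name}": json.dumps(value, sort_keys=True).encode("utf-8")
        for name, value in record.items()
    }


def current_address(cells, family="info"):
    """Decode only the addresses column and return the entry with no end date."""
    addresses = json.loads(cells[f"{family}:addresses"].decode("utf-8"))
    for address in addresses:
        if address["end"] is None:
            return address
    return None


cells = to_cells(person)
print(sorted(cells))
print(current_address(cells)["street"])
```

Because each structure lives in its own cell, reading the current address only requires decoding the `addresses` column rather than the whole record.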
On Jul 3, 2012, at 7:31 AM, Jean-Marc Spaggiari wrote:
> Hi Michael,
> I'm trying to dive deeply into HBase and forget all my RDBMS knowledge
> but sometimes it's difficult not to compare, and I don't yet have
> all the right thinking mechanisms. The more Amandeep replied
> yesterday, the clearer it became, but it seems I still have a LOT to learn.
> I will never update one single value from the data I have. I will
> update all the columns for one row, or none. When I need to read
> them, I usually need to read all of them, or almost all. Not just one.
> I moved to a multiple columns architecture because I did the
> application with MySQL first but the more I read, the more I see that
> it's not the right way.
> I can have 2 tables.
> One with a key made of the person ID, and only one single CF and one
> C with everything in a single cell stored as a JSON output
> serialized using Avro like you are suggesting.
> And a second table with rows like PERSONID_PERSONADDRESS with a dummy
> CF and C just to keep one cell.
> In the end, that will meet all my needs, but it will require a bit more
> thinking. And it's so far from the initial design! But I think that's
> definitely a good solution.
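
The two-table layout proposed in the message above could be sketched roughly as follows. This is plain Python that only builds row keys and cells; the `d:d` family and qualifier, the `p0001` ID, and the `_` separator are assumptions made for illustration, and a real key would need a separator that cannot appear inside the person ID.

```python
import json


def person_row(person_id, record):
    """Table 1: one row per person, one column family, one cell holding everything."""
    blob = json.dumps(record, sort_keys=True).encode("utf-8")
    return person_id, {"d:d": blob}


def address_rows(person_id, addresses):
    """Table 2: one row per (person, address); the cell is an empty placeholder."""
    return [(f"{person_id}_{address}", {"d:d": b""}) for address in addresses]


# Hypothetical sample data.
record = {"name": "Example Person", "addresses": ["12 Old Road", "7 New Street"]}

key, cells = person_row("p0001", record)
index_rows = address_rows("p0001", record["addresses"])

print(key, list(cells))
for row_key, _ in index_rows:
    print(row_key)
```

In the second table all of the information sits in the row key, so the dummy cell exists only because a row needs at least one cell to exist.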
> 2012/7/3, Michael Segel <[EMAIL PROTECTED]>:
>> You're overthinking this.
>> Take a step back and remember that you can store anything you want as a byte
>> stream in a column.
>> So you have a record that could be a text blob. Store it in one column. Use
>> JSON to define its structure and fields.
>> The only thing that makes it difficult is that you will need to pull out
>> everything just to insert or update something.
>> So then maybe segment your data into logical blocks. Like a column that
>> stores the physical attributes of the person.
>> Another column that stores the list of addresses for the person.
>> Another column that stores the list of aliases used by the person.
>> Don't think in relational terms. HBase isn't relational and ER is not the
>> best way to model in a NoSQL database.
>> Think IMS/COBOL (mainframe) or Dick Pick's Revelation's OS.
>> The only relationships in HBase are weak relationships between tables.
>> Column Families currently have some nasty side effects, so you may want to
>> consider how you apply them.
>> Think in terms of records.
>> Look at storing data using Avro.
>> On Jul 2, 2012, at 8:56 PM, Jean-Marc Spaggiari wrote:
>>> 2012/7/2, Amandeep Khurana <[EMAIL PROTECTED]>:
>>>>> Here are the 2 options now. Both with a new table.
>>>>> 1) I store the key "personID" and a:a1 to a:an for the addresses.
>>>>> 2) I store the key "personID" + "address
>>>>> In both I will have the same amount of data. In #1 total size will be
>>>>> smaller since the key will be stored only once.
>>>> The size will be the same. The underlying HFile will store 1 row per
>>>> and the number of cells in both cases is the same.
>>>> However, the first approach with multiple columns for addresses needs you
>>>> to keep track of the number and makes updates, deletes, additions
>>>> as I highlighted earlier. The second option with putting both things in
>>>> the key makes life much easier.
>>>> If the data is primarily being accessed independently, I'd go with option
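
To make the bookkeeping difference between the two options concrete, here is a toy comparison in plain Python. `SortedTable` is a hypothetical stand-in that only mimics the fact that HBase keeps rows sorted by key; it is not the HBase client API, and the IDs and addresses are made up.

```python
from bisect import bisect_left


class SortedTable:
    """Toy table that keeps its row keys in sorted order."""

    def __init__(self):
        self._keys = []

    def put(self, row_key):
        i = bisect_left(self._keys, row_key)
        if i == len(self._keys) or self._keys[i] != row_key:
            self._keys.insert(i, row_key)

    def delete(self, row_key):
        i = bisect_left(self._keys, row_key)
        if i < len(self._keys) and self._keys[i] == row_key:
            del self._keys[i]

    def prefix_scan(self, prefix):
        """Return every row key that starts with prefix, in sorted order."""
        start = bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append(key)
        return result


# Option 1: one row per person, addresses in numbered columns a:a1 .. a:an.
row = {"a:a1": "12 Old Road", "a:a2": "7 New Street"}
next_index = len(row) + 1            # the writer has to know the current count
row[f"a:a{next_index}"] = "3 Lake Avenue"
del row["a:a2"]                      # leaves a gap in the numbering: a1, a3

# Option 2: one row per (person, address); no counter to maintain.
table = SortedTable()
for address in ["12 Old Road", "7 New Street"]:
    table.put(f"p0001_{address}")
table.put("p0001_3 Lake Avenue")     # addition is a plain put
table.delete("p0001_7 New Street")   # deletion is a plain delete
table.put("p0002_1 Other Place")     # another person's rows sort separately

print(sorted(row))
print(table.prefix_scan("p0001_"))
```

Note that after the delete in option 1, `len(row) + 1` evaluates to 3 again and would overwrite `a:a3`, which is the kind of tracking problem the composite key avoids.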