HBase, mail # user - Advices for HTable schema


Re: Advices for HTable schema
Amandeep Khurana 2012-07-02, 23:52
Inline
On Monday, July 2, 2012 at 4:48 PM, Jean-Marc Spaggiari wrote:

> Addresses will mainly be accessed independently, and only sometimes
> with the other data.
>
> I'm not sure I prefer the "versions" option either. So if I go with a
> 2nd table, does it mean it's better to have more rows than more
> columns?
>
> Here are the 2 options now. Both with a new table.
>
> 1) I store the key "personID" and a:a1 to a:an for the addresses.
> 2) I store the key "personID" + "address"
>
> In both I will have the same amount of data. In #1 total size will be
> smaller since the key will be stored only once.
>
>

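The size will be about the same. The underlying HFile stores one key/value entry per cell, and each entry carries the full row key, so the row key is repeated for every cell rather than stored once. The number of cells is the same in both cases.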
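
To make the size argument concrete, here is a rough sketch in Python. It is a simplified model, not HBase's actual on-disk format: it counts only row key, column family, qualifier, and value bytes per cell and ignores timestamps, type bytes, and length prefixes, which are the same per cell in both layouts. The person ID, the addresses, and the helper name `cell_size` are made up for illustration.

```python
# Simplified model: each cell is stored as its own entry that repeats the
# row key, family, and qualifier. Per-cell overhead that is identical in
# both layouts (timestamp, type, length prefixes) is left out.

def cell_size(row, family, qualifier, value):
    """Approximate bytes for one stored cell in this simplified model."""
    return len(row) + len(family) + len(qualifier) + len(value)

person_id = b"person42"                                   # hypothetical row key
addresses = [b"12 Main St", b"7 Oak Ave", b"99 Elm Rd"]   # hypothetical data

# Option 1: one row per person, one column per address (a:a1 .. a:an).
option1 = [
    (person_id, b"a", b"a%d" % (i + 1), addr)
    for i, addr in enumerate(addresses)
]

# Option 2: one row per (person, address); qualifier and value are a dummy "1".
option2 = [(person_id + addr, b"a", b"1", b"1") for addr in addresses]

size1 = sum(cell_size(*cell) for cell in option1)
size2 = sum(cell_size(*cell) for cell in option2)

print("cells:", len(option1), len(option2))
print("bytes:", size1, size2)
```

In option 1 the address sits in the value and in option 2 it sits in the row key, but either way it is written once per cell alongside the person ID, so the totals land in the same place give or take a few bytes for qualifiers and dummy values.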

However, the first approach, with multiple columns for addresses, requires you to keep track of the number of addresses and makes updates, deletes, and additions complicated, as I highlighted earlier. The second option, which puts both things in the key, makes life much easier.

If the data is primarily being accessed independently, I'd go with option 2.
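
As a rough illustration of why option 2 is easier to work with, here is a toy model in Python. It is not the HBase client API: `SortedTable` is a hypothetical in-memory stand-in for a table whose row keys are kept in sorted byte order, and the `\x00` separator between person ID and address is my assumption, added so that one person's ID cannot match as a prefix of another's. Reading all addresses for a person is the start-key/end-key scan suggested earlier in this thread.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose row keys are kept in sorted byte order."""

    def __init__(self):
        self._keys = []

    def put(self, row_key):
        # Insert the key at its sorted position unless it already exists.
        i = bisect.bisect_left(self._keys, row_key)
        if i == len(self._keys) or self._keys[i] != row_key:
            self._keys.insert(i, row_key)

    def delete(self, row_key):
        # Remove the key if present; no counter or renumbering is needed.
        i = bisect.bisect_left(self._keys, row_key)
        if i < len(self._keys) and self._keys[i] == row_key:
            del self._keys[i]

    def scan(self, start, stop):
        """Return row keys in [start, stop), like a start/stop-row scan."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return self._keys[lo:hi]


SEP = b"\x00"  # assumed separator between person ID and address


def row_key(person_id, address):
    return person_id + SEP + address


def addresses_for(table, person_id):
    start = person_id + SEP
    stop = person_id + b"\x01"  # first key past every "person_id\x00..." row
    return [key[len(start):] for key in table.scan(start, stop)]


table = SortedTable()
table.put(row_key(b"person4", b"12 Main St"))
table.put(row_key(b"person4", b"7 Oak Ave"))
table.put(row_key(b"person42", b"99 Elm Rd"))

# Removing one address is a single delete of its row.
table.delete(row_key(b"person4", b"12 Main St"))

print(addresses_for(table, b"person4"))
print(addresses_for(table, b"person42"))
```

Adding or removing an address touches exactly one row, so the client never has to read a count, pick the next free column name, or close gaps in a1..an.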

> In #1 I will have more
> columns, whereas in #2 I will have more rows.
>
> Is there one better than the other one? Also, if I go with option 1,
> why is it better to have a 2nd table instead of a 2nd column family?
>
> JM
>
> 2012/7/2, Amandeep Khurana <[EMAIL PROTECTED] (mailto:[EMAIL PROTECTED])>:
> > Responses inline
> >
> >
> > On Monday, July 2, 2012 at 12:53 PM, Jean-Marc Spaggiari wrote:
> >
> > > Hi Amandeep,
> > >
> > > Thanks for your prompt reply.
> > >
> > > I forgot to add that all the addresses are valid at the same time.
> > > There is no order in the addresses. They are all active addresses at
> > > the same time. If one is not valid any more, it's removed. If there is
> > > a new one, it's added to the list, not replacing any other. So it's
> > > not "the last address", but I have to consider all the addresses when
> > > I will process them.
> > >
> > > Regarding adding the address count in the first CF, don't ask me why I
> > > have put it in 'a'. I have no clue why I did not think about adding
> > > it to 'b' directly. I agree that it's useless to have it in 'a'.
> > >
> > > The idea of the hash as a column name was just to have something to
> > > put there. It's like the '1' in the second solution. A random number
> > > will do the same thing.
> > >
> > > I'm accessing the data in 2 ways.
> > > 1) I access the person information to update it or retrieve all of
> > > it for display.
> > > 2) I access only the addresses to compute some statistics about them.
> > > Which means I usually read ALL the addresses for one person and not just
> > > one address at a time.
> > >
> >
> > So, that means that the addresses are accessed independently of the other
> > information and you always access all the addresses together? Or does that
> > mean that the addresses are accessed along with the other information to
> > display or retrieve and they are also accessed separately for the stats
> > calculation?
> >
> > You could consider the following ideas:
> >
> > 1. Store everything in 'a' and let all addresses go into the column
> > 'a:address'. Increase the versions to N, where N is the max number of
> > addresses you want to store for any user.
> >
> > OR
> >
> > 2. Store addresses in an entirely different table with the rowkey being
> > user+address. The column qualifier and cell value could be just a simple 1
> > for the sake of having something there. When you want to get all addresses
> > for a user, you just scan from start key 'user' to end key 'user+1'.
> >
> > I'm not a fan of the first schema option that you outlined earlier because
> > of the complexity involved in the client code. That approach works with
> > relational databases where you have the ability to do transactions. In the
> > HBase world, not so much.
> > >
> > > So basically, all 3 options are almost the same thing. If I
> > > store the number of addresses, I will have more work when I have to