HBase, mail # user - One table or multiple tables?


Re: One table or multiple tables?
Bryan Beaudreault 2012-02-03, 00:47
At the risk of sounding contrary, I'll actually give voice to the opposite.

Like Jean said, you haven't said much about your read patterns.  I'd say
understanding that is the first critical part of this.

I'd also argue that it is no less simple to put them all in the same table,
and possibly much more flexible.  I imagine you aren't always going to be
reporting on only a single metric at once.  In the multiple table layout,
you'll need to do multiple scans/gets to retrieve the data you need.  If
you put them all in a single table, you might be able to do a better job
returning them all (or a subset) at once.
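As a rough illustration of that point, here is a minimal sketch that models a single table as a sorted list of row keys in the USER_ID/TYPE/TIMESTAMP format from the original question. It does not use the HBase client; `row_key` and `prefix_scan` are hypothetical helpers, and the user ids, types, and timestamps are made-up sample values. It only shows that, with sorted keys, one contiguous range can return all of a user's events or a single type.

```python
from bisect import bisect_left

def row_key(user_id: str, event_type: str, ts_millis: int) -> str:
    # Zero-pad the timestamp so lexicographic order matches numeric order.
    return f"{user_id}/{event_type}/{ts_millis:013d}"

def prefix_scan(sorted_keys, prefix):
    # Walk one contiguous range of the sorted keys, starting at the first
    # key >= prefix and stopping at the first key without that prefix.
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches

# Made-up sample data standing in for the single 'user_logs' table.
keys = sorted([
    row_key("u1", "login", 1328140800000),
    row_key("u1", "search", 1328140860000),
    row_key("u1", "view", 1328140920000),
    row_key("u2", "login", 1328140830000),
])

all_u1 = prefix_scan(keys, "u1/")             # every event type for user u1
u1_searches = prefix_scan(keys, "u1/search/")  # only u1's searches
```

Note that with TYPE ahead of TIMESTAMP in the key, a user's events come back grouped by type and ordered by time only within each type.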

I'm not sure what value you would be storing, but if it is reasonable
enough you might want to store the metrics as different columns
instead of different rows.  It all depends on the access patterns, but
having them all in the same table opens up more flexibility.  (Beware of
monotonically increasing row keys though:
http://hbase.apache.org/book.html#timeseries)
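To make the columns-instead-of-rows idea concrete, here is a small sketch under stated assumptions: the table is modelled as a plain in-memory mapping of row key to columns, the `put` helper is hypothetical (it is not the HBase API), and the USER_ID/TIMESTAMP key layout is just one possible choice. Because the key leads with the user id rather than the timestamp, it does not grow monotonically across all writes.

```python
from collections import defaultdict

# Hypothetical stand-in for a table: row key -> {column qualifier: value}.
table = defaultdict(dict)

def put(user_id: str, ts_millis: int, event_type: str, value: str) -> None:
    # The key starts with the user id, so consecutive writes from different
    # users land in different parts of the key space.
    key = f"{user_id}/{ts_millis:013d}"
    # Each event type becomes a column in the same row.
    table[key][event_type] = value

# Made-up sample events.
put("u1", 1328140800000, "login", "ok")
put("u1", 1328140800000, "search", "hbase schema")
put("u2", 1328140830000, "view", "product-42")

# One row lookup returns every event type recorded at that instant.
row = table["u1/1328140800000"]
```

Whether this layout beats one row per event depends on how the data is read, as noted above.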

I'd love to hear from an expert on the pros and cons of big tables vs many
tables, when access patterns and simplicity are not a concern[1].  I
haven't found much information regarding it, but I'd imagine the only
benefit to many tables is the ability to configure each differently if that
is helpful for the use case.

[1] By this I mean, 2 or more different data sets where the row keys won't
conflict and will never be queried together.  Is there a benefit to putting
them in multiple tables vs a single, aside from config differences (e.g. #
of column families)?

On Thu, Feb 2, 2012 at 5:37 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:

> You're not telling us much about your read patterns and data
> distribution, but I would go with the former solution for the sake of
> simplicity. You'd want to write your row keys in the same format as
> OpenTSDB does: http://opentsdb.net/schema.html
>
> J-D
>
> On Wed, Feb 1, 2012 at 8:59 AM, Mark <[EMAIL PROTECTED]> wrote:
> > We would like to track all of our users' interactions ordered by time.
> > Product views, searches, logins, etc. There are (at least) two ways of
> > accomplishing this:
> >
> > We could use one table 'user_logs' with keys in the format
> > USER_ID/TYPE/TIMESTAMP, where TYPE could be product view, search, login, etc.
> >
> > Or we could have a separate table for each type: UserProductLogs,
> > UserSearchLogs, etc.
> >
> > What are the pros/cons of each strategy, and which one do you think I
> > should employ?
> >
> > - M
>