HBase >> mail # user >> Couple of schema design questions
Re: Couple of schema design questions
>>why you even need hbase to store logs
So that all the useful information in the logs can be sliced and diced any
way we want, quickly, without a sequential search.  Isn't indexed search
faster than sequential?  Isn't that why HBase (and BigTable, for that
matter) was created in the first place?

>>you will not only have to deal with many tables
This was pointed out in the 'Cons' section.  We understand that, but in
exchange deleting data older than 60 days becomes very easy: we just delete
those tables.
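
To illustrate the cleanup we have in mind, here is a rough Python sketch that
only picks out which daily tables have aged past the 60 days.  It assumes the
table names end in -YYYY-MM-DD as in XYZ-2012-02-26; the function name
expired_tables and the sample table list are made up for illustration, and
the actual drop would still go through the HBase shell or admin API.

```python
from datetime import date, datetime, timedelta

RETENTION_DAYS = 60  # from the thread: data older than 60 days is dropped


def expired_tables(table_names, today, retention_days=RETENTION_DAYS):
    """Return the daily tables whose date suffix is older than the retention window.

    Assumes names end in -YYYY-MM-DD, e.g. "XYZ-2012-02-26" (hypothetical helper).
    """
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in table_names:
        try:
            # The last 10 characters are expected to hold the date suffix.
            day = datetime.strptime(name[-10:], "%Y-%m-%d").date()
        except ValueError:
            continue  # not a daily table, so leave it alone
        if day < cutoff:
            expired.append(name)
    return expired


# Sample names are invented for the example.
tables = ["XYZ-2011-12-20", "XYZ-2012-02-26", "XYZ-meta"]
print(expired_tables(tables, today=date(2012, 2, 27)))  # ['XYZ-2011-12-20']
```

Note that HBase requires a table to be disabled before it can be dropped, so
each expired table needs two steps rather than one.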

>>... when data changes is going to be unnecessarily complex.
Once created, data will NOT change.  The data is from logs from previous
days.  It's historical data.

>>if you have different tables for different days, it will get cumbersome to search..
When a user needs data across multiple dates, we can either fetch each day
sequentially (for small queries) or, for long-running queries, run one query
per day in parallel and then combine the results for all days.  Keeping
HBase Regions separate for each day should provide some performance
benefit, we think.  This is where we need help from the community.
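
To make the parallel option concrete, here is a minimal Python sketch of the
fan-out and combine step.  It is not HBase client code: scan_one_table is a
hypothetical callable standing in for whatever scan we would run against one
daily table, and the in-memory FAKE_DATA dictionary exists only so the
example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def daily_table_names(prefix, start, end):
    """Names of the per-day tables covering start..end inclusive."""
    days = (end - start).days + 1
    return [f"{prefix}-{start + timedelta(days=i)}" for i in range(days)]


def query_range(prefix, start, end, scan_one_table, max_workers=8):
    """Scan each daily table in parallel and concatenate results in date order."""
    names = daily_table_names(prefix, start, end)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs,
        # so the combined rows stay sorted by day.
        per_day = list(pool.map(scan_one_table, names))
    return [row for rows in per_day for row in rows]


# Stand-in for a real per-table scan (assumption, for illustration only).
FAKE_DATA = {
    "XYZ-2012-02-25": ["a"],
    "XYZ-2012-02-26": ["b", "c"],
}


def fake_scan(table_name):
    return FAKE_DATA.get(table_name, [])


print(query_range("XYZ", date(2012, 2, 25), date(2012, 2, 27), fake_scan))
# ['a', 'b', 'c']
```

The sequential variant for small queries is the same loop without the thread
pool, so both paths can share daily_table_names.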

>>so if you can give more details on what you want to do with the stored data
Hmm.  The 2nd question is more about understanding the pros and cons of
using a 'String' vs. a 'Custom Class' for Row Keys.
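
To show what the two choices look like at the byte level, here is a small
Python sketch for an ordernum*itemnum key.  The zero padding to 6 digits and
the use of two unsigned 32-bit big-endian integers are assumptions made for
the example, and the function names are invented; a custom key class would
do the equivalent packing in its own serialization code.

```python
import struct


def string_key(order_num, item_num):
    """ordernum*itemnum as ASCII text, zero-padded (assumption) to 6 digits each."""
    return f"{order_num:06d}*{item_num:06d}".encode("ascii")


def binary_key(order_num, item_num):
    """Two big-endian unsigned 32-bit ints, so byte order matches numeric order."""
    return struct.pack(">II", order_num, item_num)


def parse_binary_key(key):
    """Recover (order_num, item_num) from a binary key."""
    return struct.unpack(">II", key)


s = string_key(123456, 42)
b = binary_key(123456, 42)
print(s, len(s))  # b'123456*000042' 13
print(len(b))     # 8
print(parse_binary_key(b))  # (123456, 42)
```

For 6-digit IDs the difference is 13 bytes against 8, well short of 3x; the
gap grows with wider IDs.  HBase sorts row keys as raw bytes, so an unpadded
string key would place order 9 after order 10, while the padded string and
the big-endian binary form both keep numeric order.  The binary form is no
longer readable in the HBase Shell, which is the cost being weighed here.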

Thanks.
On Sun, Feb 26, 2012 at 10:48 PM, T Vinod Gupta <[EMAIL PROTECTED]> wrote:

> before even getting into schema design, im curious to know why you even
> need hbase to store logs?
>
> coming to the options below, option 1 sounds very naive and
> unsophisticated.. you will not only have to deal with many tables but the
> processing around the times when date changes is going to be unnecessarily
> complex. besides, most common use of logs is to search for stuff. if you
> have different tables for different days, it will get cumbersome to
> search..
>
> regarding the right schema, it all depends on your use case. so if you can
> give more details on what you want to do with the stored data, that helps.
> the row key, column family and column name structure depends on what is
> your access pattern (both reads and writes) and sorting requirements.
>
> thanks
>
> On Sun, Feb 26, 2012 at 10:24 PM, Something Something <[EMAIL PROTECTED]> wrote:
>
> > Trying to design a HBase schema for a log processing application.  We will
> > get new logs every day.
> >
> > 1)  We are thinking we will keep data for each day in separate tables.  The
> > table names would be something like XYZ-2012-02-26 etc.  There will be at
> > most 4 tables for each day.
> >
> > Pros:
> > Other processes that are processing old data are not affected while data is
> > getting ready for each day.
> > It's easier to delete old data that's no longer needed.  Just delete the
> > tables.
> >
> > Cons:
> > Lots of tables to deal with.
> > Any other??
> >
> > (Other option is, of course, to create a Table with dates and other tables
> > will have keys that contain date - at the end of the row key).
> >
> >
> > 2)  We are thinking the RowKeys will be in String format with a separator
> > character e.g.  ordernum*itemnum.  The keys will only contain IDs & these
> > IDs will be small, probably 6 digits each.
> >
> > Pros:
> > It's easier to look/search for data using HBase Shell.
> > Very easy to implement.
> >
> > Cons:
> > As pointed out here (http://hbase.apache.org/book/rowkey.design.html),
> > Strings need nearly 3x the bytes.
> >
> > (Other option is to create a separate Classes for compound row keys. Is it
> > worth the effort?)
> >
> >
> > Is there a general consensus regarding these issues?  Thanks in advance for
> > your help.
> >
>