-
Couple of schema design questions
Something Something 2012-02-27, 06:24
Trying to design an HBase schema for a log processing application. We will get new logs every day.

1) We are thinking we will keep data for each day in separate tables. The table names would be something like XYZ-2012-02-26 etc. There will be at most 4 tables for each day.

Pros:
Other processes that are processing old data are not affected while data is getting ready for each day.
It's easier to delete old data that's no longer needed. Just delete the tables.

Cons:
Lots of tables to deal with.
Any other??

(The other option is, of course, to create a table with dates, and other tables will have keys that contain the date at the end of the row key.)

2) We are thinking the row keys will be in String format with a separator character, e.g. ordernum*itemnum. The keys will only contain IDs, and these IDs will be small, probably 6 digits each.

Pros:
It's easier to look/search for data using the HBase Shell.
Very easy to implement.

Cons:
As pointed out here (http://hbase.apache.org/book/rowkey.design.html), Strings need nearly 3x the bytes.

(The other option is to create separate classes for compound row keys. Is it worth the effort?)

Is there a general consensus regarding these issues? Thanks in advance for your help.
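To put rough numbers on the String vs. compound-key trade-off in question 2, here is a minimal sketch. It assumes two non-negative IDs of at most 6 digits each, as described above; the function names and the choice of 4-byte unsigned integers are illustrative assumptions, not anything prescribed by HBase.

```python
import struct
from typing import Tuple

SEP = "*"  # separator from the example key in the post: ordernum*itemnum


def string_key(order_num: int, item_num: int) -> bytes:
    """Readable key: zero-padded decimal IDs joined by a separator.

    Zero-padding keeps lexical order equal to numeric order.
    """
    return f"{order_num:06d}{SEP}{item_num:06d}".encode("ascii")


def binary_key(order_num: int, item_num: int) -> bytes:
    """Compact key: two unsigned 32-bit big-endian integers.

    Big-endian fixed-width encoding also sorts in numeric order
    when compared byte by byte.
    """
    return struct.pack(">II", order_num, item_num)


def parse_binary_key(key: bytes) -> Tuple[int, int]:
    """Inverse of binary_key."""
    return struct.unpack(">II", key)


s = string_key(123456, 42)
b = binary_key(123456, 42)
print(s, len(s))       # 13 bytes: 6 + 1 + 6
print(b.hex(), len(b)) # 8 bytes: 4 + 4
```

For 6-digit IDs the string form costs 13 bytes against 8 for the packed form, so the overhead here is well below 3x. The "nearly 3x" figure applies to large values such as full 8-byte longs written out in decimal. The price of the packed form is that keys are no longer readable in the HBase Shell without a decoding step.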
-
Re: Couple of schema design questions
T Vinod Gupta 2012-02-27, 06:48
Before even getting into schema design, I'm curious to know why you even need HBase to store logs.

Coming to your options, option 1 sounds very naive and unsophisticated. You will not only have to deal with many tables, but the processing around the times when the date changes is going to be unnecessarily complex. Besides, the most common use of logs is to search for stuff. If you have different tables for different days, it will get cumbersome to search.

Regarding the right schema, it all depends on your use case. So if you can give more details on what you want to do with the stored data, that helps. The row key, column family and column name structure depend on your access pattern (both reads and writes) and sorting requirements.

Thanks
-
Re: Couple of schema design questions
Something Something 2012-02-27, 07:45
>> why you even need hbase to store logs
So that all the useful information in the logs can be sliced & diced any way we want, quickly, without the need for sequential search. Isn't indexed search faster than sequential? Isn't that why HBase (and BigTable for that matter) was created in the first place?

>> you will not only have to deal with many tables
This was pointed out in the 'Cons' section. We understand that, but deleting data older than 60 days is very easy. We just need to delete those tables.

>> ... when data changes is going to be unnecessarily complex.
Once created, data will NOT change. The data is from logs from previous days. It's historical data.

>> if you have different tables for different days, it will get cumbersome to search..
When a user needs data across multiple dates, we can either get the data sequentially for each day (for small queries) or, for long-running queries, run queries in parallel for each day and then combine the results for all days. Keeping HBase Regions separate for each day does provide some performance benefits, we think. This is where we need help from the community.

>> so if you can give more details on what you want to do with the stored data
Hmm. The 2nd question is more about understanding the pros & cons of using 'String' vs. 'Custom Class' for row keys.

Thanks.
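The per-day-table plan in this message (fan out one query per day, combine the results, and drop tables older than 60 days) can be sketched roughly as follows. This is only an illustration under stated assumptions: `query_one` is a hypothetical callable standing in for whatever scan is run against a single table, the `XYZ` prefix is taken from the example table name in the thread, and no real HBase client calls are shown.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from typing import Callable, Iterable, List

TABLE_PREFIX = "XYZ"  # from the thread's example name XYZ-2012-02-26


def table_name(d: date) -> str:
    """Table name for one day, e.g. XYZ-2012-02-26."""
    return f"{TABLE_PREFIX}-{d.isoformat()}"


def tables_for_range(first: date, last: date) -> List[str]:
    """All per-day table names from first to last, inclusive."""
    days = (last - first).days
    return [table_name(first + timedelta(days=i)) for i in range(days + 1)]


def expired_tables(existing: Iterable[str], today: date, keep_days: int = 60) -> List[str]:
    """Tables dated more than keep_days before today.

    ISO dates sort lexically in chronological order, so a plain
    string comparison against the cutoff name is enough.
    """
    cutoff = table_name(today - timedelta(days=keep_days))
    return sorted(t for t in existing if t < cutoff)


def query_all(tables: List[str], query_one: Callable[[str], list]) -> list:
    """Run query_one against every table in parallel and merge the rows."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_table = list(pool.map(query_one, tables))  # keeps table order
    return [row for rows in per_table for row in rows]
```

The sketch makes the cost of the design visible: every multi-day query has to know which tables exist, open each of them, and merge results in the client, which is the "cumbersome to search" concern raised earlier in the thread.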
-
Re: Couple of schema design questions
Rohit Kelkar 2012-02-27, 08:11
One table per day just doesn't make sense. If your most frequent use case is retrieving data for a single day, then you might want to design your row keys to include the date, or simply use rowKey = "year-day of year". The row keys are sorted lexically by HBase. This way you can easily retrieve all records for a day by using a scan along with row key filters. To perform analysis on data from multiple days you can then use MapReduce jobs.

- Rohit Kelkar
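To illustrate why a date-prefixed key makes single-day retrieval cheap, here is a small model of lexically sorted row keys. It uses a sorted Python list and `bisect` in place of an HBase table, so it only demonstrates the ordering argument. The key layout (year, day of year, then the two IDs) and all function names are assumptions made for this sketch.

```python
import bisect
from datetime import date
from typing import List


def day_prefix(d: date) -> bytes:
    """'year-day of year', zero-padded so lexical order is chronological."""
    return f"{d.year:04d}-{d.timetuple().tm_yday:03d}".encode("ascii")


def row_key(d: date, order_num: int, item_num: int) -> bytes:
    """Date prefix followed by the zero-padded IDs."""
    return day_prefix(d) + f"*{order_num:06d}*{item_num:06d}".encode("ascii")


def stop_row(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with prefix.

    Assumes the last byte of prefix is below 0xff, which holds for ASCII digits.
    """
    return prefix[:-1] + bytes([prefix[-1] + 1])


def scan_day(sorted_keys: List[bytes], d: date) -> List[bytes]:
    """Return the contiguous slice of keys for one day: [start, stop)."""
    start = day_prefix(d)
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop_row(start))
    return sorted_keys[lo:hi]


keys = sorted([
    row_key(date(2012, 2, 27), 1, 1),
    row_key(date(2012, 2, 26), 2, 5),
    row_key(date(2012, 2, 26), 1, 1),
])
print(scan_day(keys, date(2012, 2, 26)))
```

Because all keys for one day are adjacent in sort order, a day's data can be read as one contiguous range between a start row and a stop row rather than by testing every row against a filter.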
-
Re: Couple of schema design questions
T Vinod Gupta 2012-02-27, 10:30
If you want data to get deleted after a while, you can put a TTL on the table. Based on what you described, you can append the date to your row keys (preferably as a suffix) so that you can do scans/gets appropriately for your use case.

Thanks
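As a rough illustration of the TTL suggestion: in HBase the TTL is configured per column family and expressed in seconds, so the 60-day retention mentioned earlier in the thread needs converting. The expiry check below is a simplified model of the idea (a cell is dropped once its age exceeds the TTL), not HBase's actual implementation, and the date-suffixed key layout is an assumption for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ttl_seconds(days: int) -> int:
    """Convert a retention period in days to a TTL value in seconds."""
    return days * SECONDS_PER_DAY


def is_expired(cell_ts_ms: int, now_ms: int, ttl_s: int) -> bool:
    """Simplified model: a cell expires once its age exceeds the TTL.

    Timestamps are in milliseconds; the TTL is in seconds.
    """
    return now_ms - cell_ts_ms > ttl_s * 1000


def suffixed_key(order_num: int, item_num: int, day: str) -> bytes:
    """IDs first, date last: all days for one order/item sort together."""
    return f"{order_num:06d}*{item_num:06d}*{day}".encode("ascii")


print(ttl_seconds(60))  # 5184000
print(suffixed_key(123456, 42, "2012-02-26"))
```

With a TTL in place, old data ages out without a separate drop-table job. A date suffix keeps all days of one order together for gets and short scans, while a date prefix groups all rows of one day, so the choice follows the dominant access pattern.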