|
yutoo yanio
2012-10-10, 15:24
Jerry Lam
2012-10-10, 16:55
Doug Meil
2012-10-10, 17:09
Jerry Lam
2012-10-10, 19:08
Shumin Wu
2012-10-10, 19:51
yutoo yanio
2012-10-11, 08:27
Anoop Sam John
2012-10-11, 09:10
|
-
key design
yutoo yanio 2012-10-10, 15:24
Hi,
I have a question about key and column design.
Our application receives 3,000,000,000 records every day. Each record contains a user-id, a "time stamp", and content (max 1KB). We need to store records for one year, which means we will have about 1,000,000,000,000 records after 1 year.
Our only query is a search for one user-id over a range of "time stamp".
The table can be designed in two ways:
1.key=userid-timestamp and column:=content
2.key=userid-yyyyMMdd and column:HHmmss=content
The first design gives a tall-narrow table with a very large number of rows; the second gives a flat-wide table. Which of them has better performance?
Thanks.
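To make the two layouts concrete, here is a minimal sketch of how one record maps to a (row key, column qualifier, value) triple in each design. It is plain Python rather than HBase client code. The fixed-width user-id and the yyyyMMddHHmmss encoding of the timestamp in design 1 are assumptions; the post does not say how the timestamp is encoded.

```python
from datetime import datetime


def tall_narrow_cell(user_id: str, ts: datetime, content: bytes):
    """Design 1: one row per record, key = userid-timestamp.

    Assumes the timestamp is rendered as yyyyMMddHHmmss so that keys for
    one user sort chronologically.
    """
    row_key = f"{user_id}-{ts:%Y%m%d%H%M%S}"
    return row_key, "", content  # empty qualifier, as in "column:=content"


def flat_wide_cell(user_id: str, ts: datetime, content: bytes):
    """Design 2: one row per user per day, key = userid-yyyyMMdd.

    The time of day becomes the column qualifier (HHmmss).
    """
    row_key = f"{user_id}-{ts:%Y%m%d}"
    qualifier = f"{ts:%H%M%S}"
    return row_key, qualifier, content


# Hypothetical record, used only for illustration.
ts = datetime(2012, 10, 10, 15, 24, 0)
print(tall_narrow_cell("u000000042", ts, b"payload"))
print(flat_wide_cell("u000000042", ts, b"payload"))
```

Both designs keep all of one user's data adjacent in sort order; they differ only in whether the time of day lives in the row key or in the qualifier.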
-
Re: key design
Jerry Lam 2012-10-10, 16:55
Hi:
So you are saying you have ~3TB of data stored per day?
Using the second approach, all data for one day will go to only 1 regionserver no matter what you do, because HBase doesn't split a single row.
Using the first approach, data will spread across regionservers, but each regionserver will be hotspotted during writes, since this is a time-series problem.
Best Regards,
Jerry
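The ~3TB figure follows from the numbers in the original post. A quick check, treating "max 1KB" as 1,000 bytes per record (an assumption) and ignoring key overhead, replication, and compression, so this is only an upper bound on raw content:

```python
RECORDS_PER_DAY = 3_000_000_000
MAX_RECORD_BYTES = 1_000  # "max 1KB" taken as 1,000 bytes (assumption)
DAYS_PER_YEAR = 365

daily_bytes = RECORDS_PER_DAY * MAX_RECORD_BYTES
daily_tb = daily_bytes / 1e12                    # terabytes of content per day
yearly_records = RECORDS_PER_DAY * DAYS_PER_YEAR
yearly_pb = daily_bytes * DAYS_PER_YEAR / 1e15   # petabytes of content per year

print(f"{daily_tb:.1f} TB/day, {yearly_records:,} records/year, {yearly_pb:.3f} PB/year")
```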
-
Re: key design
Doug Meil 2012-10-10, 17:09
Hi there-
Given that the userid is in the lead position of the key in both approaches, I'm not sure that he'd have a region hotspotting problem, because the userid should be able to offer some spread.
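A rough way to see the spread is to bucket row keys by region boundaries and count where simultaneous writes land. The sketch below is a toy model, not HBase code: the split points are hypothetical, and the timestamp-first key is not one of the proposed designs. It is included only as the contrast case where time-series hotspotting does occur.

```python
import bisect
from collections import Counter


def region_for(row_key: str, split_points: list) -> int:
    """Index of the region whose key range contains row_key."""
    return bisect.bisect_right(split_points, row_key)


def writes_per_region(row_keys, split_points):
    """Count how many of the given writes land in each region."""
    return Counter(region_for(k, split_points) for k in row_keys)


# 100 hypothetical users all writing in the same second.
now = "20121010170900"
user_first = [f"u{i:03d}-{now}" for i in range(100)]   # userid-timestamp
time_first = [f"{now}-u{i:03d}" for i in range(100)]   # contrast case only

# Hypothetical region boundaries for each table.
user_splits = ["u025", "u050", "u075"]
time_splits = ["20120401", "20120701", "20121001"]

print(writes_per_region(user_first, user_splits))  # spread over 4 regions
print(writes_per_region(time_first, time_splits))  # all in the last region
```

With the userid leading, the concurrent writes are divided among regions by user; with the timestamp leading, every current write falls into the region holding the newest keys.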
-
Re: key design
Jerry Lam 2012-10-10, 19:08
That's true. Then there would be at most 86,400 records per day per userid (one per second at HHmmss resolution). That is about 100MB per day. I don't see much difference between the two approaches from the storage perspective.
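The per-user bound can be worked out directly. This sketch assumes one record per second at most (the HHmmss qualifier in design 2 cannot distinguish more) and again takes "max 1KB" as 1,000 bytes:

```python
SECONDS_PER_DAY = 24 * 60 * 60   # distinct HHmmss qualifiers in one day
MAX_RECORD_BYTES = 1_000         # "max 1KB" taken as 1,000 bytes (assumption)

max_records_per_user_day = SECONDS_PER_DAY
max_row_mb = max_records_per_user_day * MAX_RECORD_BYTES / 1e6

print(max_records_per_user_day, "records,", max_row_mb, "MB of content at most")
```

So the widest possible row in design 2 holds roughly 86MB of content, which is where the "about 100MB" estimate comes from.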
-
Re: key design
Shumin Wu 2012-10-10, 19:51
The Definitive Guide has a good discussion in Chapter 9, "Tall-Narrow vs. Flat-Wide Tables". The suggested style is to design the table tall-narrow to make splitting easy.
Also, in approach 2, why do you need the "-yyyyMMdd" part? If you want to keep a creation time, I think it's better to create a column to store it. Keep in mind that every row would carry the storage overhead of this trailing part.
Shumin
-
Re: key design
yutoo yanio 2012-10-11, 08:27
We have 200,000,000 user-ids, and I think user-id is good for the lead position of the key. Is that OK?
What about search performance? Which approach gives better results?
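Combining the 200,000,000 user-ids with the daily record count from the original post gives the average load per user. This assumes records are spread evenly across users, which real traffic rarely is, so treat it as an order-of-magnitude figure:

```python
RECORDS_PER_DAY = 3_000_000_000
USER_IDS = 200_000_000
MAX_RECORD_BYTES = 1_000  # "max 1KB" taken as 1,000 bytes (assumption)

avg_records_per_user_day = RECORDS_PER_DAY / USER_IDS
avg_kb_per_user_day = avg_records_per_user_day * MAX_RECORD_BYTES / 1_000

print(avg_records_per_user_day, "records and at most",
      avg_kb_per_user_day, "KB per user per day on average")
```

On average a user-day row in design 2 would hold about 15 columns, far below the 86,400 upper bound.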
-
RE: key design
Anoop Sam John 2012-10-11, 09:10
>we just search a user-id over a range of "time stamp"
In that case you can go with your 1st approach, IMO: "1.key=userid-timestamp and column:=content"
>we have 200,000,000 user-id and i think user-id is good for lead position of the key. is it ok?
Yes it is...
-Anoop-
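With approach 1, the search for one user over a time range becomes a single contiguous scan between a start row and a stop row. The sketch below emulates that over a sorted list of keys in plain Python; it is not HBase client code. The key values are hypothetical, and the yyyyMMddHHmmss timestamp encoding is an assumption. It follows the HBase scan convention that the start row is inclusive and the stop row exclusive.

```python
import bisect


def scan_bounds(user_id: str, start_ts: str, stop_ts: str):
    """Start row (inclusive) and stop row (exclusive) for one user's range."""
    return f"{user_id}-{start_ts}", f"{user_id}-{stop_ts}"


def scan(sorted_keys, start_row: str, stop_row: str):
    """Emulate a range scan over row keys kept in sorted order."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


# Hypothetical row keys in userid-yyyyMMddHHmmss form.
keys = sorted([
    "u001-20121010090000",
    "u001-20121010120000",
    "u001-20121011080000",
    "u002-20121010100000",
])

start_row, stop_row = scan_bounds("u001", "20121010000000", "20121011000000")
hits = scan(keys, start_row, stop_row)
print(hits)
```

Because the keys sort by user first and time second, the scan touches only the rows for that user inside the requested window.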