HBase >> mail # user >> key design


Re: key design
We have 200,000,000 user-ids, and I think the user-id is a good fit for the
lead position of the key. Is that OK?

What about search performance? Which approach gives better results?

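(Editorial aside: the two candidate key layouts, and the user-id + time-range
scan they need to serve, can be sketched outside HBase with plain byte-string
keys. HBase stores rows sorted by key bytes, so a time-range query for one
user is a contiguous range scan; the padding widths and helper names below are
illustrative, not anything from the thread.)

```python
from bisect import bisect_left, bisect_right

# Approach 1: one row per event, key = userid-timestamp. Zero-padding makes
# lexicographic byte order match numeric order, which is how HBase sorts rows.
def key_v1(user_id: int, epoch_secs: int) -> bytes:
    return b"%010d-%010d" % (user_id, epoch_secs)

# Approach 2: one row per user per day, key = userid-yyyyMMdd;
# the HHmmss part becomes a column qualifier inside that row.
def key_v2(user_id: int, yyyymmdd: str) -> bytes:
    return b"%010d-%s" % (user_id, yyyymmdd.encode())

# With approach 1, "all events for user 42 between t=150 and t=350" is a
# scan over a contiguous slice of the sorted key space.
rows = sorted(key_v1(42, t) for t in (100, 200, 300, 400))
lo, hi = key_v1(42, 150), key_v1(42, 350)
hits = rows[bisect_left(rows, lo):bisect_right(rows, hi)]
print(hits)  # the two events with timestamps 200 and 300
```

With approach 2 the same query scans fewer, fatter rows (one per day) and then
filters column qualifiers by HHmmss inside each row.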
On Wed, Oct 10, 2012 at 11:21 PM, Shumin Wu <[EMAIL PROTECTED]> wrote:

> HBase: The Definitive Guide has a good discussion in Chapter 9, Tall-Narrow
> vs. Flat-Wide Tables. The suggested style is to design the table tall-narrow
> to make splitting easy.
>
> Also, in approach 2, why do you need the "-yyyyMMdd" part? If you want to
> keep a creation time, I think it's better to store it in a column. Keep in
> mind that every row would carry the storage overhead of this trailing part.
>
> Shumin
>
> On Wed, Oct 10, 2012 at 12:08 PM, Jerry Lam <[EMAIL PROTECTED]> wrote:
>
> > That's true. Then there would be at most 86,400 records per day per
> > userid, which is about 100MB per day. I don't see much difference between
> > the two approaches from the storage perspective.
> >
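(Editorial aside: the arithmetic behind those two estimates, assuming at most
one record per user per second, as the HHmmss column granularity implies, and
~1KB of content per record, works out as follows:)

```python
records_per_day_total = 3_000_000_000  # from the original question
record_size = 1024                     # max 1KB of content per record
seconds_per_day = 24 * 60 * 60         # 86,400: the per-user daily cap
                                       # at one record per second

# Per-user daily volume: 86,400 records * 1KB, i.e. the "about 100MB" figure.
per_user_daily = seconds_per_day * record_size

# Cluster-wide daily volume: 3 billion records * 1KB, i.e. the "~3TB" figure.
cluster_daily = records_per_day_total * record_size

print(per_user_daily // 2**20, "MiB per user per day")  # 84
print(cluster_daily // 2**40, "TiB per day")            # 2
```

Both numbers ignore key bytes and HBase per-cell overhead, so the real on-disk
figures are somewhat higher.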
> > On Wed, Oct 10, 2012 at 1:09 PM, Doug Meil <[EMAIL PROTECTED]> wrote:
> >
> > > Hi there-
> > >
> > > Given that the userid is in the lead position of the key in both
> > > approaches, I'm not sure that he'd have a region hotspotting problem,
> > > because the userid should be able to offer some spread.
> > >
> > >
> > > On 10/10/12 12:55 PM, "Jerry Lam" <[EMAIL PROTECTED]> wrote:
> > >
> > > >Hi:
> > > >
> > > >So you are saying you have ~3TB of data stored per day?
> > > >
> > > >Using the second approach, all of one user's data for a given day goes
> > > >into a single row (and therefore to one regionserver) no matter what you
> > > >do, because HBase doesn't split a single row.
> > > >
> > > >Using the first approach, data will spread across regionservers, but
> > > >each regionserver will still see sequential, time-ordered writes, since
> > > >this is a time-series problem.
> > > >
> > > >Best Regards,
> > > >
> > > >Jerry
> > > >
> > > >On Wed, Oct 10, 2012 at 11:24 AM, yutoo yanio <[EMAIL PROTECTED]>
> > > >wrote:
> > > >
> > > >> hi
> > > >> I have a question about key & column design.
> > > >> In my application we get 3,000,000,000 records every day.
> > > >> Each record contains: user-id, "time stamp", content (max 1KB).
> > > >> We need to store records for one year, which means we will have about
> > > >> 1,000,000,000,000 records after one year.
> > > >> We only ever search one user-id over a range of "time stamp".
> > > >> The table can be designed in two ways:
> > > >> 1. key=userid-timestamp and column:=content
> > > >> 2. key=userid-yyyyMMdd and column:HHmmss=content
> > > >>
> > > >>
> > > >> In the first design we have a tall-narrow table, but with very, very
> > > >> many rows; in the second design we have a flat-wide table.
> > > >> Which of them has better performance?
> > > >>
> > > >> thanks.
> > >
> > >
> > >
> >
>
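(Editorial aside: Doug's point, that a userid-leading key spreads writes even
for time-series data, can be illustrated with a toy model. Regions own
contiguous slices of the sorted key space; the region count, boundaries, and
user population below are made up for illustration, and the contrast is with a
hypothetical timestamp-leading key, which neither proposed design uses.)

```python
import random

# Toy cluster: 10 regions split on the leading userid digits.
boundaries = [b"%010d" % (u * 20_000_000) for u in range(1, 10)]

def region_for(key: bytes) -> int:
    # A key lands in the region whose range contains it: count the
    # boundaries that sort at or below the key.
    return sum(1 for b in boundaries if key >= b)

random.seed(0)
now = 1349900000  # a burst of writes all arriving around the same time

# userid-leading keys: 1,000 concurrent writes from random users.
v1_regions = {region_for(b"%010d-%010d" % (random.randrange(200_000_000), now))
              for _ in range(1000)}

# timestamp-leading keys: the same burst all sorts to the key-space tail.
ts_regions = {region_for(b"%010d-%010d" % (now, random.randrange(200_000_000)))
              for _ in range(1000)}

print(len(v1_regions), "regions hit with userid-leading keys")
print(len(ts_regions), "region hit with timestamp-leading keys")
```

With the userid leading, the burst fans out across nearly every region; with
the timestamp leading, every write lands on the single tail region, which is
the classic time-series hotspot Jerry describes within each user's key range.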