HBase, mail # user - hbase hashing algorithm and schema design


Re: hbase hashing algorithm and schema design
Otis Gospodnetic 2011-06-09, 01:02
Sam, would HBaseWD help you here?
See
http://search-hadoop.com/m/AQ7CG2GkiO/hbasewd&subj=+ANN+HBaseWD+Distribute+Sequential+Writes+in+HBase
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/
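
The core idea behind HBaseWD is to spread sequential writes by prepending a
small bucket prefix to every row key. Below is a minimal sketch of that idea in
plain HBase/Java terms (my own illustration, not HBaseWD's API; the 16-bucket
count, MD5 hashing, and <bucket><eventid><timestamp> layout are assumptions):

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class BucketedKeys {
        // Assumption: a small multiple of the number of region servers.
        static final int NUM_BUCKETS = 16;

        // Row key = <bucket><eventid><timestamp>.  The bucket is a hash of the
        // rest of the key, so consecutive timestamps for one event id land in
        // different buckets (and so different regions) instead of all hitting
        // the tail of a single region.
        static byte[] rowKey(byte[] eventId, long timestamp) throws NoSuchAlgorithmException {
            byte[] natural = new byte[eventId.length + 8];
            System.arraycopy(eventId, 0, natural, 0, eventId.length);
            for (int i = 0; i < 8; i++) {
                natural[eventId.length + i] = (byte) (timestamp >>> (56 - 8 * i));
            }
            byte[] digest = MessageDigest.getInstance("MD5").digest(natural);
            byte bucket = (byte) ((digest[0] & 0xFF) % NUM_BUCKETS);

            byte[] key = new byte[1 + natural.length];
            key[0] = bucket;
            System.arraycopy(natural, 0, key, 1, natural.length);
            return key;
        }
    }

The price of the prefix is that a time-range read for one event id becomes one
bounded scan per bucket, which comes up again further down the thread.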

----- Original Message ----
> From: Sam Seigal <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Sent: Wed, June 8, 2011 4:54:24 PM
> Subject: Re: hbase hashing algorithm and schema design
>
> On Wed, Jun 8, 2011 at 12:40 AM, tsuna <[EMAIL PROTECTED]> wrote:
>
> >  On Tue, Jun 7, 2011 at 7:56 PM, Kjew Jned <[EMAIL PROTECTED]> wrote:
> > > I was studying the OpenTSDB example, where they also prefix the row keys
> > > with the event id.
> > >
> > > I further modified my row keys to have this ->
> > >
> > > <eventid>  <uuid>  <yyyy-mm-dd>
> > >
> > > The uuid is fairly unique and random.
> > > Does appending a uuid to the event id help the distribution?
> >
> > Yes it will help the distribution, but it will also make certain query
> > patterns harder.  You can no longer scan for a time range, for a given
> > eventid.  How to solve this problem depends on how you generate the
> > UUIDs.
> >
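
To make this point concrete: with <eventid><timestamp> keys, a time range for a
single event id is one contiguous scan, and a random uuid in the middle of the
key breaks that. A minimal sketch with the HBase client API (the "events" table
name and the key encoding are assumptions for illustration):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScanExample {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "events");

            // Keys laid out as <eventid><timestamp>: the rows for one event id
            // over a time window are contiguous, so one start/stop pair covers them.
            byte[] start = Bytes.add(Bytes.toBytes("eventA"), Bytes.toBytes(1307000000000L));
            byte[] stop  = Bytes.add(Bytes.toBytes("eventA"), Bytes.toBytes(1307500000000L));
            Scan scan = new Scan(start, stop);

            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                // process each row in the range
            }
            scanner.close();
            table.close();
            // With <eventid><uuid><date> the matching rows are scattered across
            // the uuid space, so no single [start, stop) interval selects them.
        }
    }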
> > I wouldn't recommend doing this unless you've already tried simpler
> > approaches and reached the conclusion that they don't work.  Many
> > people seem to be afraid of creating hot spots in their tables without
> > having first-hand evidence that the hot spots would actually be a
> > problem.
> >
> >
> >
> >
> Can I not use regex row filters to query for date ranges? There is an added
> overhead for the client to order them, and it is not an efficient query, but
> it is certainly possible to do so? Am I wrong?
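
A regex row filter can express this, but it is evaluated against every row on
the server side, so it amounts to a full-table scan and the client still has to
re-order the results by date. A hedged sketch, assuming string row keys laid
out as <eventid><uuid><yyyy-mm-dd> and an "events" table (both assumptions, not
from the thread):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.RegexStringComparator;
    import org.apache.hadoop.hbase.filter.RowFilter;

    public class RegexDateFilter {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "events");

            Scan scan = new Scan();
            // Keep rows for eventA whose trailing date falls in June 2011; the
            // filter touches every row, and results arrive ordered by
            // <eventid><uuid>, not by date.
            scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
                    new RegexStringComparator("^eventA.*2011-06-\\d{2}$")));

            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                // collect, then sort client-side by the date suffix
            }
            scanner.close();
            table.close();
        }
    }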
>
>
>
> > > Let us say I have 4 region servers to start off with and I start the
> >
> > If you have only 4 region servers, your goal should be to have roughly
> > 25% of writes going to each server.  It doesn't matter if the 25%
> > slice of one server is going to a single region or not.  As long as
> > all the writes don't go to the same row (which would cause lock
> > contention on that row), you'll get the same kind of performance.
> >
>
> I am worried about the following scenario, hence putting a lot of thought
> into how to design this schema.
>
> For example, for simplicity's sake, say I only have two event ids, A and B,
> and the traffic is equally distributed between them, i.e. 50% of my traffic
> is event A and 50% is event B. I have two region servers running on two
> physical nodes with the following schema -
>
> <eventid>  <timestamp>
>
> Ideally, I now have all of A's traffic going into regionserver A and all of
> B's traffic going into regionserver B. The cluster is able to handle this
> traffic, and the write load is distributed 50-50.
>
> However, now I reach a point where I need to scale, since the two region
> servers are not able to cope with the write traffic. Adding extra
> regionservers to the cluster is not going to make any difference, since only
> the physical machine holding the tail end of the region is the one that will
> receive the traffic. Most of the rest of my cluster is going to be idle.
>
> To generalize, if I want to scale to the point where the # of machines is
> greater than the # of unique event ids, I have no way to distribute the load
> in an efficient manner, since I cannot distribute the load of a single event
> id across multiple machines (without adding a uuid somewhere in the middle
> and sacrificing data locality on ordered timestamps).
>
> Do you think my concern is genuine? Thanks a lot for your help.
>
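
One hedged way around that trade-off (roughly what HBaseWD automates) is to
accept a salt prefix but keep range queries by issuing one bounded scan per
bucket and merging the results client-side. A sketch, reusing the assumed
<bucket><eventid><timestamp> layout and 16 buckets from the earlier sketch:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BucketedRangeScan {
        static final int NUM_BUCKETS = 16;  // must match the bucket count used for writes

        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "events");
            byte[] eventId = Bytes.toBytes("eventA");
            byte[] from = Bytes.toBytes(1307000000000L);
            byte[] to   = Bytes.toBytes(1307500000000L);

            // One scan per bucket; within each bucket the keys are still ordered
            // by <eventid><timestamp>, so each scan is a cheap bounded range.
            for (int b = 0; b < NUM_BUCKETS; b++) {
                byte[] prefix = new byte[] { (byte) b };
                Scan scan = new Scan(Bytes.add(prefix, eventId, from),
                                     Bytes.add(prefix, eventId, to));
                ResultScanner scanner = table.getScanner(scan);
                for (Result r : scanner) {
                    // merge results across buckets client-side
                    // (the scans can also be issued in parallel)
                }
                scanner.close();
            }
            table.close();
        }
    }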
>
> >
> > > workload, how does HBase decide how many regions it is going to create,
> > > and what key is going to go into what region?
> >
> > Your table starts as a single region.  As this region fills up, it'll
> > split.  Where it splits is chosen by HBase.  HBase tries to split the
> > region "in the middle", so that roughly the same number of keys ends up in