HBase >> mail # user >> hbase hashing algorithm and schema design


Re: hbase hashing algorithm and schema design
Sam, would HBaseWD help you here?
See
http://search-hadoop.com/m/AQ7CG2GkiO/hbasewd&subj=+ANN+HBaseWD+Distribute+Sequential+Writes+in+HBase
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/
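HBaseWD's core idea is bucket-prefix salting: sequential row keys get a small, bounded salt prefix so writes spread across regions, at the cost of fanning each scan out over all buckets. A minimal sketch of that idea (this is not HBaseWD's actual API; the bucket count, hash choice, and key layout are illustrative assumptions):

```python
import hashlib

NUM_BUCKETS = 8  # assumed bucket count; small and fixed so scans stay cheap

def salted_key(original_key: bytes) -> bytes:
    """Prefix the key with a deterministic one-byte bucket id.

    Sequential original keys land in different buckets, so writes spread
    across regions instead of hammering the tail of one region.
    """
    bucket = hashlib.md5(original_key).digest()[0] % NUM_BUCKETS
    return bytes([bucket]) + original_key

def scan_key_ranges(start: bytes, stop: bytes):
    """A range scan must now fan out: one (start, stop) pair per bucket."""
    return [(bytes([b]) + start, bytes([b]) + stop)
            for b in range(NUM_BUCKETS)]
```

The salt is derived from the key itself, so the same row always maps to the same bucket and reads stay deterministic.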

----- Original Message ----
> From: Sam Seigal <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Sent: Wed, June 8, 2011 4:54:24 PM
> Subject: Re: hbase hashing algorithm and schema design
>
> On Wed, Jun 8, 2011 at 12:40 AM, tsuna <[EMAIL PROTECTED]> wrote:
>
> > On Tue, Jun 7, 2011 at 7:56 PM, Kjew Jned <[EMAIL PROTECTED]> wrote:
> > > I was studying the OpenTSDB example, where they also prefix the row
> > > keys with event id.
> > >
> > > I further modified my row keys to have this ->
> > >
> > > <eventid> <uuid> <yyyy-mm-dd>
> > >
> > > The uuid is fairly unique and random.
> > > Does appending a uuid to the event id help the distribution?
> >
> > Yes it will help the distribution, but it will also make certain query
> > patterns harder. You can no longer scan for a time range for a given
> > eventid. How to solve this problem depends on how you generate the
> > UUIDs.
> >
> > I wouldn't recommend doing this unless you've already tried simpler
> > approaches and reached the conclusion that they don't work. Many
> > people seem to be afraid of creating hot spots in their tables without
> > having first-hand evidence that the hot spots would actually be a
> > problem.
> >
> >
> >
> >
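tsuna's point about losing time-range scans shows up in a toy example: once a random uuid sits between the eventid and the date, HBase's lexicographic row order is dominated by the uuid, so rows for consecutive days are no longer adjacent (the uuid values below are made up for illustration):

```python
# Three rows for the same event on consecutive days, with their
# (hypothetical) random uuids, keyed <eventid>|<uuid>|<yyyy-mm-dd>:
rows = [
    ("7f3a9c", "2011-06-01"),
    ("02be41", "2011-06-02"),
    ("c91d55", "2011-06-03"),
]
# HBase stores rows in sorted key order, which a sort() simulates here.
keys = sorted(f"evtA|{u}|{d}" for u, d in rows)
dates_in_scan_order = [k.split("|")[2] for k in keys]
# -> ['2011-06-02', '2011-06-01', '2011-06-03']: the uuid dominates the
# sort, so a single start/stop-row scan over a date range is impossible.
```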
> Can I not use regex row filters to query for date ranges? There is an
> added overhead for the client to order them, and it is not an efficient
> query, but it is certainly possible to do so, is it not? Am I wrong?
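A regex row filter can indeed select a date range under this key layout, but the inefficiency Sam suspects is real: the filter is evaluated against every row rather than seeking into a contiguous range. A small client-side simulation of that behaviour (the keys and pattern are illustrative; HBase's RowFilter with a RegexStringComparator behaves analogously on the server side):

```python
import re

# Rows keyed <eventid>|<uuid>|<yyyy-mm-dd>, as in Sam's layout.
table = [
    "evtA|02be41|2011-06-01",
    "evtA|7f3a9c|2011-06-02",
    "evtB|11aa22|2011-06-01",
    "evtA|c91d55|2011-06-07",
]

# The regex CAN express "evtA, June 1st or 2nd", but every row of the
# table must still be examined -- O(table size), not O(result size).
pattern = re.compile(r"^evtA\|[0-9a-f]+\|2011-06-0[12]$")
examined = 0
matches = []
for key in table:
    examined += 1
    if pattern.match(key):
        matches.append(key)
# examined == len(table), even though only two rows match.
```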
>
>
>
> > > Let us say if I have 4 region servers to start off with and I start the
> >
> > If you have only 4 region servers, your goal should be to have roughly
> > 25% of writes going to each server. It doesn't matter if the 25%
> > slice of one server is going to a single region or not. As long as
> > all the writes don't go to the same row (which would cause lock
> > contention on that row), you'll get the same kind of performance.
> >
>
> I am worried about the following scenario, hence putting a lot of thought
> into how to design this schema.
>
> For example, for simplicity's sake, say I only have two event ids, A and
> B, and the traffic is equally distributed between them, i.e. 50% of my
> traffic is event A and 50% is event B. I have two region servers running
> on two physical nodes, with the following schema -
>
> <eventid> <timestamp>
>
> Ideally, I now have all of A's traffic going into regionserver A and all
> of B's traffic going into regionserver B. The cluster is able to hold
> this traffic, and the write load is distributed 50-50.
>
> However, now I reach a point where I need to scale, since the two region
> servers are not able to cope with the write traffic. Adding extra
> regionservers to the cluster is not going to make any difference, since
> only the physical machine holding the tail end of the region is the one
> that will receive the traffic. Most of the rest of my cluster is going
> to be idle.
>
> To generalize, if I want to scale to where the # of machines is greater
> than the # of unique event ids, I have no way to distribute the load in
> an efficient manner, since I cannot distribute the load of a single event
> id across multiple machines (without adding a uuid somewhere in the
> middle and sacrificing data locality on ordered timestamps).
>
> Do you think my concern is genuine? Thanks a lot for your help.
>
>
> >
> > > workload, how does HBase decide how many regions it is going to
> > > create, and what key is going to go into what region ?
> >
> > Your table starts as a single region. As this region fills up, it'll
> > split. Where it splits is chosen by HBase. HBase tries to split the
> > region "in the middle", so that roughly the same number of keys ends
> > up in each half.
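The split behaviour described above can be pictured with a toy model: when a region fills up, HBase picks a split key near the middle (in practice, roughly the midkey of the region's largest store file) so that each daughter region receives about half the rows:

```python
def split_region(sorted_keys):
    """Toy split: divide a full region at its middle key.

    Returns the two daughter key ranges and the chosen split key, which
    becomes the start key of the second daughter.
    """
    mid = len(sorted_keys) // 2
    split_key = sorted_keys[mid]
    return sorted_keys[:mid], sorted_keys[mid:], split_key

left, right, split_key = split_region([f"row{i:04d}" for i in range(100)])
# left covers row0000..row0049, right covers row0050..row0099.
```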