-Re: hbase hashing algorithm and schema design
Otis Gospodnetic 2011-06-09, 01:02
Sam, would HBaseWD help you here?
Sematext :: http://sematext.com/ :: Solr - Lucene - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/
----- Original Message ----
> From: Sam Seigal <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Sent: Wed, June 8, 2011 4:54:24 PM
> Subject: Re: hbase hashing algorithm and schema design
> On Wed, Jun 8, 2011 at 12:40 AM, tsuna <[EMAIL PROTECTED]> wrote:
> > On Tue, Jun 7, 2011 at 7:56 PM, Kjew Jned <[EMAIL PROTECTED]> wrote:
> > > I was studying the OpenTSDB example, where they also prefix the row keys
> > with
> > > event id.
> > >
> > > I further modified my row keys to have this ->
> > >
> > > <eventid> <uuid> <yyyy-mm-dd>
> > >
> > > The uuid is fairly unique and random.
> > > Is appending a uuid to the event id help the distribution ?
> > Yes it will help the distribution, but it will also make certain query
> > patterns harder. You can no longer scan for a time range, for a given
> > eventid. How to solve this problem depends on how you generate the
> > UUIDs.
> > I wouldn't recommend doing this unless you've already tried simpler
> > approaches and reached the conclusion that they don't work. Many
> > people seem to be afraid of creating hot spots in their tables without
> > having first-hand evidence that the hot spots would actually be a
> > problem.
> Can I not use regex row filters to query for date ranges ? There is an added
> overhead for the client to
> order them, and it is not an efficient query, but it is certainly possible
> to do so ? Am I wrong ?
> > > Let us say if I have 4 region servers to start off with and I start the
> > If you have only 4 region servers, your goal should be to have roughly
> > 25% of writes going to each server. It doesn't matter if the 25%
> > slice of one server is going to a single region or not. As long as
> > all the writes don't go to the same row (which would cause lock
> > contention on that row), you'll get the same kind of performance.
> I am worried about the following scenario, hence putting a lot of thought
> into how to design this schema.
> For example, for simplicity sake, I only have two event Ids A and B, and the
> traffic is equally distributed
> between them i.e. 50% of my traffic is event A and 50% is event B. I have
> two region servers running, on
> two physical nodes with the following schema -
> <eventid> <timestamp>
> Ideally, I now have all of A traffic going into regionServerA and all of B
> traffic going into regionserver B.
> The cluster is able to hold this traffic, and the write load is distributed
> However, now I reach a point where I need to scale, since the two clusters
> are not being able to
> cope with the write traffic. Adding extra regionservers to the cluster is
> not going to make any difference
> , since only the physical machine holding the tail end of the region is the
> one that will receive
> the traffic. Most of my other cluster is going to be idle.
> To generalize, if I want to scale where the # of machines is greater than
> the # of unique event ids, I have no way to
> distribute the load in an efficient manner, since I cannot distribute the
> load of a single event id across multiple machines
> (without adding a uuid somewhere in the middle and sacrificing data locality
> on ordered timestamps).
> Do you think my concern is genuine ? Thanks a lot for your help.
> > > workload, how does HBase decide how many regions is it going to create,
> > and what
> > > key is going to go into what region ?
> > Your table starts as a single region. As this region fills up, it'll
> > split. Where it split is chosen by HBase. HBase tries to spit the
> > region "in the middle", so that roughly the number of keys ends up in