HBase >> mail # user >> hbase hashing algorithm and schema design


Re: hbase hashing algorithm and schema design
On Tue, Jun 7, 2011 at 7:56 PM, Kjew Jned <[EMAIL PROTECTED]> wrote:
> I was studying the OpenTSDB example, where they also prefix the row keys with
> event id.
>
> I further modified my row keys to have this ->
>
> <eventid> <uuid>  <yyyy-mm-dd>
>
> The uuid is fairly unique and random.
> Does appending a uuid to the event id help the distribution?

Yes, it will help the distribution, but it will also make certain query
patterns harder.  You can no longer scan a time range for a given
eventid.  How to solve this problem depends on how you generate the
UUIDs.
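
To see why the scan breaks, it helps to remember that HBase stores rows
sorted by raw key bytes.  A small sketch (hypothetical keys with made-up
uuids and a `|` separator, purely for illustration):

```java
import java.util.*;

public class KeyOrder {
    // Sort keys the way HBase stores rows: lexicographically.  Plain
    // String order is equivalent to byte order for these ASCII keys.
    static List<String> hbaseOrder(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // Keys in the <eventid>|<uuid>|<yyyy-mm-dd> layout:
        List<String> withUuid = hbaseOrder(Arrays.asList(
                "ev1|9f3a|2011-06-01",
                "ev1|02bc|2011-06-02",
                "ev1|c781|2011-06-03"));
        System.out.println(withUuid);
        // The uuid in the middle scatters the dates, so no single
        // contiguous scan covers "ev1, June 1st through June 3rd".

        // Without the uuid, the same rows sort into one scannable range:
        List<String> withoutUuid = hbaseOrder(Arrays.asList(
                "ev1|2011-06-03", "ev1|2011-06-01", "ev1|2011-06-02"));
        System.out.println(withoutUuid);
        // → [ev1|2011-06-01, ev1|2011-06-02, ev1|2011-06-03]
    }
}
```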

I wouldn't recommend doing this unless you've already tried simpler
approaches and reached the conclusion that they don't work.  Many
people seem to be afraid of creating hot spots in their tables without
having first-hand evidence that the hot spots would actually be a
problem.

> Let us say if I have 4 region servers to start off with and I start the

If you have only 4 region servers, your goal should be to have roughly
25% of writes going to each server.  It doesn't matter if the 25%
slice of one server is going to a single region or not.  As long as
all the writes don't go to the same row (which would cause lock
contention on that row), you'll get the same kind of performance.
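
One common way to spread writes evenly across a small, fixed number of
servers is a deterministic salt prefix.  A minimal sketch, assuming 4
buckets (matching the 4 region servers) and a `|` separator; both are
illustrative choices, not a prescription:

```java
public class SaltedKey {
    static final int BUCKETS = 4;  // assumption: one bucket per region server

    // Prepend a salt derived from the key itself, so writes spread
    // across BUCKETS key ranges instead of all landing in one region.
    // Because the salt is deterministic, a reader can recompute it.
    static String salted(String key) {
        int bucket = Math.floorMod(key.hashCode(), BUCKETS);
        return bucket + "|" + key;
    }

    public static void main(String[] args) {
        System.out.println(salted("ev1|2011-06-01"));
        System.out.println(salted("ev2|2011-06-01"));
        // The trade-off: a time-range read now has to fan out into
        // BUCKETS parallel scans, one per salt value.
    }
}
```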

> workload, how does HBase decide how many regions it is going to create, and
> what key is going to go into which region?

Your table starts as a single region.  As this region fills up, it'll
split.  Where it splits is chosen by HBase.  HBase tries to split the
region "in the middle", so that roughly the same number of keys ends up
in each new daughter region.

You can also manually pre-split a table (from the shell).  This can be
advantageous in certain situations where you know what your table will
look like and you have a very high write volume coupled with
aggressive latency requirements for >95th percentile.
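
If you go the salted-prefix route, computing the pre-split points is
straightforward: divide the salt byte's range evenly.  A sketch (the
single-salt-byte layout is an assumption carried over from above):

```java
public class PreSplit {
    // Compute (regions - 1) split keys that cut a one-byte salt
    // prefix, range [0, 256), into `regions` even slices.
    static byte[][] splitPoints(int regions) {
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            splits[i - 1] = new byte[]{ (byte) (i * 256 / regions) };
        }
        return splits;
    }

    public static void main(String[] args) {
        // For 4 regions, the three boundaries land at bytes 64, 128, 192.
        for (byte[] s : splitPoints(4)) {
            System.out.println(s[0] & 0xFF);
        }
        // These byte arrays are what you'd hand to the shell's create
        // command (or the admin API) as the table's initial split keys.
    }
}
```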

> I could have gone with something like
>
> <uuid><eventid><yyyy-mm-dd> , but would not like to, since my queries are always
> going to be against a particular event id type, and i would like them to be
> spatially located.

If you have a lot of data per <eventid>, then putting the <uuid> in
between the <eventid> and the <yyyy-mm-dd> will screw up data locality
anyway.  But the exact details depend on how you pick the <uuid>.
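
For instance, if the "uuid" is derived deterministically from fields the
reader already knows, the full row key can be reconstructed at query
time and fetched with a point Get instead of a scan.  A hypothetical
sketch (the hash choice and `|` separator are illustrative only, and
this only works if one row per eventid/date pair is acceptable):

```java
public class DerivedUuid {
    // Derive the "uuid" from the eventid and date themselves, so a
    // reader holding those two fields can rebuild the exact row key.
    static String rowKey(String eventId, String date) {
        String uuid = Integer.toHexString((eventId + date).hashCode());
        return eventId + "|" + uuid + "|" + date;
    }

    public static void main(String[] args) {
        // Same inputs always yield the same key, so no scan is needed:
        System.out.println(rowKey("ev1", "2011-06-01"));
    }
}
```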

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com