HBase >> mail # user >> Of hbase key distribution and query scalability, again.


Re: Of hbase key distribution and query scalability, again.
Dmitriy,

If I understand you right, what you're asking about might be called "Read Hotspotting". For an obvious example, if I distribute my data nicely over the cluster but then say:

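// Every iteration reads the same row, so every Get hits the same region server.
// (long counter: 10000000000 does not fit in a Java int)
for (long x = 0; x < 10000000000L; x++) {
   htable.get(new Get(Bytes.toBytes("row1")));
}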

Then naturally I'm only putting read load on the region server that hosts "row1". That's contrived, of course; you'd never really do that. But I can imagine plenty of situations where query load is imbalanced with respect to the leading part of a table's row key. It's not fundamentally different from "write hotspotting", except that it's probably less common (write hotspotting happens frequently because ascending data, such as a time series or a number sequence, is a common thing to insert into a database).
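As a rough illustration of that kind of imbalance, here is a minimal sketch that tallies a sample of read keys per region. Everything in it is hypothetical and independent of the HBase client API: the class name RegionReadLoad, the method readsPerRegion, and the idea that you have a list of region start keys and a sample of requested row keys at hand. It also compares keys as Java Strings, whereas HBase orders row keys as raw bytes, so treat it as a simplification.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper (not part of HBase): counts sampled reads per region.
final class RegionReadLoad {

    // regionStartKeys: start keys of the regions; the first region starts at "".
    // sampledReadKeys: row keys observed in a sample of Get requests.
    static TreeMap<String, Long> readsPerRegion(List<String> regionStartKeys,
                                                List<String> sampledReadKeys) {
        TreeMap<String, Long> counts = new TreeMap<>();
        for (String start : regionStartKeys) {
            counts.put(start, 0L);
        }
        for (String key : sampledReadKeys) {
            // The owning region is the one with the greatest start key <= the row key.
            String owner = counts.floorKey(key);
            if (owner == null) {
                throw new IllegalArgumentException("no region covers key: " + key);
            }
            counts.merge(owner, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> starts = List.of("", "g", "n", "t");
        List<String> reads = List.of("row1", "row1", "row1", "apple", "row1", "zebra");
        // Prints the number of sampled reads landing in each region.
        System.out.println(readsPerRegion(starts, reads));
    }
}
```

In the sample in main, four of the six reads fall into the region that starts at "n", even though the regions themselves could hold equal amounts of data.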

I guess the simple answer is: if you know your read patterns are unevenly distributed, that might be something to take into account by custom-partitioning the data into regions. I don't know of any other technique (short of some external caching mechanism) that'd alleviate this; at base, you still have to ask exactly one region server (RS) for any given piece of data.
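To make the custom-partitioning idea a bit more concrete, here is a minimal sketch that picks split keys at evenly spaced quantiles of a sample of read keys, so that each key range receives roughly the same share of sampled reads. The class QueryWeightedSplits and the method splitKeys are made up for illustration, and how you would collect the read sample is left open. Keys are compared as Java Strings here, not as raw bytes the way HBase compares them. The resulting keys could, for instance, be converted with Bytes.toBytes and supplied as the split keys when creating a pre-split table.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of HBase): derives split keys from read samples.
final class QueryWeightedSplits {

    // Returns at most numRegions - 1 split keys taken at evenly spaced
    // quantiles of the sampled read keys.
    static List<String> splitKeys(List<String> sampledReadKeys, int numRegions) {
        if (numRegions < 1 || sampledReadKeys.isEmpty()) {
            throw new IllegalArgumentException("need numRegions >= 1 and at least one sample");
        }
        List<String> sorted = new ArrayList<>(sampledReadKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            int index = (int) ((long) i * sorted.size() / numRegions);
            String candidate = sorted.get(index);
            // Skip repeated candidates: one very hot row cannot be split across regions.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> reads = List.of("row1", "row1", "row1", "a", "row1", "row1", "z", "row1");
        // With a single dominant key, all quantiles collapse onto that key.
        System.out.println(splitKeys(reads, 4));
    }
}
```

Note what happens in main: when one row dominates the sample, every quantile lands on the same key and only one split survives. That is the limit described above: a single hot row still lives on exactly one region server, so partitioning helps with skew across key ranges but not with skew onto one key.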

Ian

On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:

> Hello,
>
> I'd like to collect opinions from HBase experts on query
> uniformity and whether any advanced technique currently exists
> in HBase to cope with the problems of query uniformity beyond just
> maintaining a uniform key distribution.
>
> I know we start with the statement that in order to scale queries, we
> need them uniformly distributed over key space. The next advice people
> get is to use a uniformly distributed key. Then, the thinking goes, the
> query load will also be uniformly distributed among regions.
>
> For what seems to be an embarrassingly long time I was missing the
> point, however, that using uniformly distributed keys does not equate
> to a uniform distribution of queries, since it doesn't account for
> the skew of queries over the key space itself. This skew can be
> bad enough under some circumstances to create query hot spots in the
> cluster, which could have been avoided had region splits been
> balanced based on query load rather than on data size per se (a sort
> of dynamic query distribution sampling to equalize the load,
> similar to how TotalOrderPartitioner does random data sampling to
> build the distribution of key skew in the incoming data).
>
> To cut a long story short, is region size the only current HBase
> technique to balance load, esp. w.r.t. query load? Or perhaps there are
> some more advanced techniques to do that?
>
> Thank you very much.
> -Dmitriy