HBase >> mail # user >> Of hbase key distribution and query scalability, again.


Re: Of hbase key distribution and query scalability, again.
Dmitriy,

If I understand you right, what you're asking about might be called "Read Hotspotting". For an obvious example, if I distribute my data nicely over the cluster but then say:

// Same Get every time, so every request lands on the one region hosting "row1".
for (long x = 0; x < 10000000000L; x++) {
   htable.get(new Get(Bytes.toBytes("row1")));
}

Then naturally I'm only putting read load on the region server that hosts "row1". That's contrived, of course; you'd never really do that. But I can imagine plenty of situations where there's an imbalance in query load w/r/t the leading part of the row key of a table. It's not fundamentally different from "write hotspotting", except that it's probably less common (write hotspotting happens frequently because ascending data, such as a time series or a number sequence, is a common thing to insert into a database).

I guess the simple answer is: if you know your read patterns are unevenly distributed, that might be something to take into account when custom-partitioning the data into regions. I don't know of any other technique (short of some external caching mechanism) that'd alleviate this; at base, you still have to ask exactly one RS for any given piece of data.
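
As a rough illustration of what such custom partitioning could look like, here is a minimal sketch that picks split keys from sampled read counts so that each key range carries roughly the same read load. Everything in it is an assumption for illustration: the class `ReadLoadSplits`, the method `chooseSplitKeys`, and the idea that per-row read counts have already been sampled somewhere are all hypothetical, and none of it is an HBase API. It also compares keys as Java strings, which matches HBase's byte ordering only for plain ASCII keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of HBase: derives split keys from sampled read counts.
public class ReadLoadSplits {

    /**
     * Returns at most (numRegions - 1) split keys. Each split key is the first
     * row key of a new region, chosen so that the sampled read load in front of
     * it has reached the next 1/numRegions share of the total.
     */
    public static List<String> chooseSplitKeys(Map<String, Long> readCounts, int numRegions) {
        if (numRegions < 1) {
            throw new IllegalArgumentException("numRegions must be >= 1");
        }
        // Sort the sampled keys (assumption: ASCII keys, so String order == byte order).
        TreeMap<String, Long> sorted = new TreeMap<>(readCounts);
        long total = 0;
        for (long count : sorted.values()) {
            total += count;
        }
        List<String> splits = new ArrayList<>();
        if (total == 0 || numRegions == 1) {
            return splits;
        }
        long cumulative = 0;   // read load of all keys before the current one
        int nextBoundary = 1;  // index of the next load share we still need to close
        for (Map.Entry<String, Long> entry : sorted.entrySet()) {
            boolean shareReached = cumulative * numRegions >= total * nextBoundary;
            if (nextBoundary < numRegions && cumulative > 0 && shareReached) {
                splits.add(entry.getKey());
                // A very hot key may cover several shares at once; skip them all.
                while (nextBoundary < numRegions
                        && cumulative * numRegions >= total * nextBoundary) {
                    nextBoundary++;
                }
            }
            cumulative += entry.getValue();
        }
        return splits;
    }
}
```

With sampled counts of a=10, b=10, c=40, d=40 and two regions, this puts the split at "d", giving ranges that carry 60 and 40 reads instead of 20 and 80 for an even split by key count. The resulting keys could then be converted with Bytes.toBytes and passed as the split keys when creating the table, e.g. via HBaseAdmin.createTable(HTableDescriptor, byte[][]). Note that a single very hot row still ends up in exactly one region, which is the limitation mentioned above.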

Ian

On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:

> Hello,
>
> I'd like to collect opinions from HBase experts on query
> uniformity, and on whether any advanced technique currently exists
> in HBase to cope with the problems of query uniformity beyond just
> maintaining a uniform key distribution.
>
> I know we start with the statement that in order to scale queries, we
> need them uniformly distributed over key space. The next advice people
> get is to use a uniformly distributed key. Then, the thinking goes, the
> query load will also be uniformly distributed among regions.
>
> For what seems to be an embarrassingly long time I was missing the
> point, however, that using uniformly distributed keys does not equate
> to uniform distribution of the queries, since it doesn't account for
> skewness of queries over the key space itself. This skewness can be
> bad enough under some circumstances to create query hot spots in the
> cluster, which could have been avoided had region splits been
> balanced based on query loads rather than on data size per se. (sort
> of dynamic query distribution sampling in order to equalize the load
> similar to how TotalOrderPartitioner does random data sampling to
> build distribution of the key skewness in the incoming data).
>
> To cut a long story short, is the region size the only current HBase
> technique to balance load, esp. w.r.t. query load? Or perhaps there are
> some more advanced techniques to do that?
>
> Thank you very much.
> -Dmitriy