There are a few extremes to keep in mind when choosing a manual
1. Parallelism and balance at ingest time. You need to find a happy medium
between too few partitions (not enough parallelism) and too many partitions
(tablet server resource contention and inefficient indexes). Probably at
least one partition per tablet server being actively written to is good,
and you'll want to pre-split so they can be distributed evenly. Ten
partitions per tablet server is probably not too many -- I wouldn't expect
to see contention at that point.
2. Parallelism and balance at query time. At query time, you'll be
selecting a subset of all of the partitions that you've ever ingested into.
This subset should be bounded similarly to the concern addressed in #1, but
the bounds could be looser depending on the types of queries you want to
run. Lower latency queries would tend to favor only a few partitions per
3. Growth over time in partition size. Over time, you want partitions to be
bounded to less than about 10-100GB. This has to do with limiting the
maximum amount of time that a major compaction will take, and impacts
availability and performance in the extreme cases. At the same time, you
want partitions to be as large as possible so that their indexes are more
One strategy to optimize partition size would be to keep using each
partition until it is "full", then make another partition. Another would be
to allocate a certain number of partitions per day, and then only put data
in those partitions during that day. These strategies are also elastic, and
can be tweaked as the cluster grows.
In all of these cases, you will need a good load balancing strategy. The
default strategy of evening up the number of partitions per tablet server
is probably not sufficient, so you may need to write your own tablet load
balancer that is aware of your partitioning strategy.
On Tue, Oct 30, 2012 at 3:06 PM, Krishmin Rai <[EMAIL PROTECTED]> wrote:
> Hi All,
> We're working with an index table whose row is a shardId (an integer,
> like the wiki-search or IndexedDoc examples). I was just wondering what the
> right strategy is for choosing a number of partitions, particularly given a
> cluster that could potentially grow.
> If I simply set the number of shards equal to the number of slave nodes,
> additional nodes would not improve query performance (at least over the
> data already ingested). But starting with more partitions than slave nodes
> would result in multiple tablets per tablet server… I'm not really sure how
> that would impact performance, particularly given that all queries against
> the table will be batchscanners with an infinite range.
> Just wondering how others have addressed this problem, and if there are
> any performance rules of thumb regarding the ratio of tablets to tablet