Re: better presplitting
I would encourage the community to figure this out, for the following reason.
As other databases adopt Accumulo's security features, Accumulo's
primary distinguishing feature becomes performance.
Other NoSQL databases have let performance slide in favor of adding more features.
The gap between Accumulo performance and other NoSQL databases is growing.
There are many applications where Accumulo can do on one node what it would
take 20 or more nodes to do using another technology.
That said, the SQL and NewSQL communities have not been idle, and
there are some fairly high-performance competitors out there.
In the future, I believe Accumulo's primary performance competition
will come from the SQL and NewSQL communities.
The key to performance is optimization. The key to optimization
is how quickly you can do a performance measurement. The IEEE HPEC
paper was able to get its results because we were able to collect
an accurate performance number at scale in a few minutes.
However, for the largest results, pre-splitting took almost an hour.
If we are able to remove the pre-splitting bottleneck, we will
be able to test performance at scale very quickly, which will
allow us to maintain Accumulo's impressive performance.
P.S. I should add that the next biggest issue was the WAL, which
we had to turn off because it made things unstable at extreme
insert rates. I think if we solve the pre-splitting issue,
it will be a lot easier to attack the WAL issue.
On Sat, Jun 21, 2014 at 11:46:14AM -0400, Keith Turner wrote: