Some comments inlined below:
On Wed, Nov 28, 2012 at 2:49 PM, Chris Burrell <[EMAIL PROTECTED]> wrote:
> I am trialling Accumulo on a small (tiny) cluster and wondering how the
> best way to tune it would be. I have 1 master + 2 tservers. The master has
> 8Gb of RAM and the tservers have each 16Gb each.
> I have set the walogs size to be 2Gb with an external memory map of 9G.
> The ratio is still the defaulted to 3. I've also upped the heap sizes of
> each tserver to 2Gb heaps.
> I'm trying to achieve high-speed ingest via batch writers held on several
> other servers. I'm loading two separate tables.
> Here are some questions I have:
> - Does the config above sound sensible? or overkill?
Looks good to me, assuming you aren't doing other things (like map/reduce)
on the machines.
> - Is it preferable to have more servers with lower specs?
Yes. Mostly to get more drives.
> - Is this the best way to maximise use of the memory?
It's not bad. You may want larger block caches and a smaller in-memory
map. But if your workload is write-mostly, read-little, this is good.
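
As a rough sketch of the per-tserver memory budget: the 16 GB, 2 GB and 9 GB
figures are the ones quoted above, while the function name is made up for
illustration. It assumes the external memory map sits outside the JVM heap
and the block caches sit inside it, which is worth checking against your
Accumulo version.

```python
def leftover_ram_gb(total_ram_gb, heap_gb, external_map_gb):
    """RAM left on one tserver for the OS and anything else on the box.

    Assumption: the heap and the external memory map are the only two large
    consumers that belong to the tserver itself.
    """
    return total_ram_gb - heap_gb - external_map_gb


# Settings from the question: 16 GB RAM, 2 GB heap, 9 GB external map.
current = leftover_ram_gb(total_ram_gb=16, heap_gb=2, external_map_gb=9)

# Hypothetical read-friendlier split: bigger heap (room for larger block
# caches) paid for by a smaller in-memory map.
read_friendly = leftover_ram_gb(total_ram_gb=16, heap_gb=4, external_map_gb=6)

print(current, read_friendly)
```

Under that assumption, growing the block caches means growing the heap, so
the memory has to come out of the in-memory map or the leftover.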
> - Does the fact I have 3x2Gb walogs, means that the remaining 3Gb in the
> external memory map can be used while compactions occur?
Yes. You will want to increase the size or number of logs. With so few
servers, failures will hopefully be very rare. I would change the 3 to 8.
Having lots of logs on a tablet is no big deal if you have the disk space
and don't expect many failures.
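
The arithmetic behind that, as a simplified sketch: it assumes the "3" is the
per-tablet log count that triggers a minor compaction (I believe the property
is table.compaction.minor.logs.threshold, but check your version), and that
a gigabyte written to the logs takes up about a gigabyte in the memory map.
The function name is hypothetical.

```python
def walog_headroom_gb(walog_size_gb, logs_threshold, memory_map_gb):
    """Memory-map space still free when the log-count threshold is reached.

    Positive: the log count forces minor compactions first, and this much of
    the map is left to absorb writes while they run.
    Zero or negative: the map fills before the log threshold is reached.
    """
    return memory_map_gb - walog_size_gb * logs_threshold


# Current settings: 3 logs of 2 GB against a 9 GB map.
print(walog_headroom_gb(walog_size_gb=2, logs_threshold=3, memory_map_gb=9))

# Suggested change: raise the threshold from 3 to 8.
print(walog_headroom_gb(walog_size_gb=2, logs_threshold=8, memory_map_gb=9))
```

With 3 logs the threshold is hit at 6 GB, leaving 3 GB of the map free. With
8 logs the logs can cover 16 GB, so the 9 GB map becomes the limit instead.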
> - When minor compactions occur, does this halt ingest on that particular
> tablet? or tablet server?
Only if memory fills before the compactions finish. The monitor page will
indicate this by displaying a "hold time." When this happens, the tserver
will self-tune and start minor compactions earlier during future ingest.
> - I have pre-split the tables six-ways, but not entirely sure if that's
> preferable if I only have 2 servers while trying it out? Perhaps 2 ways
> might be better?
More splits are better: not because of the number of servers, but so that
more cores can be used concurrently. Aim for 50-100 tablets/node.
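
A minimal sketch of what that target means for pre-splitting, assuming row
keys begin with a uniformly distributed lower-case hex prefix (for example a
hash). That assumption and the helper name are mine, not something stated in
the question. N tablets need N - 1 split points.

```python
def even_hex_splits(num_tablets, width=4):
    """Return num_tablets - 1 evenly spaced split points as hex strings.

    Assumes row keys start with `width` uniformly distributed hex digits;
    num_tablets should not exceed 16 ** width, or split points will repeat.
    """
    if num_tablets < 2:
        return []
    space = 16 ** width
    return [format(i * space // num_tablets, "0{}x".format(width))
            for i in range(1, num_tablets)]


nodes = 2
tablets_per_node = 50  # low end of the 50-100 tablets/node guideline
splits = even_hex_splits(nodes * tablets_per_node)

print(len(splits))
print(splits[:3])
```

The resulting list could then be added to the table as split points. If the
real keys are not uniformly distributed, sample them and pick split points
from the sample instead.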
> - Does the batch upload through the shell client give significantly better
> performance stats?
Using map/reduce to create RFiles is more efficient. But it also increases
latency: you can only see the data once the whole file is loaded.
When a file is batch-loaded, its index is read, and the file is assigned to
matching tablets. With small indexes, you can batch-load terabytes in