Re: Tuning & Compactions
Some comments inlined below:

On Wed, Nov 28, 2012 at 2:49 PM, Chris Burrell <[EMAIL PROTECTED]> wrote:

> Hi
>
> I am trialling Accumulo on a small (tiny) cluster and wondering what the
> best way to tune it would be. I have 1 master + 2 tservers. The master has
> 8 GB of RAM and the tservers have 16 GB each.
>
> I have set the walog size to 2 GB with an external memory map of 9 GB.
> The ratio is still at the default of 3. I've also upped the heap size of
> each tserver to 2 GB.
>
> I'm trying to achieve high-speed ingest via batch writers held on several
> other servers. I'm loading two separate tables.
>
> Here are some questions I have:
> - Does the config above sound sensible? or overkill?
>

Looks good to me, assuming you aren't doing other things (like map/reduce)
on the machines.
> - Is it preferable to have more servers with lower specs?
>
Yes.  Mostly to get more drives.
> - Is this the best way to maximise use of the memory?
>
It's not bad.  You may want to have larger block caches and a smaller
in-memory map.  But if your workload is write-mostly, read-little, this is
good.
> - Does the fact that I have 3x2 GB walogs mean that the remaining 3 GB in
> the external memory map can be used while compactions occur?
>

Yes.  You will want to increase the size or number of logs.  With only that
many servers, failures will hopefully be very rare.  I would change the
number of logs from 3 to 8.  Having lots of logs on a tablet is no big deal
if you have the disk space and don't expect many failures.
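
The arithmetic behind that suggestion, as a rough sketch: a tablet gets
minor-compacted once it references more than the allowed number of
write-ahead logs, so the log count caps how much of the in-memory map can
fill before a flush is forced.  The helper names are made up for
illustration; the sizes are the ones from the question.  (I believe the
count is the per-table property table.compaction.minor.logs.threshold, but
treat that as an assumption and check the docs for your version.)

```python
def log_bound_gb(log_count, log_size_gb):
    """Data that can be written before the log count forces a minor compaction."""
    return log_count * log_size_gb


def headroom_gb(map_gb, log_count, log_size_gb):
    """In-memory map left for new writes once the log limit is reached.

    A value <= 0 means the map fills first, so memory (not the log count)
    decides when minor compactions start.
    """
    return map_gb - log_bound_gb(log_count, log_size_gb)


MAP_GB, LOG_GB = 9, 2  # 9 GB in-memory map, 2 GB per walog (from the question)

print(headroom_gb(MAP_GB, 3, LOG_GB))  # 3 logs: 3 GB left while compacting
print(headroom_gb(MAP_GB, 8, LOG_GB))  # 8 logs: negative, the map fills first
```

With 8 logs the log limit no longer kicks in before the map is full, which
is the point of raising it.
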
> - When minor compactions occur, does this halt ingest on that particular
> tablet? or tablet server?
>
Only if memory fills before the compactions finish. The monitor page will
indicate this by displaying "hold time."  When this happens the tserver
will self-tune and start minor compactions earlier with future ingest.
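
A back-of-the-envelope way to see when ingest would be held: compare how
long the free part of the in-memory map lasts at the current ingest rate
with how long the minor compaction takes.  All rates below are made-up
numbers, not measurements, and this toy model ignores the self-tuning
described above.

```python
def will_hold(free_map_gb, ingest_gb_per_s, flush_gb, flush_gb_per_s):
    """True if memory fills before the minor compaction finishes."""
    seconds_until_full = free_map_gb / ingest_gb_per_s
    seconds_to_flush = flush_gb / flush_gb_per_s
    return seconds_until_full < seconds_to_flush


# 3 GB free while 6 GB is being flushed at a hypothetical 0.2 GB/s (~30 s).
print(will_hold(3, 0.05, 6, 0.2))  # slow ingest: memory lasts ~60 s
print(will_hold(3, 0.15, 6, 0.2))  # fast ingest: memory lasts ~20 s
```
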
> - I have pre-split the tables six-ways, but not entirely sure if that's
> preferable if I only have 2 servers while trying it out? Perhaps 2 ways
> might be better?
>
No, fewer splits would not be better.  The point of splitting is not the
number of servers but being able to use more cores concurrently.  Aim for
50-100 tablets/node.
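
For 2 tservers that target works out to 100-200 tablets.  A sketch that
generates evenly spaced split points, assuming row keys start with a
uniformly distributed two-digit hex prefix (that key layout is an assumption
for illustration; it is not something stated in the question):

```python
def hex_split_points(num_tablets, width=2):
    """Evenly spaced hex prefixes; n tablets need n - 1 split points."""
    space = 16 ** width
    if not 1 <= num_tablets <= space:
        raise ValueError("num_tablets must be between 1 and %d" % space)
    return ["%0*x" % (width, (i * space) // num_tablets)
            for i in range(1, num_tablets)]


nodes, tablets_per_node = 2, 50  # low end of the 50-100 tablets/node target
splits = hex_split_points(nodes * tablets_per_node)
print(len(splits), splits[:3], splits[-1])
```

The list could be written one split per line to a file and applied to the
table, e.g. with the shell's addsplits command.
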
> - Does the batch upload through the shell client give significantly better
> performance stats?
>

Using map/reduce to create RFiles is more efficient.  But it also increases
latency: you can only see the data once the whole file is loaded.

When a file is batch-loaded, its index is read, and the file is assigned to
matching tablets.  With small indexes, you can batch-load terabytes in
minutes.
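
A sketch of why this is cheap: only the first and last row of the file
(known from its index) are needed to find the tablets it overlaps, and no
data is rewritten.  This is a simplified model of my own, not Accumulo's
actual import code; it assumes tablets are described by sorted split points,
with each tablet covering rows up to and including its end row.

```python
from bisect import bisect_left


def tablets_for_file(split_points, first_row, last_row):
    """Indexes of the tablets whose row range overlaps [first_row, last_row].

    Tablet i covers (split_points[i-1], split_points[i]]; the last tablet
    has no upper bound.
    """
    lo = bisect_left(split_points, first_row)
    hi = bisect_left(split_points, last_row)
    return list(range(lo, hi + 1))


splits = ["g", "n", "t"]  # four tablets: (-inf,g] (g,n] (n,t] (t,+inf)
print(tablets_for_file(splits, "a", "c"))  # file falls within the first tablet
print(tablets_for_file(splits, "h", "z"))  # file spans the last three tablets
```
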

-Eric