Accumulo >> mail # user >> Tuning & Compactions


Chris Burrell 2012-11-28, 19:49
Eric Newton 2012-11-28, 20:31
Re: Tuning & Compactions
Thanks for all the comments below. Very helpful!

On the last point, around "small indexes": do you mean a small set of keys,
but with many column families and column qualifiers? What order of magnitude
would you consider small? A few million keys? A few billion? Or, put another
way, keys with tens or hundreds of column families/qualifiers?

I have another question around the use of column families and qualifiers.
Would it be good or bad practice to have many column families/qualifiers
per row? I was wondering if there would be any point in using these almost
as extensions to the keys, i.e. the column family/qualifier would end up
being the last part of the key. I understand column families can also be
used to control how the data gets stored, to maximize scan performance.
Would there be any drawbacks to having many of these?
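(As a side note on the key-extension idea: Accumulo keeps entries sorted by
row, then column family, then column qualifier, so families/qualifiers do
behave like trailing key components. A minimal sketch of that sort order in
plain Java, with no Accumulo dependency — the rows/families/values here are
made up for illustration:)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class KeyOrderSketch {
    // Model an Accumulo key as row + family + qualifier, NUL-separated,
    // so lexicographic comparison checks the parts in that order.
    static String key(String row, String fam, String qual) {
        return row + "\u0000" + fam + "\u0000" + qual;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add(key("user1", "event", "2012-11-28"));
        keys.add(key("user1", "attr", "name"));
        keys.add(key("user1", "event", "2012-11-27"));
        Collections.sort(keys); // an Accumulo table is a sorted map over such keys
        // All "attr" entries come before "event", and qualifiers sort within a
        // family, so the family/qualifier pair acts like the tail of the key.
        for (String k : keys) {
            System.out.println(k.replace('\u0000', '/'));
        }
    }
}
```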

Chris

On 28 November 2012 20:31, Eric Newton <[EMAIL PROTECTED]> wrote:

> Some comments inlined below:
>
> On Wed, Nov 28, 2012 at 2:49 PM, Chris Burrell <[EMAIL PROTECTED]>wrote:
>
>> Hi
>>
>> I am trialling Accumulo on a small (tiny) cluster and wondering how best
>> to tune it. I have 1 master + 2 tservers. The master has 8 GB of RAM and
>> the tservers have 16 GB each.
>>
>> I have set the walog size to 2 GB with an external memory map of 9 GB.
>> The ratio is still the default of 3. I've also upped each tserver's heap
>> to 2 GB.
>>
>> I'm trying to achieve high-speed ingest via batch writers held on several
>> other servers. I'm loading two separate tables.
>>
>> Here are some questions I have:
>> - Does the config above sound sensible? or overkill?
>>
>
> Looks good to me, assuming you aren't doing other things (like map/reduce)
> on the machines.
>
>
>> - Is it preferable to have more servers with lower specs?
>>
> Yes.  Mostly to get more drives.
>
>
>> - Is this the best way to maximise use of the memory?
>>
> It's not bad.  You may want to have larger block caches and a smaller
> in-memory map.  But if you want to write-mostly, read-little, this is good.
>
>
>> - Does the fact that I have 3x2 GB walogs mean that the remaining 3 GB in
>> the external memory map can be used while compactions occur?
>>
>
> Yes.  You will want to increase the size or number of logs.  With that
> many servers, failures will hopefully be very rare.  I would go with
> changing 3 to 8.  Having lots of logs on a tablet is no big deal if you
> have disk space, and don't expect many failures.
>
>
>> - When minor compactions occur, does this halt ingest on that particular
>> tablet? or tablet server?
>>
> Only if memory fills before the compactions finish. The monitor page will
> indicate this by displaying "hold time."  When this happens, the tserver
> will self-tune and start minor compactions earlier with future ingest.
>
>
>> - I have pre-split the tables six-ways, but not entirely sure if that's
>> preferable if I only have 2 servers while trying it out? Perhaps 2 ways
>> might be better?
>>
> Not for that reason, but to be able to use more cores concurrently.  Aim
> for 50-100 tablets/node.
>
>
>> - Does the batch upload through the shell client give significantly
>> better performance stats?
>>
>
> Using map/reduce to create RFiles is more efficient. But it also increases
> latency: you can only see the data once the whole file is loaded.
>
> When a file is batch-loaded, its index is read, and the file is assigned
> to matching tablets.  With small indexes, you can batch-load terabytes in
> minutes.
>
> -Eric
>
>
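For anyone mapping the advice above onto actual settings, a sketch of the
relevant Accumulo shell commands (property names as in the 1.4-era
configuration docs; the table name and paths are hypothetical and the values
illustrative, not recommendations):

```shell
# In the Accumulo shell. System-wide tserver settings:
config -s tserver.memory.maps.max=9G    # external in-memory map size
config -s tserver.walog.max.size=2G     # size of each write-ahead log

# Per-table: let a tablet accumulate more logs before forcing a minor
# compaction (the "changing 3 to 8" above):
config -t mytable -s table.compaction.minor.logs.threshold=8

# Bulk-loading RFiles produced by map/reduce:
importdirectory /bulkload/files /bulkload/failures true
```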