Accumulo >> mail # user >> Tuning & Compactions


Chris Burrell 2012-11-28, 19:49
Eric Newton 2012-11-28, 20:31
Re: Tuning & Compactions
Thanks for all the comments below. Very helpful!

On the last point, about "small indexes": do you mean a small set of keys,
each with many column families and column qualifiers? What order of
magnitude would you consider small: a few million keys, or a few billion?
Or, put another way, keys with 10s or 100s of column families/qualifiers?

I have another question about the use of column families and qualifiers.
Would it be good or bad practice to have many column families/qualifiers
per row? I was wondering whether there would be any point in using them
almost as extensions to the keys, i.e. the column family/qualifier would
end up being the last part of the key. I understand column families can
also be used to control how the data gets stored, to speed up scanning.
Are there drawbacks to having many of them?
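
For reference, a minimal sketch of why the family/qualifier can act as the tail of the key. Accumulo sorts entries by row, then column family, then column qualifier (visibility and timestamp are left out here). The sample rows, families and the scan() helper are hypothetical, and plain Python tuples stand in for real keys.

```python
# Simplified model of the key ordering: row, then column family, then
# column qualifier. Visibility and timestamp are deliberately omitted.
entries = [
    ("user42", "profile", "name"),
    ("user07", "event", "2012-11-28"),
    ("user42", "event", "2012-11-27"),
    ("user07", "event", "2012-11-01"),
    ("user42", "event", "2012-11-28"),
]


def scan(sorted_entries, row, family=None):
    """Return the entries for one row, optionally limited to one family.

    Hypothetical helper: a real scan would seek to a range instead of
    filtering the whole table.
    """
    return [e for e in sorted_entries
            if e[0] == row and (family is None or e[1] == family)]


# Tuple comparison is lexicographic, which mirrors the ordering above.
table = sorted(entries)
user42_events = scan(table, "user42", "event")
```

In this model all entries for a row are adjacent, and within the row they are grouped by family and ordered by qualifier, so a qualifier behaves like a trailing key component.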

Chris

On 28 November 2012 20:31, Eric Newton <[EMAIL PROTECTED]> wrote:

> Some comments inlined below:
>
> On Wed, Nov 28, 2012 at 2:49 PM, Chris Burrell <[EMAIL PROTECTED]> wrote:
>
>> Hi
>>
>> I am trialling Accumulo on a small (tiny) cluster and wondering what the
>> best way to tune it would be. I have 1 master + 2 tservers. The master has
>> 8GB of RAM and the tservers have 16GB each.
>>
>> I have set the walog size to 2GB with an external memory map of 9GB.
>> The ratio is still at the default of 3. I've also upped the heap size of
>> each tserver to 2GB.
>>
>> I'm trying to achieve high-speed ingest via batch writers held on several
>> other servers. I'm loading two separate tables.
>>
>> Here are some questions I have:
>> - Does the config above sound sensible? or overkill?
>>
>
> Looks good to me, assuming you aren't doing other things (like map/reduce)
> on the machines.
>
>
>> - Is it preferable to have more servers with lower specs?
>>
> Yes.  Mostly to get more drives.
>
>
>> - Is this the best way to maximise use of the memory?
>>
> It's not bad.  You may want to have larger block caches and a smaller
> in-memory map.  But if you want to write-mostly, read-little, this is good.
>
>
>> - Does the fact that I have 3x2GB walogs mean that the remaining 3GB in
>> the external memory map can be used while compactions occur?
>>
>
> Yes.  You will want to increase the size or number of logs.  With that
> many servers, failures will hopefully be very rare.  I would go with
> changing 3 to 8.  Having lots of logs on a tablet is no big deal if you
> have disk space, and don't expect many failures.
>
>
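
The arithmetic behind this exchange can be sketched as follows. This is a rough model, not how the tserver actually accounts for memory: it assumes one byte written to the walogs per byte held in the in-memory map, and it takes the "3" to be the number of walogs allowed before a minor compaction is forced (believed to be the table.compaction.minor.logs.threshold property). The function name is hypothetical.

```python
GB = 1024 ** 3


def first_limit(map_bytes, walog_bytes, log_threshold):
    """Say which limit is reached first during ingest, and the headroom.

    Returns ("walogs", spare_map_bytes) when the log count is exhausted
    while the in-memory map still has room, otherwise ("memory map", 0).
    Simplification: one logged byte per byte held in the map.
    """
    logged = walog_bytes * log_threshold
    if logged < map_bytes:
        return ("walogs", map_bytes - logged)
    return ("memory map", 0)


# Settings from the thread: 9GB map, 2GB walogs, threshold of 3.
default_case = first_limit(9 * GB, 2 * GB, 3)
# Same settings with the threshold raised from 3 to 8.
raised_case = first_limit(9 * GB, 2 * GB, 8)
```

Under these assumptions the default case hits the log limit with 3GB of map to spare, while a threshold of 8 covers 16GB of logs, so the 9GB map fills first.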
>> - When minor compactions occur, does this halt ingest on that particular
>> tablet? or tablet server?
>>
> Only if memory fills before the compactions finish. The monitor page will
> indicate this by displaying "hold time."  When this happens the tserver
> will self-tune and start minor compactions earlier with future ingest.
>
>
>> - I have pre-split the tables six ways, but am not sure that's
>> preferable when I only have 2 servers while trying it out. Perhaps 2 ways
>> might be better?
>>
> Not for that reason, but to be able to use more cores concurrently.  Aim
> for 50-100 tablets/node.
>
>
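
A minimal sketch of generating split points for that target, assuming row IDs start with a uniformly distributed hex prefix (for example a hash). That assumption is not from the thread, and the function name and prefix width are hypothetical. The resulting list could then be supplied to the table as split points.

```python
def hex_split_points(num_tablets, width=4):
    """Return num_tablets - 1 evenly spaced, zero-padded hex prefixes.

    Assumes row IDs begin with a uniformly distributed hex string of at
    least `width` characters; skewed keys would need different splits.
    """
    if num_tablets < 1:
        raise ValueError("num_tablets must be at least 1")
    space = 16 ** width
    return [format(i * space // num_tablets, "0{}x".format(width))
            for i in range(1, num_tablets)]


tservers = 2
tablets_per_node = 50  # lower end of the 50-100 range suggested above
splits = hex_split_points(tservers * tablets_per_node)
```

With 2 tservers at 50 tablets each, this yields 99 split points, i.e. 100 tablets.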
>> - Does the batch upload through the shell client give significantly
>> better performance stats?
>>
>
> Using map/reduce to create RFiles is more efficient. But it also increases
> latency: you can only see the data when the whole file is loaded.
>
> When a file is batch-loaded, its index is read, and the file is assigned
> to matching tablets.  With small indexes, you can batch-load terabytes in
> minutes.
>
> -Eric
>
>
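
The file-to-tablet assignment described above can be sketched as below. This is a simplified model, not Accumulo's bulk-import code: it looks only at a file's first and last row rather than at its index entries, and treats tablet i as covering the rows after split i-1 up to and including split i. The function name and sample split points are hypothetical.

```python
from bisect import bisect_left


def tablets_for_file(split_points, first_row, last_row):
    """Return indexes of tablets overlapping rows first_row..last_row.

    split_points must be sorted. Tablet i covers rows greater than
    split_points[i-1] and up to split_points[i] inclusive; the last
    tablet has no upper bound.
    """
    lo = bisect_left(split_points, first_row)
    hi = bisect_left(split_points, last_row)
    return list(range(lo, hi + 1))


table_splits = ["g", "n", "t"]  # four tablets
assigned = tablets_for_file(table_splits, "h", "p")
```

Only the row boundaries are consulted, never the data itself, which is consistent with the point that a small index lets very large files be assigned quickly.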