Eric, all this info would be great as a FAQ on the website. :)
Christopher L Tubbs II
On Wed, Jul 24, 2013 at 8:35 AM, Eric Newton <[EMAIL PROTECTED]> wrote:
> Assuming that 5 billion records means 5 billion Key/Values, this is nearly
> 100K K-V/sec/node, which isn't so bad. If the key/values are small and
> uniformly distributed, 200K is closer to the rate you can expect given
> decent drives.
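
The per-node figure can be checked from the numbers in the original question (5 billion entries, about 70 minutes, 12 nodes). A quick sketch in Python, assuming one record is one Key/Value and that the load is spread evenly across the tablet servers; the function names are only illustrative:

```python
def per_node_rate(total_entries, minutes, nodes):
    """Average Key/Value entries written per second per node."""
    seconds = minutes * 60
    return total_entries / seconds / nodes


def minutes_at_rate(total_entries, rate_per_node, nodes):
    """Minutes needed to load total_entries at a given per-node rate."""
    return total_entries / (rate_per_node * nodes) / 60


# Numbers taken from the original question.
rate = per_node_rate(5_000_000_000, 70, 12)
print(f"{rate:,.0f} K-V/sec/node")          # 99,206 K-V/sec/node

# The same load at the 200K K-V/sec/node rate mentioned above.
faster = minutes_at_rate(5_000_000_000, 200_000, 12)
print(f"{faster:.1f} minutes at 200K/sec")  # 34.7 minutes at 200K/sec
```

So at 200K K-V/sec/node the same load would finish in roughly half the time.
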
> If you are only concerned with ingest, crank up the size of the in-memory
> map and increase the compaction ratio from 3 to 5 (or even as high as 10).
> This will reduce the number of re-writes of your data. If you don't care
> about possible data loss, turn off the write-ahead log on your table, or
> reduce the replication factor for the write-ahead log.
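
As a rough sketch of what those changes might look like, the snippet below only builds the Accumulo shell commands as strings; it does not talk to a cluster. The property names are the ones I believe apply to the 1.5 line (tserver.memory.maps.max, table.compaction.major.ratio, table.walog.enabled, tserver.wal.replication), so please check them against the documentation for your version. The table name and the 4G map size are made-up placeholders, not recommendations.

```python
def ingest_tuning_commands(table, map_size="4G", major_ratio=5,
                           disable_wal=False, wal_replication=None):
    """Build Accumulo shell 'config' commands for an ingest-heavy table.

    All values are illustrative assumptions; tune them for your hardware.
    """
    cmds = [
        # Larger in-memory map: fewer, bigger minor compactions.
        f"config -s tserver.memory.maps.max={map_size}",
        # Higher major compaction ratio: data is rewritten less often.
        f"config -t {table} -s table.compaction.major.ratio={major_ratio}",
    ]
    if disable_wal:
        # Only if losing recent writes on a tserver failure is acceptable.
        cmds.append(f"config -t {table} -s table.walog.enabled=false")
    elif wal_replication is not None:
        # Otherwise, optionally keep the log but replicate it less.
        cmds.append(f"config -s tserver.wal.replication={wal_replication}")
    return cmds


# "ingest_test" is a hypothetical table name.
for cmd in ingest_tuning_commands("ingest_test", disable_wal=True):
    print(cmd)
```

The commands are meant to be pasted into the Accumulo shell; a larger in-memory map also has to fit in the memory you actually have on each server.
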
> Make sure your table is pre-split, if possible, to maximize parallel
> performance during initial ingest. Aim for 10-50 tablets per server.
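
With 12 servers, 10-50 tablets per server works out to roughly 120-600 tablets. A minimal sketch for generating evenly spaced split points, assuming (hypothetically) that row IDs start with a uniformly distributed, fixed-width hex prefix; if your rows look different, the split points need to follow your real key distribution instead:

```python
def hex_split_points(num_tablets, width=4):
    """Return num_tablets - 1 evenly spaced, zero-padded hex split points."""
    if num_tablets < 2:
        return []
    space = 16 ** width  # number of distinct prefixes of this width
    return [format(i * space // num_tablets, f"0{width}x")
            for i in range(1, num_tablets)]


servers = 12
tablets_per_server = 20  # anywhere in the suggested 10-50 range
splits = hex_split_points(servers * tablets_per_server)

print(len(splits), "split points")  # 239 split points
print(splits[:3])
```

The resulting list can be written to a file, one split per line, and applied to the empty table before the load starts, for example with the shell's addsplits command.
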
> If the latency of waiting for the data to be prepped does not bother you, it
> is almost always more efficient to use bulk ingest. Can you wait 30
> minutes to queue up enough data, and then another 5-15 minutes for the map/reduce
> job to produce the RFiles?
> There's a fair amount of overhead to starting a mapper. You may want to
> experiment with larger map jobs.
> On Wed, Jul 24, 2013 at 2:26 AM, Dickson, Matt MR
> <[EMAIL PROTECTED]> wrote:
>> I'm trying to improve ingest performance on a 12 node test cluster.
>> Currently I'm loading 5 billion records in approximately 70 minutes, which
>> seems excessive. Monitoring the job, I see 2600 map tasks (there is no
>> reduce stage, just the mapper) with 288 running at any one time.
>> Performance seems slowest in the early stages of the job, before any minor
>> or major compactions occur. Each server has 48 GB of memory, and currently the
>> Accumulo settings are based on the 3GB settings in the example config
>> directory, i.e. tserver.memory.maps.max=1G, tserver.cache.data.size=50M
>> and tserver.cache.index.size=512M. All other settings on the table are
>> 1. What is Accumulo doing in the initial stage of a load and which
>> configurations should I focus on to improve this?
>> 2. At what ingest rate should I consider using the bulk ingest process
>> with rfiles?