I'm trying to improve ingest performance on a 12-node test cluster. Currently I'm loading 5 billion records in approximately 70 minutes, which seems excessive. Monitoring the job, there are 2600 map tasks (there is no reduce stage, just the mapper) with 288 running at any one time. The performance seems slowest in the early stages of the job, before any minor or major compactions occur. Each server has 48 GB of memory, and currently the Accumulo settings are based on the 3GB settings in the example config directory, i.e. tserver.memory.maps.max=1GB, tserver.cache.data.size=50M and tserver.cache.index.size=512M. All other settings on the table are default.
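For reference, the figures above work out to roughly 1.19 million records per second across the cluster, or about 99,000 per second per node. The short sketch below shows the arithmetic. The function name `ingest_stats` is only illustrative, and it assumes the load is spread evenly across the 12 nodes and that all 288 map slots stay busy for the whole job, which is a simplification.

```python
def ingest_stats(records, minutes, nodes, map_tasks, concurrent_tasks):
    """Back-of-the-envelope ingest rates from the job figures quoted above.

    Assumes an even spread of work across nodes and fully occupied map slots.
    """
    seconds = minutes * 60
    cluster_rate = records / seconds  # records per second, whole cluster
    return {
        "cluster_rate": cluster_rate,
        "per_node_rate": cluster_rate / nodes,
        "per_mapper_rate": cluster_rate / concurrent_tasks,
        "records_per_task": records / map_tasks,
        "slots_per_node": concurrent_tasks / nodes,
    }


# Figures from this job: 5 billion records, 70 minutes, 12 nodes,
# 2600 map tasks, 288 running concurrently.
stats = ingest_stats(5_000_000_000, 70, 12, 2600, 288)
for name, value in stats.items():
    print(f"{name}: {value:,.0f}")
```

These are averages over the whole run; since the early part of the job is the slowest, the rate at that stage is presumably lower than these numbers suggest.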
1. What is Accumulo doing in the initial stage of a load and which configurations should I focus on to improve this?
2. At what ingest rate should I consider using the bulk ingest process with RFiles?