RE: Advice on increasing ingest rate


How many threads are you using in the AccumuloOutputFormat? What is your
latency set to?
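
(For anyone tuning these: both knobs live on BatchWriterConfig, at least in
Accumulo 1.5/1.6. A minimal sketch; the values are illustrative, not
recommendations:)

    import java.util.concurrent.TimeUnit;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
    import org.apache.hadoop.mapreduce.Job;

    public class IngestJobSetup {
        // Apply illustrative BatchWriter settings to a MapReduce ingest job.
        static void configureWriter(Job job) {
            BatchWriterConfig bw = new BatchWriterConfig()
                .setMaxWriteThreads(8)                // threads sending mutations to tservers
                .setMaxLatency(60, TimeUnit.SECONDS)  // flush buffered mutations at least this often
                .setMaxMemory(100 * 1024 * 1024L);    // 100 MB client-side buffer (item 4 below)
            AccumuloOutputFormat.setBatchWriterOptions(job, bw);
        }
    }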

 

From: Adam Fuchs [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 08, 2014 5:36 PM
To: [EMAIL PROTECTED]
Subject: Re: Advice on increasing ingest rate

 

Mike,

 

What version of Accumulo are you using, how many tablets do you have, and
how many threads are you using for minor and major compaction pools? Also,
how big are the keys and values that you are using?

 

Here are a few settings that may help you (a combined configuration sketch
follows the list):

1. WAL replication factor (tserver.wal.replication). This defaults to 3
replicas (the HDFS default), but if you set it to 2 it will give you a
performance boost without a huge hit to reliability.

2. Ingest buffer size (tserver.memory.maps.max), also known as the in-memory
map size. Increasing this generally improves the efficiency of minor
compactions and reduces the number of major compactions that will be
required down the line. 4-8 GB is not unreasonable.

3. Make sure your WAL settings are such that the size of a log
(tserver.walog.max.size) multiplied by the number of active logs
(table.compaction.minor.logs.threshold) is greater than the in-memory map
size. You probably want to accomplish this by bumping up the number of
active logs.

4. Increase the buffer size on the BatchWriter that the clients use. This
can be done with the setBatchWriterOptions method on the
AccumuloOutputFormat.
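
(Taken together, items 1-3 can be applied through the Java API as well as
through accumulo-site.xml or the shell. A sketch, assuming Accumulo 1.5/1.6;
the instance, credentials, and table name are hypothetical placeholders, and
changing tserver.memory.maps.max typically requires a tserver restart to
take effect:)

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    public class TuneIngest {
        public static void main(String[] args) throws Exception {
            // Hypothetical instance and credentials -- substitute your own.
            Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
                .getConnector("root", new PasswordToken("secret"));

            // 1. WAL replication: 2 replicas instead of the HDFS default of 3.
            conn.instanceOperations().setProperty("tserver.wal.replication", "2");

            // 2. In-memory map size: 4 GB (per the 4-8 GB suggestion above).
            conn.instanceOperations().setProperty("tserver.memory.maps.max", "4G");

            // 3. Keep walog.max.size * minor.logs.threshold above the map size:
            //    1G logs * 5 active logs = 5G > 4G.
            conn.instanceOperations().setProperty("tserver.walog.max.size", "1G");
            conn.tableOperations().setProperty("mytable",
                "table.compaction.minor.logs.threshold", "5");
        }
    }

(Item 4 corresponds to the setMaxMemory call in the BatchWriterConfig sketch
near the top of the thread.)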

 

Cheers,

Adam

 

 

On Tue, Apr 8, 2014 at 4:47 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:

Hello,

 

We have an ingest process that operates via MapReduce, processing a large
set of XML files and inserting mutations based on that data into a set of
tables.

 

On a 5-node cluster (each node has 64GB RAM, 20 cores, and ~600GB SSD) I get
400k inserts per second with 20 mapper tasks running concurrently.
Increasing the number of concurrent mapper tasks to 40 doesn't have any
effect (besides causing a little more backup in compactions).

 

I've increased table.compaction.major.ratio and raised the number of
concurrent compactions allowed for both minor and major compactions, but
each of those had only a negligible impact on ingest rates.
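
(For reference, those settings can be expressed through the same setProperty
API; a sketch with illustrative values, property names per the Accumulo user
manual:)

    import org.apache.accumulo.core.client.Connector;

    class CompactionTuning {
        // Illustrative values only; defaults are ratio=3, 3 major and 4 minor threads.
        static void tune(Connector conn, String table) throws Exception {
            conn.tableOperations().setProperty(table, "table.compaction.major.ratio", "5");
            conn.instanceOperations().setProperty("tserver.compaction.major.concurrent.max", "8");
            conn.instanceOperations().setProperty("tserver.compaction.minor.concurrent.max", "8");
        }
    }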

 

Any advice on other settings I can tweak to get things to move more quickly?
Or is 400k/second a reasonable ingest rate? Are we at a point where we
should consider generating RFiles like the bulk ingest example?
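
(For context, the bulk ingest path writes sorted RFiles with
AccumuloFileOutputFormat in a MapReduce job and then registers them with the
table in one step. A minimal sketch of the import side, assuming the RFiles
already exist; the table name and HDFS paths are hypothetical:)

    import org.apache.accumulo.core.client.Connector;

    class BulkImport {
        // Hand a directory of sorted RFiles to the tablet servers in one operation.
        static void importFiles(Connector conn) throws Exception {
            conn.tableOperations().importDirectory(
                "mytable",             // destination table (hypothetical)
                "/tmp/bulk/files",     // HDFS dir of RFiles produced by the MR job
                "/tmp/bulk/failures",  // must exist and be empty; failed files land here
                false);                // setTime=false keeps timestamps from the files
        }
    }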

 

Thanks in advance for any advice.

 

Mike