Accumulo >> mail # user >> Improving ingest performance [SEC=UNCLASSIFIED]


Re: Improving ingest performance [SEC=UNCLASSIFIED]
Eric, all this info would be great as a FAQ on the website. :)

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
On Wed, Jul 24, 2013 at 8:35 AM, Eric Newton <[EMAIL PROTECTED]> wrote:
> Assuming that 5 billion records means 5 billion Key/Values, this is nearly
> 100K K-V/sec/node, which isn't so bad.  If the key/values are small and
> uniformly distributed, 200K is closer to the rate you can expect given
> decent drives.
>
> If you are only concerned with ingest, crank up the size of the in-memory
> map and increase the compaction ratio from 3 to 5 (or even as high as 10).
> This will reduce the number of re-writes of your data.  If you don't care
> about possible data loss, turn off the write-ahead log on your table, or
> reduce the replication factor for the write-ahead log.
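[Editor's note: the tuning Eric describes can be sketched as accumulo shell commands. This is an illustrative sketch, not part of the original mail; the values and the table name "mytable" are placeholders to adjust for your cluster.]

```shell
# In the accumulo shell (values are illustrative):
config -s tserver.memory.maps.max=4G                   # larger in-memory map (tserver restart required)
config -t mytable -s table.compaction.major.ratio=5    # raise from the default of 3; up to 10
# Only if possible data loss is acceptable:
config -t mytable -s table.walog.enabled=false         # disable the write-ahead log for this table
config -s tserver.wal.replication=2                    # or lower the WAL replication factor instead
```

Raising the compaction ratio trades read performance (more files per tablet) for fewer re-writes during ingest, which is the trade-off Eric is pointing at.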
>
> Make sure your table is pre-split, if possible, to maximize parallel
> performance during initial ingest.  Aim for 10-50 tablets per server.
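[Editor's note: a minimal sketch of pre-splitting, assuming keys are uniformly distributed over a hex keyspace. For a 12-node cluster at roughly 10 tablets per server, that is about 120 tablets, i.e. 119 split points; the table name is a placeholder.]

```shell
# Generate 119 evenly spaced two-hex-digit split points (120 tablets total).
for i in $(seq 1 119); do
  printf '%02x\n' $((i * 256 / 120))
done > /tmp/splits.txt

# Then, in the accumulo shell:
#   createtable mytable
#   addsplits -t mytable -sf /tmp/splits.txt
```

If your keys are not uniform, derive the split points from a sample of real data instead of an even spacing.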
>
> If the latency of waiting for the data to be prepped does not bother you, it
> is almost always more efficient to use bulk ingest.   Can you wait 30
> minutes to queue up enough data, and then another 5-15 for the map/reduce
> job to produce the RFiles?
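[Editor's note: the bulk-ingest path Eric refers to looks roughly like the following. The jar name and HDFS paths are placeholders; the MapReduce job would typically write RFiles via AccumuloFileOutputFormat.]

```shell
# 1. Run the map/reduce job that sorts the queued data and writes RFiles:
hadoop jar my-ingest-job.jar com.example.IngestJob /data/queued /bulk/rfiles

# 2. Import the finished RFiles into the table from the accumulo shell:
#    (second path collects any files that fail to import; "false" = do not set time)
#   table mytable
#   importdirectory /bulk/rfiles /bulk/failures false
```

Bulk import moves the files into place rather than pushing key/values through the tservers, which is why it wins once you can tolerate the batching latency.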
>
> There's a fair amount of overhead to starting a mapper.  You may want to
> experiment with larger map jobs.
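[Editor's note: one way to get fewer, larger mappers in a Hadoop 1.x-era job is to raise the minimum split size; the exact knob depends on the InputFormat your job uses, so treat this as an assumption to verify.]

```shell
# ~512 MB minimum splits -> far fewer than 2600 map tasks for the same input
hadoop jar my-ingest-job.jar com.example.IngestJob \
  -Dmapred.min.split.size=536870912 \
  /data/queued /bulk/rfiles
```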
>
> -Eric
>
>
> On Wed, Jul 24, 2013 at 2:26 AM, Dickson, Matt MR
> <[EMAIL PROTECTED]> wrote:
>>
>> UNCLASSIFIED
>>
>> Hi,
>>
>> I'm trying to improve ingest performance on a 12 node test cluster.
>> Currently I'm loading 5 billion records in approximately 70 minutes which
>> seems excessive.  Monitoring the job there are 2600 map jobs (there is no
>> reduce stage, just the mapper) with 288 running at any one time.  The
>> performance seems slowest in the early stages of the job prior to min or
>> maj compactions occurring.  Each server has 48 GB memory and currently the
>> Accumulo settings are based on the 3GB settings in the example config
>> directory, ie tserver.memory.maps.max = 1GB, tserver.cache.data.size=50M
>> and tserver.cache.index.size=512M.  All other settings on the table are
>> default.
>>
>> Questions.
>>
>> 1. What is Accumulo doing in the initial stage of a load and which
>> configurations should I focus on to improve this?
>> 2. At what ingest rate should I consider using the bulk ingest process
>> with rfiles?
>>
>> Thanks
>> Matt
>>
>> IMPORTANT: This email remains the property of the Department of Defence
>> and is subject to the jurisdiction of section 70 of the Crimes Act 1914. If
>> you have received this email in error, you are requested to contact the
>> sender and delete the email.
>
>