Accumulo user mailing list: Improving ingest performance [SEC=UNCLASSIFIED]


Re: Improving ingest performance [SEC=UNCLASSIFIED]
Christopher 2013-07-25, 02:16
Eric, all this info would be great as a FAQ on the website. :)

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
On Wed, Jul 24, 2013 at 8:35 AM, Eric Newton <[EMAIL PROTECTED]> wrote:
> Assuming that 5 billion records means 5 billion Key/Values, this is nearly
> 100K K-V/sec/node, which isn't so bad.  If the key/values are small and
> uniformly distributed, 200K is closer to the rate you can expect given
> decent drives.
>
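> (For scale: 5 billion key/values over 70 minutes is about 1.19M K-V/sec for
> the whole cluster, or roughly 99K K-V/sec for each of the 12 nodes.)
>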
> If you are only concerned with ingest, crank up the size of the in-memory
> map and increase the compaction ratio from 3 to 5 (or even as high as 10).
> This will reduce the number of re-writes of your data.  If you don't care
> about possible data loss, turn off the write-ahead log on your table, or
> reduce the replication factor for the write-ahead log.
>
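> A minimal sketch of setting those knobs through the Java client API (the
> instance name, ZooKeeper host, credentials, and table name below are
> placeholders; property names are from the 1.5 configuration docs):
>
>   import org.apache.accumulo.core.client.Connector;
>   import org.apache.accumulo.core.client.ZooKeeperInstance;
>   import org.apache.accumulo.core.client.security.tokens.PasswordToken;
>
>   public class TuneIngest {
>     public static void main(String[] args) throws Exception {
>       // Placeholder connection details.
>       Connector conn = new ZooKeeperInstance("test", "zk1:2181")
>           .getConnector("root", new PasswordToken("secret"));
>       // Fewer re-writes of the same data: raise the ratio from 3 to 5.
>       conn.tableOperations().setProperty("mytable",
>           "table.compaction.major.ratio", "5");
>       // Only if losing recent writes on a tserver death is acceptable:
>       conn.tableOperations().setProperty("mytable",
>           "table.walog.enabled", "false");
>       // tserver.memory.maps.max is tserver-wide: set it in
>       // accumulo-site.xml and restart the tablet servers instead.
>     }
>   }
>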
> Make sure your table is pre-split, if possible, to maximize parallel
> performance during initial ingest.  Aim for 10-50 tablets per server.
>
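> As one example (hypothetical key design), if rows start with a two-digit hex
> prefix, 255 split points yield 256 tablets, about 21 per server on 12 nodes.
> Reusing conn from the sketch above:
>
>   import java.util.SortedSet;
>   import java.util.TreeSet;
>   import org.apache.hadoop.io.Text;
>
>   SortedSet<Text> splits = new TreeSet<Text>();
>   for (int i = 1; i < 256; i++) {
>     splits.add(new Text(String.format("%02x", i)));  // "01" through "ff"
>   }
>   conn.tableOperations().addSplits("mytable", splits);
>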
> If the latency of waiting for the data to be prepped does not bother you, it
> is almost always more efficient to use bulk ingest.  Can you wait 30
> minutes to queue up enough data, and then another 5-15 minutes for the
> map/reduce job to produce the RFiles?
>
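> In code, once the map/reduce job has written the RFiles, the bulk import is
> a single call (the paths here are made up; the failures directory must exist
> and be empty):
>
>   conn.tableOperations().importDirectory("mytable",
>       "/tmp/bulk/rfiles",     // RFiles written by the M/R job
>       "/tmp/bulk/failures",   // unassignable files are moved here
>       false);                 // false = keep the timestamps in the files
>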
> There's a fair amount of overhead to starting a mapper.  You may want to
> experiment with larger map jobs.
>
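> One knob for that is the minimum split size: Hadoop computes the split size
> as max(minSize, min(maxSize, blockSize)), so raising the minimum above the
> HDFS block size merges blocks into fewer, larger map tasks. A sketch:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.mapreduce.Job;
>   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>
>   Job job = new Job(new Configuration(), "accumulo ingest");
>   FileInputFormat.setMinInputSplitSize(job, 1024L * 1024 * 1024);  // 1 GB
>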
> -Eric
>
>
> On Wed, Jul 24, 2013 at 2:26 AM, Dickson, Matt MR
> <[EMAIL PROTECTED]> wrote:
>>
>> UNCLASSIFIED
>>
>> Hi,
>>
>> I'm trying to improve ingest performance on a 12-node test cluster.
>> Currently I'm loading 5 billion records in approximately 70 minutes, which
>> seems excessive.  Monitoring the job, there are 2600 map tasks (there is no
>> reduce stage, just the mapper) with 288 running at any one time.  The
>> performance seems slowest in the early stages of the job, prior to minor or
>> major compactions occurring.  Each server has 48 GB of memory, and currently
>> the Accumulo settings are based on the 3GB settings in the example config
>> directory, i.e. tserver.memory.maps.max=1G, tserver.cache.data.size=50M
>> and tserver.cache.index.size=512M.  All other settings on the table are
>> default.
>>
>> Questions:
>>
>> 1. What is Accumulo doing in the initial stage of a load and which
>> configurations should I focus on to improve this?
>> 2. At what ingest rate should I consider using the bulk ingest process
>> with rfiles?
>>
>> Thanks
>> Matt
>>
>> IMPORTANT: This email remains the property of the Department of Defence
>> and is subject to the jurisdiction of section 70 of the Crimes Act 1914. If
>> you have received this email in error, you are requested to contact the
>> sender and delete the email.
>
>