There can also be significant overhead in starting a MR job if you're using
`-libjars` to distribute your dependencies, and the effect becomes more
pronounced as the number of nodes increases. I would recommend looking
into the distributed cache (there's a quick description at
http://developer.yahoo.com/hadoop/tutorial/module5.html, and some more
googling will turn up the details). This is especially helpful
if you plan on running the same job repeatedly without changing the jars.
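For example, a rough sketch against the Hadoop 1.x API (the jar path and
job name below are made up): stage the jar in HDFS once, then pull it onto
the task classpath from the cache instead of re-shipping it with -libjars
on every submission:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CachedDepsJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // One-time upload beforehand: hadoop fs -put my-dep.jar /cache/jars/
    // Tasks then load the jar from the node-local cache copy.
    DistributedCache.addFileToClassPath(new Path("/cache/jars/my-dep.jar"), conf);
    Job job = new Job(conf, "ingest");
    // ... set mapper, input/output formats, etc. ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}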
On Wed, Jul 24, 2013 at 10:35 AM, Jeremy Kepner <[EMAIL PROTECTED]> wrote:
> (5,000,000,000 records) x (~10 entries/record) /
> ((12 nodes) x (70 minutes) x (60 seconds/minute))
> = ~50,000,000,000 entries / ~50,000 node-seconds
> = ~1,000,000 entries/sec/node
> This is consistent with other published results.
> On Wed, Jul 24, 2013 at 02:26:18AM -0400, Dickson, Matt MR wrote:
> > Hi,
> > I'm trying to improve ingest performance on a 12 node test cluster.
> > Currently I'm loading 5 billion records in approximately 70 minutes,
> > which seems excessive. Monitoring the job, there are 2600 map tasks
> > (there is no reduce stage, just the mapper) with 288 running at any
> > one time. The performance seems slowest in the early stages of the
> > job, prior to minor or major compactions occurring. Each server has
> > 48 GB of memory, and the Accumulo settings are based on the 3GB
> > settings in the example directory, i.e. tserver.memory.maps.max=1G
> > and tserver.cache.index.size=512M. All other settings on the table
> > are default.
> > Questions:
> > 1. What is Accumulo doing in the initial stage of a load and which
> > configurations should I focus on to improve this?
> > 2. At what ingest rate should I consider using the bulk ingest process
> > with RFiles?
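
Regarding question 2: once the RFiles exist (e.g. written by a MapReduce
job through AccumuloFileOutputFormat), the bulk load itself is just a
directory import in the client API. A minimal sketch against the 1.5-era
API; the instance name, zookeepers, credentials, table, and paths below
are all made up:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class BulkLoad {
  public static void main(String[] args) throws Exception {
    ZooKeeperInstance inst = new ZooKeeperInstance("test", "zkhost:2181");
    Connector conn = inst.getConnector("user", new PasswordToken("secret"));
    // /bulk/files holds the sorted RFiles; /bulk/failures must exist and
    // be empty. setTime=false keeps the timestamps already in the files.
    conn.tableOperations().importDirectory("mytable", "/bulk/files",
        "/bulk/failures", false);
  }
}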
> > Thanks
> > Matt