Accumulo >> mail # user >> Improving ingest performance [SEC=UNCLASSIFIED]


Dickson, Matt MR     2013-07-24, 06:26
Eric Newton          2013-07-24, 12:35
Christopher          2013-07-25, 02:16
Jeremy Kepner        2013-07-24, 14:35

Re: Improving ingest performance [SEC=UNCLASSIFIED]
There can also be significant overhead in starting an MR job if you're using
`-libjars` to distribute your dependencies. This effect is more pronounced as
the number of nodes increases.  I would recommend looking into the
distributed cache (there's a quick description at
http://developer.yahoo.com/hadoop/tutorial/module5.html; googling some more
will probably get you more details on the subject). This is especially
helpful if you plan on running the same job repeatedly without changing the
dependencies often.
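
As a rough, minimal sketch of that approach (the jar path /libs/mylib.jar and
the IngestJobLauncher class name below are made up for illustration, and the
dependency jar is assumed to have been copied to HDFS once ahead of time,
e.g. with `hadoop fs -put mylib.jar /libs/`), the job setup would add the jar
to the task classpath through the distributed cache instead of re-shipping it
with `-libjars` on every submission:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class IngestJobLauncher {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "accumulo-ingest");
            job.setJarByClass(IngestJobLauncher.class);

            // Dependency jar already lives on HDFS (hypothetical path);
            // tasks pull it from the distributed cache rather than the
            // client shipping it with -libjars on each job submission.
            DistributedCache.addFileToClassPath(new Path("/libs/mylib.jar"),
                    job.getConfiguration());

            // ... set mapper class, input/output formats, etc. as usual ...

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The win described above comes from the nodes being able to reuse the
localized jar across runs as long as it doesn't change, instead of every
submission pushing the dependencies out to the cluster again.
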
On Wed, Jul 24, 2013 at 10:35 AM, Jeremy Kepner <[EMAIL PROTECTED]> wrote:

> (5,000,000,000 records) x (~10 entries/record) /
> ((12 nodes) x (70 minutes) x (60 seconds/minute))
>
> = ~100,000 entries/sec/node
>
> This is consistent with other published results
>
> On Wed, Jul 24, 2013 at 02:26:18AM -0400, Dickson, Matt MR wrote:
> >    UNCLASSIFIED
> >
> >    Hi,
> >
> >    I'm trying to improve ingest performance on a 12 node test cluster.
> >    Currently I'm loading 5 billion records in approximately 70 minutes,
> >    which seems excessive.  Monitoring the job, there are 2600 map tasks
> >    (there is no reduce stage, just the mapper) with 288 running at any one
> >    time.  Performance seems slowest in the early stages of the job, prior
> >    to minor or major compactions occurring.  Each server has 48 GB of
> >    memory, and currently the Accumulo settings are based on the 3GB
> >    settings in the example config directory, i.e.
> >    tserver.memory.maps.max = 1GB, tserver.cache.data.size=50M and
> >    tserver.cache.index.size=512M.  All other settings on the table are
> >    default.
> >
> >    Questions.
> >
> >    1. What is Accumulo doing in the initial stage of a load and which
> >    configurations should I focus on to improve this?
> >    2. At what ingest rate should I consider using the bulk ingest process
> >    with rfiles?
> >
> >    Thanks
> >    Matt
> >
>