Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Accumulo, mail # user - Running Continuous Ingest on small cluster


+
Josh Elser 2012-12-20, 01:10
+
Eric Newton 2012-12-20, 01:59
Copy link to this message
-
Re: Running Continuous Ingest on small cluster
Josh Elser 2012-12-20, 02:02
Thanks for the great info, Eric!

On 12/19/2012 8:59 PM, Eric Newton wrote:
> It depends on the number of drives you have, too.
>
> I run ingesters and scanners on every slave node, a single batchwalker
> on the master node.
>
> You want at least 100K for outgoing buffers for your ingester for each
> slave node you have.
>
> A large in-memory map is probably less useful than block index to get
> your query performance to be faster.
>
> Once you start getting to 5G / tablet, the number of files per tablet
> causes merging minor compactions, which cuts your performance in half.
>   Increasing the number of files will reduce query performance, so that
> gives you a basic way to control ingest vs query performance.
>
> Pre-splitting to 20 tablets/server will give you the sweet-spot for
> ingest performance; add more if you have more drives.  It allows for for
> more parallel writes during minor compactions.
>
> If you have more than 4 drives per node, try doubling the number of
> ingesters you run.
>
> I like to tweak everything until I get the system load on each machine
> to be roughly the number of real cores after 12 hours.  This is hard to
> do without a spindle for every two CPUs.
>
> It's important to watch for failed/failing hardware.  You can sample the
> outgoing write buffers of the ingesters (using netstat).  If you see a
> node constantly having data queued going to it, try taking it out of the
> cluster.  You can do the same for datanodes, too.  At dozens of nodes,
> this is not really important.  When you get to hundreds, it becomes much
> more likely that a single node will flake out after a day of abuse.
>
> -Eric
>
>
> On Wed, Dec 19, 2012 at 8:10 PM, Josh Elser <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
>
>     In playing around with the continuous ingest collection of code
>     (ingest, walkers, batchwalkers, scanners and agitators), I found
>     myself blindly guessing at how many of each of these processes I
>     should use.
>
>     Are there some generic thoughts as to what might be an ideal
>     saturation point for N tservers?
>
>     I initially split my hosts 4 ways and ran (N/4) of each process
>     (ingest, walkers, batchwalkers, and scanners), ratcheting down the
>     number of threads ingest and batchwalkers (to avoid saturating CPU
>     and memory). Should I try to balance (query threads * query clients)
>     + (ingest threads * ingest clients) against the available threads
>     per host and adjust the BatchWriter send buffers similarly in regard
>     to memory available?
>
>     I appreciate anyone's insight.
>
>     - Josh
>
>