Accumulo >> mail # user >> Running Continuous Ingest on small cluster
Re: Running Continuous Ingest on small cluster
Thanks for the great info, Eric!

On 12/19/2012 8:59 PM, Eric Newton wrote:
> It depends on the number of drives you have, too.
>
> I run ingesters and scanners on every slave node, a single batchwalker
> on the master node.
>
> You want at least 100K of outgoing buffer in your ingester for each
> slave node you have.
>
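[Editor's note: Eric's buffer guideline reduces to simple arithmetic — at least 100K of BatchWriter send buffer per slave node, per ingester. A minimal sketch; the cluster size is an illustrative assumption, not from the thread.]

```python
SLAVE_NODES = 10                # hypothetical cluster size (assumption)
BUFFER_PER_SLAVE = 100 * 1024   # ~100K per slave node, per the advice above

def min_batchwriter_buffer(slaves, per_slave=BUFFER_PER_SLAVE):
    """Minimum total outgoing-buffer memory for one ingester process."""
    return slaves * per_slave

print(min_batchwriter_buffer(SLAVE_NODES))  # 1024000 bytes (~1 MB)
```

Note the total scales linearly with cluster size, so this is also a floor on per-ingester client memory when planning how many ingesters to run per host.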
> A large in-memory map is probably less useful than a block index for
> speeding up your queries.
>
> Once you start getting to 5G / tablet, the number of files per tablet
> causes merging minor compactions, which cuts your performance in half.
> Increasing the allowed number of files will reduce query performance,
> so that gives you a basic way to trade ingest against query performance.
>
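[Editor's note: the mechanism Eric describes can be sketched as a threshold check — once a tablet already holds the maximum allowed number of files, a flush must merge with an existing file (a merging minor compaction) instead of writing a new one. The `file_max` default below stands in for Accumulo's `table.file.max` setting and is an assumption for illustration.]

```python
def flush_kind(tablet_files, file_max=15):
    """Decide what kind of minor compaction a flush triggers.

    tablet_files: number of files the tablet currently has on disk.
    file_max: stand-in for the table.file.max limit (assumed value).
    """
    # At or over the limit, the flush must merge with an existing file,
    # which is the "cuts your performance in half" case above.
    return 'merging' if tablet_files >= file_max else 'normal'

print(flush_kind(5))   # normal
print(flush_kind(15))  # merging
```

Raising `file_max` delays the merging case (faster ingest) at the cost of more files per query seek, which is exactly the knob described above.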
> Pre-splitting to 20 tablets/server will give you the sweet spot for
> ingest performance; add more if you have more drives.  It allows for
> more parallel writes during minor compactions.
>
> If you have more than 4 drives per node, try doubling the number of
> ingesters you run.
>
> I like to tweak everything until I get the system load on each machine
> to be roughly the number of real cores after 12 hours.  This is hard to
> do without a spindle for every two CPUs.
>
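[Editor's note: the "load roughly equal to real cores" target above can be checked directly from the standard load average. A minimal sketch using only the Python standard library; the 15-minute average is used here as a stand-in for "after 12 hours".]

```python
import os

cores = os.cpu_count()                    # logical cores on this box
load1, load5, load15 = os.getloadavg()    # 1-, 5-, 15-minute load averages

# Eric's target: sustained load / real cores close to 1.0.
print(f"cores={cores} load15={load15:.1f} ratio={load15 / cores:.2f}")
```

A ratio well below 1.0 suggests room for more ingesters/scanners; well above it, the node is oversubscribed (often I/O-bound without enough spindles).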
> It's important to watch for failed/failing hardware.  You can sample the
> outgoing write buffers of the ingesters (using netstat).  If you see a
> node constantly having data queued going to it, try taking it out of the
> cluster.  You can do the same for datanodes, too.  At dozens of nodes,
> this is not really important.  When you get to hundreds, it becomes much
> more likely that a single node will flake out after a day of abuse.
>
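[Editor's note: the netstat sampling Eric describes amounts to watching the Send-Q column of `netstat -tn` — a peer whose connections persistently show queued bytes is accepting data slower than you send it. A sketch of the parsing step on canned output; the sample lines, port, and threshold are illustrative assumptions.]

```python
# Illustrative `netstat -tn` lines: proto, Recv-Q, Send-Q, local, foreign, state.
SAMPLE = """\
tcp 0      0 10.0.0.1:51234 10.0.0.7:9997 ESTABLISHED
tcp 0 873400 10.0.0.1:51240 10.0.0.9:9997 ESTABLISHED
tcp 0 912000 10.0.0.1:51242 10.0.0.9:9997 ESTABLISHED
"""

def queued_peers(netstat_text, threshold=64 * 1024):
    """Return peer hosts whose outgoing Send-Q exceeds threshold bytes."""
    peers = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0].startswith('tcp'):
            send_q = int(fields[2])                     # bytes not yet acked
            if send_q > threshold:
                peers.add(fields[4].rsplit(':', 1)[0])  # foreign host
    return peers

print(queued_peers(SAMPLE))  # {'10.0.0.9'}
```

Sampling this repeatedly and flagging hosts that appear in every sample separates a momentary burst from the persistently slow node worth pulling from the cluster.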
> -Eric
>
>
> On Wed, Dec 19, 2012 at 8:10 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
>
>     In playing around with the continuous ingest collection of code
>     (ingest, walkers, batchwalkers, scanners and agitators), I found
>     myself blindly guessing at how many of each of these processes I
>     should use.
>
>     Are there some generic thoughts as to what might be an ideal
>     saturation point for N tservers?
>
>     I initially split my hosts 4 ways and ran (N/4) of each process
>     (ingest, walkers, batchwalkers, and scanners), ratcheting down the
>     number of threads for ingest and batchwalkers (to avoid saturating
>     CPU and memory). Should I try to balance (query threads * query
>     clients) + (ingest threads * ingest clients) against the available
>     threads per host, and adjust the BatchWriter send buffers similarly
>     with regard to the memory available?
>
>     I appreciate anyone's insight.
>
>     - Josh
>
>