Re: Running Continuous Ingest on small cluster
It depends on the number of drives you have, too.

I run ingesters and scanners on every slave node, and a single batchwalker
on the master node.

Give each ingester at least 100K of outgoing buffer for every slave node
you have.
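
As a rough sketch of that arithmetic (the helper name is hypothetical, and
reading "100K" as 100 KiB is an assumption), the minimum buffer for one
ingester scales linearly with the number of slave nodes:

```python
# Hypothetical sizing helper, not part of Accumulo.
# Assumption: "100K" means 100 KiB per slave node.
BYTES_PER_SLAVE_NODE = 100 * 1024


def min_ingester_buffer_bytes(slave_nodes: int) -> int:
    """Lower bound on one ingester's outgoing buffer, in bytes."""
    if slave_nodes < 1:
        raise ValueError("need at least one slave node")
    return slave_nodes * BYTES_PER_SLAVE_NODE


if __name__ == "__main__":
    for nodes in (4, 16, 64):
        print(nodes, "nodes ->", min_ingester_buffer_bytes(nodes), "bytes")
```

The result is a floor, not a target; whatever buffer setting your ingest
client exposes should be at least this large.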

A large in-memory map is probably less useful than block index for making
your queries faster.

Once you get to about 5G per tablet, the number of files per tablet
triggers merging minor compactions, which cuts your ingest performance in
half. Allowing more files per tablet will reduce query performance, so that
gives you a basic way to trade ingest against query performance.

Pre-splitting to 20 tablets/server will give you the sweet spot for ingest
performance; add more if you have more drives.  It allows for more
parallel writes during minor compactions.
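
A minimal sketch of generating those split points, assuming row IDs are
fixed-width, zero-padded hex strings spread uniformly over 0 to 2^63 - 1
(check this against how your ingesters actually build rows; the function
name and defaults are hypothetical):

```python
# Hypothetical helper that emits evenly spaced split points.
# Assumption: rows are 16-character, zero-padded lowercase hex strings
# drawn uniformly from 0 .. 2**63 - 1.


def even_hex_splits(num_tservers, tablets_per_server=20,
                    max_row=2**63 - 1, width=16):
    """Return the split points that cut the row space into equal tablets."""
    tablets = num_tservers * tablets_per_server
    if tablets < 1:
        raise ValueError("need at least one tablet")
    span = max_row + 1
    # N tablets need N - 1 split points; integer math avoids float rounding.
    return [format(i * span // tablets, "0{}x".format(width))
            for i in range(1, tablets)]


if __name__ == "__main__":
    # One split point per line, e.g. to save as a splits file.
    for split in even_hex_splits(num_tservers=3):
        print(split)
```

With more drives per node, raise `tablets_per_server` rather than changing
the spacing logic.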

If you have more than 4 drives per node, try doubling the number of
ingesters you run.

I like to tweak everything until I get the system load on each machine to
be roughly the number of real cores after 12 hours.  This is hard to do
without a spindle for every two CPUs.

It's important to watch for failed/failing hardware.  You can sample the
outgoing write buffers of the ingesters (using netstat).  If you see a node
constantly having data queued going to it, try taking it out of the
cluster.  You can do the same for datanodes, too.  At dozens of nodes, this
is not really important.  When you get to hundreds, it becomes much more
likely that a single node will flake out after a day of abuse.
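
One way to automate that sampling, as a sketch: parse the text printed by
`netstat -tn` on an ingester host and total the Send-Q column per remote
host.  This assumes the usual Linux column layout (Proto, Recv-Q, Send-Q,
Local Address, Foreign Address, State); the function names are hypothetical.

```python
from collections import defaultdict


def send_queue_by_peer(netstat_output):
    """Sum Send-Q bytes per remote host from `netstat -tn` style text."""
    totals = defaultdict(int)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banner and header lines; keep only tcp/tcp6 rows.
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue
        if not fields[2].isdigit():
            continue
        peer_host = fields[4].rsplit(":", 1)[0]  # drop the port
        totals[peer_host] += int(fields[2])
    return dict(totals)


def constantly_queued(samples, threshold_bytes=0):
    """Hosts whose queued bytes exceeded the threshold in every sample."""
    if not samples:
        return set()
    over = [{host for host, queued in sample.items()
             if queued > threshold_bytes}
            for sample in samples]
    return set.intersection(*over)
```

Take a sample every few seconds for a while and pass the list of results to
`constantly_queued`; a host that shows up there is a candidate for removal.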

-Eric
On Wed, Dec 19, 2012 at 8:10 PM, Josh Elser <[EMAIL PROTECTED]> wrote:

> In playing around with the continuous ingest collection of code (ingest,
> walkers, batchwalkers, scanners and agitators), I found myself blindly
> guessing at how many of each of these processes I should use.
>
> Are there some generic thoughts as to what might be an ideal saturation
> point for N tservers?
>
> I initially split my hosts 4 ways and ran (N/4) of each process (ingest,
> walkers, batchwalkers, and scanners), ratcheting down the number of threads
> for ingest and batchwalkers (to avoid saturating CPU and memory). Should I try
> to balance (query threads * query clients) + (ingest threads * ingest
> clients) against the available threads per host and adjust the BatchWriter
> send buffers similarly in regard to memory available?
>
> I appreciate anyone's insight.
>
> - Josh
>