Accumulo, mail # user - BatchWriter performance on 1.4


Re: BatchWriter performance on 1.4
Keith Turner 2013-09-20, 18:43
On Fri, Sep 20, 2013 at 12:47 PM, Slater, David M.
<[EMAIL PROTECTED]> wrote:

> I was using flush() after sending a bunch of mutations to the batchwriters
> to limit their latency. I thought it would normally flush the buffer to
> ensure that the maxLatency is not violated. If the maxLatency is quite
> large, how do I ensure that it doesn’t wait a long time before writing?
>

If you are constantly writing to a batch writer, then it will be continually
flushing.  The example debug output I posted was from running
org.apache.accumulo.test.TestIngest (it may be in another package before
1.6).  I ran the following command to write a million random mutations.

accumulo org.apache.accumulo.test.TestIngest --debug -u root -p secret
--timestamp 1 --size 50 --random 56 --rows 1000000 --start 0 --cols 1

I think it defaults to 50M of memory for the batch writer.  It was
continually sending batches of 80K mutations every 0.45 seconds.  So in
that case the latency of a mutation is probably less than two seconds.  But
this is just one tablet server; the behavior would be different on multiple
tablet servers.

In this example if I set the max latency on the batch writer to 30 secs,
then it would never kick in and force a flush.
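
As a rough check on the numbers above, the arithmetic can be sketched in plain
Java.  This is only a back-of-the-envelope estimate, not Accumulo code: the
class and method names are made up for illustration, and it assumes the send
rate stays constant for the whole run.  At 80K mutations per 0.45 seconds the
rate comes to about 178K mutations per second.

```java
public class BatchRateEstimate {
    /** Sustained send rate when one batch of batchSize mutations goes out every secondsPerBatch. */
    public static double mutationsPerSecond(long batchSize, double secondsPerBatch) {
        return batchSize / secondsPerBatch;
    }

    /** Time to send totalMutations, assuming the batch rate stays constant (an assumption). */
    public static double secondsToSend(long totalMutations, long batchSize, double secondsPerBatch) {
        return ((double) totalMutations / batchSize) * secondsPerBatch;
    }

    public static void main(String[] args) {
        long batchSize = 80_000;          // mutations per batch, from the debug output
        double secondsPerBatch = 0.45;    // time between batches, from the debug output
        long totalMutations = 1_000_000;  // --rows 1000000
        double maxLatencySeconds = 30.0;  // the max latency discussed above

        System.out.printf("rate: %.0f mutations/s%n",
                mutationsPerSecond(batchSize, secondsPerBatch));
        System.out.printf("time for %d mutations: %.2f s%n",
                totalMutations, secondsToSend(totalMutations, batchSize, secondsPerBatch));

        // Crude comparison: batches already leave far more often than the
        // latency limit, so the limit would not be what triggers a send.
        System.out.println("latency limit would force a flush: "
                + (secondsPerBatch > maxLatencySeconds));
    }
}
```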

>
> If the returned batchscanners are all thread safe, then I’m still going to
> have the bottleneck of their synchronized addMutations method, correct?
>

In my experience, that's not a bottleneck, but you will need to confirm this
for your situation (hopefully the debug output can help you w/ this).  If
the M threads adding mutations to a queue are going at a faster rate than
the N threads taking mutations and sending them, then the synchronization
around the queue is not the bottleneck.  M threads probably could add to a
synchronized queue at a rate of millions of mutations per second.  N
threads can probably only serialize and send tens or hundreds of thousands
of mutations per second.
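
One rough way to check the enqueue side in isolation is a sketch like the
following.  It is plain Java and does not use Accumulo at all: SharedBuffer is
a hypothetical stand-in for the batch writer's shared queue, and the thread
and entry counts are arbitrary.  It only measures how fast several threads can
add to one synchronized buffer; the result will vary by machine and should be
compared against the send rate you see in your own debug output.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedQueueSketch {
    /** Hypothetical stand-in for a shared, synchronized mutation buffer. */
    static final class SharedBuffer {
        private final List<byte[]> pending = new ArrayList<>();
        private long totalAdded = 0;

        synchronized void add(byte[] mutation) {
            pending.add(mutation);
            totalAdded++;
            // Drop the entries once the list gets large, as a crude stand-in
            // for sender threads draining the buffer.
            if (pending.size() >= 100_000) {
                pending.clear();
            }
        }

        synchronized long totalAdded() {
            return totalAdded;
        }
    }

    /** Starts the given number of threads, each adding perThread entries; returns the total added. */
    public static long addFromThreads(int threads, int perThread) {
        SharedBuffer buffer = new SharedBuffer();
        byte[] payload = new byte[50]; // same size as --size 50, otherwise arbitrary
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    buffer.add(payload);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return buffer.totalAdded();
    }

    public static void main(String[] args) {
        int threads = 4;
        int perThread = 1_000_000;
        long start = System.nanoTime();
        long added = addFromThreads(threads, perThread);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("added %d entries in %.3f s (%.0f adds/s)%n",
                added, seconds, added / seconds);
    }
}
```

If the adds-per-second figure is well above the rate at which batches are
actually being sent, the synchronized add is unlikely to be what limits you.
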
>
> I’m looking for “org.apache.accumulo.client.impl” in the
> log4j.properties, generic_logger.xml, and the other config files, but can’t
> locate it. Do I need to create a new entry for it there?
>

You can add something to a log4j.properties file that's on the classpath, or
you can try adding something like the following to your code.  I had the
package wrong; it's correct below.

Logger.getLogger("org.apache.accumulo.core.client.impl").setLevel(Level.TRACE);
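
For the properties-file route, the equivalent entry would look something like
this.  It assumes log4j 1.x properties syntax and that the file you edit is the
one your client actually loads from its classpath.

```properties
# Enable TRACE output for the batch writer client internals
log4j.logger.org.apache.accumulo.core.client.impl=TRACE
```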

> Thanks,
> David
>
> *From:* Keith Turner [mailto:[EMAIL PROTECTED]]
> *Sent:* Thursday, September 19, 2013 7:01 PM
> *To:* [EMAIL PROTECTED]
> *Subject:* Re: BatchWriter performance on 1.4
>
> On Thu, Sep 19, 2013 at 5:08 PM, Slater, David M. <[EMAIL PROTECTED]>
> wrote:
>
> > Thanks Keith, I’m looking at it now. It appears to be what I would want.
> > As for the proper usage…
> >
> > Would I create one using the Connector,
> > then call .getBatchWriter() for each of the tables I’m interested in,
> > add data to each of the BatchWriters returned,
>
> yes.
>
> > and then hit flush() when I want all of that to get written?
>
> Why are you calling flush()?  Doing this frequently will increase RPC
> overhead and lower throughput.
>
> > Would the individual batch writers spawned by the multiTableBatchWriter
> > still have synchronized addMutations() methods so I would have to worry
> > about blocking still, or would that all happen at the flush() method?
>
> The returned batch writers are thread safe. They all add to the same
> queue/buffer in a synchronized manner.  Calling flush() on any of the
> batch writers returned from getBatchWriter() will block the others.
>
> If you set the log4j log level to TRACE for
> org.apache.accumulo.client.impl you can see output like the following.
> Binning is the process of taking each mutation and deciding which tablet
> and tablet server it goes to.