-
Read/Write Invariants Questions
Sukant Hajra 2012-05-16, 04:15
Hi,
There's a couple of sanity checks I wanted to run by the list:
1. I see in the documentation that mutations may be partially read unless using IsolatedScanners, which is a way to have atomicity for applications. Is there any other mechanism for atomic operations to know about?
2. I'm assuming that a flushed write to a row is not guaranteed to be sensed by a subsequent read (no immediate consistency). Is this correct?
3. When using a BatchWriter does the order in which mutations are added make any reliable assertion on the order that these mutations are sensed by subsequent reads? Given two mutations A and B, I'd like to assert that any node sensing B will also sense A.
4. I'm going to have a long standing thread doing batch writing. Is it reasonable/safe to give this thread an open BatchWriter (making sure to close the writer when shutting down the thread)? Or might this cause a memory leak?
5. I'm assuming that BatchWriter is minimally blocking. Is there any merit to or precedent of load balancing across multiple writers? Or would that be redundant to optimizations already built into BatchWriter?
Thanks a lot for helping me better understand Accumulo. Feel free to point me to documentation I might have missed.
-Sukant
-
Re: Read/Write Invariants Questions
Keith Turner 2012-05-16, 07:42
On Wed, May 16, 2012 at 12:15 AM, Sukant Hajra <[EMAIL PROTECTED]> wrote:
> Hi,
>
> There's a couple of sanity checks I wanted to run by the list:
>
> 1. I see in the documentation that mutations may be partially read unless
> using IsolatedScanners, which is a way to have atomicity for applications.
> Is there any other mechanism for atomic operations to know about?
For the batch scanner, take a look at the WholeRowIterator and the batch scanner Javadocs.
> 2. I'm assuming that a flushed write to a row is not guaranteed to be
> sensed by a subsequent read (no immediate consistency). Is this correct?
After a call to flush() on a batchwriter returns, any mutations written before the call to flush should be immediately visible.
> 3. When using a BatchWriter does the order in which mutations are added
> make any reliable assertion on the order that these mutations are sensed by
> subsequent reads? Given two mutations A and B, I'd like to assert that any
> node sensing B will also sense A.
No, the order does not matter. The batch writer will have multiple background threads writing mutations to different tablet servers. So the mutations will become visible at different times irrespective of the order you add them. For the A and B case, you could write both mutations and then call flush. After the flush, both will be visible. However during the flush operation one may be visible and the other not visible.
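The write-both-then-flush pattern described above can be sketched with a toy model. This is not the Accumulo API: `ToyBatchWriter`, its three-thread pool, and `isVisible` are hypothetical stand-ins, using only the JDK, for a writer whose background threads apply mutations in no particular order and whose `flush()` blocks until everything queued so far has been applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy stand-in for a batch writer; an assumption for illustration, NOT Accumulo code. */
final class ToyBatchWriter implements AutoCloseable {
    // Stands in for the data that readers can see.
    private final Map<String, String> visible = new ConcurrentSkipListMap<>();
    // Background senders; completion order is not tied to the order of addMutation calls.
    private final ExecutorService senders = Executors.newFixedThreadPool(3);
    private final List<Future<?>> pending = new ArrayList<>();

    /** Queues a mutation; it becomes visible at some later, unspecified time. */
    synchronized void addMutation(String row, String value) {
        pending.add(senders.submit(() -> { visible.put(row, value); }));
    }

    /** Blocks until every mutation queued before this call has been applied. */
    synchronized void flush() {
        try {
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while flushing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a mutation failed", e.getCause());
        }
        pending.clear();
    }

    boolean isVisible(String row) {
        return visible.containsKey(row);
    }

    /** Flushes whatever is still queued, then shuts down the sender threads. */
    @Override
    public void close() {
        flush();
        senders.shutdown();
    }
}
```

In this model, a reader that polls `isVisible("B")` before `flush()` returns can see B without A, or the reverse. Only after `flush()` returns are both guaranteed to be there.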
> 4. I'm going to have a long standing thread doing batch writing. Is it
> reasonable/safe to give this thread an open BatchWriter (making sure to
> close the writer when shutting down the thread)? Or might this cause a
> memory leak?
When you close a batchwriter it flushes any data it has in memory and shuts down its thread pool.
> 5. I'm assuming that BatchWriter is minimally blocking. Is there any merit
> to or precedent of load balancing across multiple writers? Or would that
> be redundant to optimizations already built into BatchWriter?
It's safe for multiple threads to use one batchwriter. This may be more efficient up to the point where there are so many threads that it causes lock contention. The nice thing about having multiple threads share one batch writer is that the background threads sending data to tablet servers will presumably have larger batches. This should result in fewer network round trips. It also allows larger batches for the write ahead log on the server side. Write ahead log batching should be less of a concern in 1.5 w/ group commit.
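To make the "larger batches, fewer round trips" argument concrete, here is a minimal sketch under stated assumptions. `CountingWriter` and `SharedWriterDemo` are hypothetical toy classes, not Accumulo code: the writer sends a batch whenever its buffer reaches `maxBatch` entries or on `flush()`, and simply counts the sends. The real BatchWriter decides when to send based on memory, latency, and tablet server, so the numbers only illustrate the trend.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (an assumption, not Accumulo): counts batches sent by a buffered writer. */
final class CountingWriter {
    private final int maxBatch;
    private final List<String> buffer = new ArrayList<>();
    private int batchesSent = 0;

    CountingWriter(int maxBatch) {
        this.maxBatch = maxBatch;
    }

    /** Thread-safe: buffers the mutation and sends once the buffer is full. */
    synchronized void addMutation(String mutation) {
        buffer.add(mutation);
        if (buffer.size() >= maxBatch) {
            send();
        }
    }

    /** Sends whatever is buffered, if anything. */
    synchronized void flush() {
        if (!buffer.isEmpty()) {
            send();
        }
    }

    synchronized int batchesSent() {
        return batchesSent;
    }

    // One simulated network round trip; only called while holding the lock.
    private void send() {
        batchesSent++;
        buffer.clear();
    }
}

final class SharedWriterDemo {
    /** All producer threads add to one shared writer; returns batches sent. */
    static int sharedBatches(int producers, int perProducer, int maxBatch) {
        CountingWriter shared = new CountingWriter(maxBatch);
        List<Thread> threads = new ArrayList<>();
        for (int p = 0; p < producers; p++) {
            Thread t = new Thread(() -> {
                for (int i = 0; i < perProducer; i++) {
                    shared.addMutation("m" + i);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        shared.flush();
        return shared.batchesSent();
    }

    /** Each producer gets its own writer and flushes it; returns total batches sent. */
    static int separateBatches(int producers, int perProducer, int maxBatch) {
        int total = 0;
        for (int p = 0; p < producers; p++) {
            CountingWriter own = new CountingWriter(maxBatch);
            for (int i = 0; i < perProducer; i++) {
                own.addMutation("m" + i);
            }
            own.flush();
            total += own.batchesSent();
        }
        return total;
    }
}
```

With four producers writing 25 mutations each and a batch limit of 50, the shared writer fills whole batches while each separate writer sends a half-empty one.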
> Thanks a lot for helping me better understand Accumulo. Feel free to point me
> to documentation I might have missed.
>
> -Sukant
-
Re: Read/Write Invariants Questions
Sukant Hajra 2012-05-16, 20:32
Excerpts from Keith Turner's message of Wed May 16 02:42:17 -0500 2012:
> After a call to flush() on a batchwriter returns, any mutations
> written before the call to flush should be immediately visible.
I don't want to belabor the point, but I just want to be sure I'm not interpreting your response too casually. From your response, I'm now under the impression that a flush blocks until the server sends back an acknowledgment that the mutation has been written to the log. Then all subsequent reads look not only at HDFS, but also the write logs to make sure they have the most consistent view? Is this the case? I appreciate the confirmation to save me a dig into the source code.
If the reads are truly immediately consistent, has there ever been talk of making inconsistent reads for the sake of improving read times? Or is it all in the noise with respect to network speeds and not worth the effort?
Also, if flush blocks waiting for an acknowledgment, I'm assuming that the writer will throw a MutationsRejectedException. If this happens, is the BatchWriter still usable? Or should I close it out and get a new one? The connector should be fine, though, right? I'm just trying to make sure I have my error handling logic sanely configured.
Other than that, thanks a lot for your prompt responses. They really helped.
-Sukant
-
Re: Read/Write Invariants Questions
William Slacum 2012-05-16, 20:39
To answer the first part, the batch writer will block until it finishes flushing. The mutations are applied to the write ahead log and an in-memory map on the tserver. The write ahead logs are used for failover, and the resulting keys are kept in the memory map until the tserver flushes (minor compaction) or reorganizes its RFiles (major compaction). To dig through the source, look at the TabletServerBatchWriter.
On Wed, May 16, 2012 at 1:32 PM, Sukant Hajra <[EMAIL PROTECTED]> wrote:
> Excerpts from Keith Turner's message of Wed May 16 02:42:17 -0500 2012:
>> After a call to flush() on a batchwriter returns, any mutations
>> written before the call to flush should be immediately visible.
>
> I don't want to belabor the point, but I just want to be sure I'm not
> interpreting your response too casually. From your response, I'm now under the
> impression that a flush blocks until the server sends back an acknowledgment
> that the mutation has been written to the log. Then all subsequent reads look
> not only at HDFS, but also the write logs to make sure they have the most
> consistent view? Is this the case? I appreciate the confirmation to save me a
> dig into the source code.
>
> If the reads are truly immediately consistent, has there ever been talk of
> making inconsistent reads for the sake of improving read times? Or is it all
> in the noise with respect to network speeds and not worth the effort?
>
> Also, if flush blocks waiting for an acknowledgment, I'm assuming that the
> writer will throw a MutationsRejectedException. If this happens, is the
> BatchWriter still usable? Or should I close it out and get a new one? The
> connector should be fine, though, right? I'm just trying to make sure I have
> my error handling logic sanely configured.
>
> Other than that, thanks a lot for your prompt responses. They really helped.
>
> -Sukant
-
Re: Read/Write Invariants Questions
Keith Turner 2012-05-16, 21:10
On Wed, May 16, 2012 at 4:32 PM, Sukant Hajra <[EMAIL PROTECTED]> wrote:
> Excerpts from Keith Turner's message of Wed May 16 02:42:17 -0500 2012:
>> After a call to flush() on a batchwriter returns, any mutations
>> written before the call to flush should be immediately visible.
>
> I don't want to belabor the point, but I just want to be sure I'm not
> interpreting your response too casually. From your response, I'm now under the
> impression that a flush blocks until the server sends back an acknowledgment
> that the mutation has been written to the log. Then all subsequent reads look
> not only at HDFS, but also the write logs to make sure they have the most
> consistent view? Is this the case? I appreciate the confirmation to save me a
> dig into the source code.
Reads look in the in-memory map that Bill mentioned, not the walog. It is sorted and supports efficient lookups.
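As a rough illustration of why a sorted in-memory map makes reads cheap, here is a minimal sketch using the JDK's `ConcurrentSkipListMap`. `ToyMemoryMap` and its methods are hypothetical and say nothing about how the tserver's actual map is implemented; the point is only that a sorted structure can answer a range scan without looking at any log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy sorted map standing in for a tablet server's in-memory map (an assumption). */
final class ToyMemoryMap {
    private final ConcurrentSkipListMap<String, String> map = new ConcurrentSkipListMap<>();

    /** Applies a write; it is visible to scans as soon as this returns. */
    void apply(String key, String value) {
        map.put(key, value);
    }

    /** Returns the keys in [startRow, endRow) in sorted order. */
    List<String> scan(String startRow, String endRow) {
        return new ArrayList<>(map.subMap(startRow, true, endRow, false).keySet());
    }
}
```

Writes can arrive in any order; `scan("row1", "row3")` still returns the matching keys sorted, because the map keeps them ordered on insert.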
> If the reads are truly immediately consistent, has there ever been talk of
> making inconsistent reads for the sake of improving read times? Or is it all
> in the noise with respect to network speeds and not worth the effort?
Reads are consistent if you call flush beforehand. If you are just streaming a lot of mutations to a batchwriter and not flushing, you do not know when the mutations will be visible. Streaming without flushing is a very efficient way to write lots of data. The frequency of flushing will affect the write speed, but should not impact the read speed (other than the fact that really frequent flushing may cause more disk contention between writes and reads).
> Also, if flush blocks waiting for an acknowledgment, I'm assuming that the
> writer will throw a MutationsRejectedException. If this happens, is the
> BatchWriter still usable? Or should I close it out and get a new one? The
> connector should be fine, though, right? I'm just trying to make sure I have
> my error handling logic sanely configured.
I think if flush throws an exception, it leaves the batch writer in a state such that future calls to addMutation() will throw an exception.
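If that is right, the error handling would be to discard the failed writer and create a new one. Here is a minimal sketch of that pattern under stated assumptions: `FailableWriter` and `ReplaceOnFailure` are hypothetical toy classes, `IllegalStateException` stands in for MutationsRejectedException, and the "once failed, always failed" behavior is modeled on the description above rather than verified against the source.

```java
import java.util.List;
import java.util.function.Supplier;

/** Toy writer (an assumption, not Accumulo) that becomes unusable after a failed flush. */
final class FailableWriter {
    private final boolean rejecting;
    private boolean failed = false;
    private int buffered = 0;
    private int applied = 0;

    FailableWriter(boolean rejecting) {
        this.rejecting = rejecting;
    }

    void addMutation(String mutation) {
        if (failed) {
            throw new IllegalStateException("writer has failed; create a new one");
        }
        buffered++;
    }

    void flush() {
        if (rejecting) {
            failed = true; // every later addMutation call will throw
            throw new IllegalStateException("mutations rejected");
        }
        applied += buffered;
        buffered = 0;
    }

    int applied() {
        return applied;
    }
}

final class ReplaceOnFailure {
    /**
     * Writes and flushes the batch. If the writer fails, it is abandoned and the
     * batch is retried once on a fresh writer; returns the writer to keep using.
     */
    static FailableWriter writeAll(FailableWriter writer,
                                   Supplier<FailableWriter> fresh,
                                   List<String> batch) {
        try {
            for (String m : batch) {
                writer.addMutation(m);
            }
            writer.flush();
            return writer;
        } catch (IllegalStateException e) {
            FailableWriter replacement = fresh.get();
            for (String m : batch) {
                replacement.addMutation(m);
            }
            replacement.flush();
            return replacement;
        }
    }
}
```

Blindly retrying only makes sense when the failure was transient. If mutations were rejected for a reason that will repeat, such as a constraint violation, the same batch would fail on the new writer too.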
> Other than that, thanks a lot for your prompt responses. They really helped.
>
> -Sukant