On Jun 3, 2013, at 12:41 AM, Thawan Kooburat <[EMAIL PROTECTED]> wrote:
> From my understanding, ZooKeeper currently maintains data integrity by
> validating all the data before loading it into memory. Disk-related
> errors on one of the machines won't affect the correctness of the ensemble
> since we serve client or peer requests from in-memory data only.
Let me try to be a bit more concrete. Say that a txn T in a log file is arbitrarily corrupted, and that T has been acknowledged by 3 servers (S1, S2, S3) in an ensemble of 5 servers (S1, S2, S3, S4, S5). Let's assume that S3 is the server whose log has the corrupted T. Next, say that S3 restarts and S5 becomes the leader, supported by S3 and S4. We can elect S5 because, once S3 drops the corrupted T (and any transaction it may have after T), S3 has the same history as S5, which doesn't have T. If this can happen, then we have lost T even though T was acknowledged by a quorum.
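To make the scenario easier to check, here is a minimal sketch of it. It is a toy model, not ZooKeeper code: logs are plain lists of txn names, history length stands in for the zxid comparison done during election, and the names `usable_history` and `can_lead` are made up for this illustration.

```python
# Toy model of the 5-server scenario (assumption: not ZooKeeper's election code).
QUORUM = 3  # majority of a 5-server ensemble


def usable_history(log, corrupted=None):
    """Return the prefix a server can load: everything before the corrupted txn."""
    if corrupted is None or corrupted not in log:
        return list(log)
    return log[:log.index(corrupted)]


def can_lead(candidate, voters, histories):
    """Assumption: a voter supports a candidate whose history is at least as long."""
    return all(len(histories[candidate]) >= len(histories[v]) for v in voters)


# T was logged and acknowledged by S1, S2 and S3 only.
logs = {
    "S1": ["A", "B", "T"],
    "S2": ["A", "B", "T"],
    "S3": ["A", "B", "T"],
    "S4": ["A", "B"],
    "S5": ["A", "B"],
}
acked_t = {s for s, log in logs.items() if "T" in log}
t_committed = len(acked_t) >= QUORUM

# S3 restarts, finds T corrupted, and drops T and anything after it.
recovered = dict(logs)
recovered["S3"] = usable_history(logs["S3"], corrupted="T")

# S5 is elected with the support of S3 and S4 (S1 and S2 do not take part).
electorate = ["S3", "S4", "S5"]
s5_elected = len(electorate) >= QUORUM and can_lead("S5", electorate, recovered)

# The new leader's history does not contain T, so T is lost.
t_lost = t_committed and s5_elected and "T" not in recovered["S5"]
print(t_committed, s5_elected, t_lost)
```

In this model all three flags come out true: T was committed, S5 still wins the election, and T is gone from the leader's history.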
In any case, I'm interested in defining precisely what integrity guarantees we provide for txn logs/snapshots. The point I was trying to convey is that we can't tolerate arbitrary corruptions of the txn log. We can only tolerate (and I'm not convinced there is a reason to push it further) corruption of a suffix of the txn log whose txns have not been acknowledged, because the server crashed before they were completely flushed to disk.
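A rough sketch of what "tolerate only a corrupt suffix" could look like at recovery time follows. The record format here (4-byte length, 4-byte CRC32, payload) is hypothetical and chosen only for the example; it is not the actual txn log format, and `encode`, `recover` and `CorruptLogError` are illustrative names. The sketch also assumes that a damaged final record is always an incomplete flush, which the log alone cannot prove.

```python
import struct
import zlib

# Hypothetical record layout: payload length, CRC32 of payload, then payload.
HEADER = struct.Struct(">II")


class CorruptLogError(Exception):
    """Raised when corruption is found somewhere other than the tail."""


def encode(payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def recover(data: bytes):
    """Return the valid txns, tolerating a damaged record only at the tail."""
    txns = []
    pos = 0
    while pos < len(data):
        if pos + HEADER.size > len(data):
            break  # header cut short: partial write at the tail
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        end = start + length
        if end > len(data):
            break  # payload cut short: partial write at the tail
        payload = data[start:end]
        if zlib.crc32(payload) != crc:
            if end == len(data):
                break  # last record is bad: assumed to be an incomplete flush
            # A bad record followed by more records cannot be a torn tail.
            raise CorruptLogError(f"bad checksum at offset {pos}")
        txns.append(payload)
        pos = end
    return txns


log = encode(b"t1") + encode(b"t2") + encode(b"t3")
print(recover(log[:-1]))  # the truncated last record is dropped, the rest is kept
```

One limitation of this sketch: a corrupted length field in the middle of the log can point past the end of the file and be mistaken for a torn tail, so a real check would need something stronger than this.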
> However, in ZK-1413, the leader uses the on-disk txnlog to synchronize with the
> learner. It seems like we have to keep checking txnlog integrity every time
> we read something from disk. And I don't think the integrity check is cheap
> either, since we have to scan the entire history (starting from a given zxid).
For the average case, this might not be too bad. If I remember correctly, it is possible to calibrate the number of transactions a server is willing to read from disk when deciding whether to send a snapshot.
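As a sketch of the kind of decision I have in mind: the function and parameter names below (`choose_sync`, `max_txns_from_disk`) are hypothetical and do not come from the patch, and zxids are treated as plain counters with the learner never ahead of the leader, which is a simplification.

```python
def choose_sync(learner_zxid, leader_zxid, oldest_logged_zxid, max_txns_from_disk):
    """Decide whether to send a diff read from the txn log or a full snapshot.

    Assumptions for this sketch: zxids are dense counters and
    learner_zxid <= leader_zxid.
    """
    gap = leader_zxid - learner_zxid
    if gap == 0:
        return "DIFF"  # nothing missing; an empty diff is enough
    if learner_zxid + 1 < oldest_logged_zxid:
        return "SNAP"  # the on-disk log does not reach back far enough
    if gap > max_txns_from_disk:
        return "SNAP"  # reading (and validating) that many txns costs too much
    return "DIFF"


print(choose_sync(95, 100, 50, 10))  # small gap, covered by the log
print(choose_sync(60, 100, 50, 10))  # gap exceeds the configured bound
```

With a bound like this, the cost of validating the log on each read is capped by the bound rather than by the length of the whole history.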
> If we cache txnlog in memory, we only need to do integrity check once and
> we can also build some indexes on top of it to support more efficient
> lookup. However, this is going to consume a lot of memory.
Agreed, although I'd rather generate a few numbers before we claim it is bad and that we need a cache.
> On the other hand, these features (ZK-1413, ZK-22, ZK-1416) don't really
> need the entire txnlog to be valid. The server can always say to the
> client that the history needed to answer the request is too old and there
> is a fallback mechanism that allows the system to make progress correctly.
> For example, in ZK-1413, the leader can fall back to sending a snapshot to
> the learner if it cannot use the txnlog for any reason.
Sure, this covers some cases, but I don't see how it covers the case above. I think it doesn't, right?
> Thawan Kooburat
> On 6/1/13 8:18 AM, "Flavio Junqueira" <[EMAIL PROTECTED]> wrote:
>> I think this discussion has been triggered by a discussion we have had
>> for ZOOKEEPER-1413. In the patch Thawan proposed there, there was a
>> method that reads txn logs and simply logs an error in the case of an
>> exception while reading the log. I raised the question of whether we
>> should do more than simply logging an error message and the discussion
>> about txn log started, but it seems to be a discussion that is out of the
>> scope of 1413, so we thought it would be good to have this discussion
>> Here are a few thoughts about the issue. We can't really tolerate
>> arbitrary corruptions of the txn log because it could imply that we lose
>> quorum for a txn that has been processed and a response has been returned
>> to the client. In the case that a faulty server only partially writes a
>> txn into a txn log because it crashes, the logged txn is corrupt, but we
>> don't really have an issue because the server has not acked the txn, so
>> if there is a quorum for that txn, the faulty server is not really part
>> of it. Cases like this I believe we can do something about, but more