On 6/3/13 9:54 AM, "Flavio Junqueira" <[EMAIL PROTECTED]> wrote:
>On Jun 3, 2013, at 12:41 AM, Thawan Kooburat <[EMAIL PROTECTED]> wrote:
>> From my understanding, ZooKeeper currently maintains data integrity by
>> validating all the data before loading it into memory. Disk-related
>> errors on one of the machines won't affect the correctness of the system,
>> since we are serving client or peer requests from in-memory data only.
>Let me try to be a bit more concrete. Say that we arbitrarily corrupt a
>txn T in a log file, and that T has been acknowledged by 3 servers (S1,
>S2, S3) in an ensemble of 5 servers (S1, S2, S3, S4, S5). Let's assume
>that S3 has corrupted T in its log. Next say that S5 becomes the leader,
>supported by S3 and S4 (S3 has restarted). We can elect S5 because it has
>the same history as S3: S3's corrupted T (and any transaction it may have
>after T) is ignored, and S5 doesn't have T either. If this can happen,
>then we have lost T even though T was acknowledged by a quorum.
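To make the failure mode easier to poke at, here is a minimal, hypothetical sketch of the election outcome. The field names are illustrative, and real ZooKeeper election also compares epochs; the tie-break on server id, however, matches its behavior.

```python
# Illustrative sketch (not ZooKeeper code) of the scenario above. A server
# votes with its last acknowledged zxid; the election picks the highest
# zxid, breaking ties by the higher server id.

def elect_leader(voters):
    """Pick the voter with the highest (last_zxid, sid) pair."""
    return max(voters, key=lambda s: (s["last_zxid"], s["sid"]))

T = 5  # zxid of the acknowledged txn that S3 later corrupts

# S1, S2, S3 acknowledged T; S3's copy of T is corrupted, so after a
# restart it can only report T-1 (T and anything after it is ignored).
servers = {
    "S1": {"name": "S1", "sid": 1, "last_zxid": T},
    "S2": {"name": "S2", "sid": 2, "last_zxid": T},
    "S3": {"name": "S3", "sid": 3, "last_zxid": T - 1},  # corrupted log
    "S4": {"name": "S4", "sid": 4, "last_zxid": T - 1},
    "S5": {"name": "S5", "sid": 5, "last_zxid": T - 1},
}

# Quorum {S3, S4, S5} elects without S1/S2: S5 wins with a history that
# ends at T-1, so the acknowledged txn T is lost.
leader = elect_leader([servers[s] for s in ("S3", "S4", "S5")])
print(leader["name"], leader["last_zxid"])  # S5 4
```

Note that if S1 or S2 takes part in the vote, it wins with zxid T and T survives; the loss requires both up-to-date servers to be absent at the same time as S3's corruption.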
>In any case, I'm interested in defining precisely what integrity
>guarantees we provide for txn logs/snapshots. The point I was trying to
>convey is that we can't tolerate arbitrary corruptions of the txn log. We
>can only tolerate (and I'm not convinced there is a reason to push it
>further) corruption of a suffix of the txn log that has not been
>acknowledged, and the txns in this suffix haven't been acknowledged
>because the server crashed before they were completely flushed to disk.
I believe the problem you are describing here is essentially that we have
more failures than we can tolerate. Ideally, if S1 or S2 participated in
the next round of leader election, S1 or S2 should be elected as the
leader because they have the highest zxid. S3 has txnlog corruption at T,
so it should report its zxid as T-1 during leader election.
Because of how leader election works, corruption on less than a majority
of servers should not affect correctness. However, in ZK-1413, ZK-22, and
ZK-1416, a server uses its local txnlog to respond to a request, so those
features are vulnerable to a single machine's disk corruption or operator
error. Still, it won't affect correctness if we can detect the corruption
correctly.
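On detection: ZooKeeper already stores an Adler32 checksum with each txn log record, so per-record validation on read is possible. Below is a minimal sketch of that idea; the framing (a length + checksum header per record) is my simplification, not the real on-disk format. On a mismatch the reader keeps only the valid prefix, so a corrupted server would advertise T-1 rather than T.

```python
# Hedged sketch of per-record checksum validation, similar in spirit to
# the Adler32 check ZooKeeper applies to each txn log record. The record
# framing here is illustrative only.

import struct
import zlib

def append_txn(log: bytearray, zxid: int, payload: bytes) -> None:
    """Append one record: 4-byte length, 4-byte Adler32, then the body."""
    body = struct.pack(">q", zxid) + payload
    log += struct.pack(">II", len(body), zlib.adler32(body)) + body

def read_valid_prefix(log: bytes):
    """Return zxids of records whose checksums verify, stopping at the
    first corruption (everything from there on is treated as invalid)."""
    zxids, off = [], 0
    while off + 8 <= len(log):
        length, crc = struct.unpack_from(">II", log, off)
        body = log[off + 8 : off + 8 + length]
        if len(body) < length or zlib.adler32(body) != crc:
            break  # corrupted suffix: ignore it
        zxids.append(struct.unpack_from(">q", body)[0])
        off += 8 + length
    return zxids

log = bytearray()
for z in (1, 2, 3):
    append_txn(log, z, b"create /node%d" % z)
log[40] ^= 0xFF  # flip a byte inside the second record's body
print(read_valid_prefix(bytes(log)))  # [1]
```

The open question is cost: validating on every read re-scans the log, which is why caching or indexing comes up below.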
>> However, in ZK-1413, the leader uses the on-disk txnlog to synchronize
>> with a learner. It seems like we have to keep checking txnlog integrity
>> every time we read something from disk. And I don't think the integrity
>> check is cheap, since we have to scan the entire history (starting from
>> a given zxid).
>For the average case, this might not be too bad. If I remember correctly,
>it is possible to calibrate the amount of transactions a server is
>willing to read from disk when deciding whether to send a snapshot.
>> If we cache the txnlog in memory, we only need to do the integrity check
>> once, and we can also build some indexes on top of it to support more
>> efficient lookup. However, this is going to consume a lot of memory.
>Agreed, although I'd rather generate a few numbers before we claim it is
>bad and that we need a cache.
For 1413, the current implementation works fine if the parameters are
configured appropriately. I mentioned caching because other features like
ZK-22 or ZK-1416 might need it. If we ever need to modify the txnlog
facility, we can think of ways to solve the problems for these other
features as well.
>> On the other hand, these features (ZK-1413, ZK-22, ZK-1416) don't really
>> need the entire txnlog to be valid. The server can always tell the
>> client that the history needed to answer the request is too old; this
>> is a fallback mechanism that allows the system to make progress
>> correctly. For example, in ZK-1413, the leader can fall back to sending
>> a snapshot to the learner if it cannot use the txnlog for any reason.
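That fallback amounts to a three-way decision on the learner's zxid. The sketch below is hypothetical; the names and the flat window are illustrative, not ZooKeeper's actual fields, but the shape matches the diff-vs-snapshot choice described above.

```python
# Illustrative sketch of the leader's sync decision: if the learner is
# behind the window of txns the leader is willing to replay from its log,
# fall back to a full snapshot instead of a diff.

def choose_sync(learner_zxid: int, min_log_zxid: int, max_log_zxid: int) -> str:
    if learner_zxid >= max_log_zxid:
        return "UPTODATE"  # nothing to send
    if learner_zxid >= min_log_zxid:
        return "DIFF"      # replay txns in (learner_zxid, max_log_zxid]
    return "SNAP"          # history too old (or unusable): send a snapshot

print(choose_sync(950, 900, 1000))  # DIFF
print(choose_sync(100, 900, 1000))  # SNAP
```

A corrupted or unreadable log can be handled the same way as a learner that is too far behind: shrink the usable window and take the SNAP branch.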
>Sure, this covers some cases, but I don't see how it covers the case
>above. I think it doesn't, right?
>> Thawan Kooburat