The two key points I can extract from this discussion (and please feel free to add to my list) are:
- We can't tolerate arbitrary corruption of log entries. We can tolerate corruption of a log suffix caused by a crash, in which case the txns in that suffix have not been acknowledged.
- Verifying digests is potentially expensive, so we might need to look at ways to avoid the performance penalty, like caching txns in memory (see the sketch below).
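
To make the digest point more concrete, here is a minimal sketch of per-record checksum verification. The record layout and the class are my own assumptions for illustration, not the actual FileTxnLog format; the point is only that every read from disk pays for hashing the payload, which is exactly the cost that caching verified txns in memory would avoid.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.Checksum;

class TxnLogVerifier {
    // Returns the number of records whose checksum verifies, scanning from the
    // start of the stream. A bad record in the middle is the arbitrary
    // corruption we can't tolerate; a bad or truncated suffix after the last
    // valid record maps to txns that were never acknowledged.
    static int countValidRecords(DataInputStream in) throws IOException {
        int valid = 0;
        while (true) {
            long storedChecksum;
            byte[] payload;
            try {
                storedChecksum = in.readLong(); // checksum written by the logger
                int len = in.readInt();         // payload length
                if (len < 0) {
                    break;                      // garbage length: treat as corrupt
                }
                payload = new byte[len];
                in.readFully(payload);
            } catch (EOFException e) {
                break;                          // clean or truncated end of log
            }
            Checksum crc = new Adler32();
            crc.update(payload, 0, payload.length);
            if (crc.getValue() != storedChecksum) {
                break;                          // corrupt record: stop trusting the log here
            }
            valid++;
        }
        return valid;
    }
}

With something like this, re-scanning the history on every learner sync is where the cost shows up, which is why caching and indexing come up further down in the thread.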
One more comment below:
On Jun 4, 2013, at 9:22 PM, Thawan Kooburat <[EMAIL PROTECTED]> wrote:
> On 6/3/13 9:54 AM, "Flavio Junqueira" <[EMAIL PROTECTED]> wrote:
>> On Jun 3, 2013, at 12:41 AM, Thawan Kooburat <[EMAIL PROTECTED]> wrote:
>>> From my understanding, ZooKeeper currently maintains data integrity by
>>> validating all the data before loading it into memory. Disk-related
>>> errors on one of the machines won't affect the correctness of the ensemble
>>> since we are serving client or peer requests from in-memory data only.
>> Let me try to be a bit more concrete. Say that a txn T in a log file is
>> arbitrarily corrupted, and that T has been acknowledged by 3 servers (S1,
>> S2, S3) in an ensemble of 5 servers (S1, S2, S3, S4, S5). Let's assume
>> that S3 is the server with the corrupted copy of T. Next say that S5
>> becomes the leader supported by S3 and S4 (S3 has restarted). We can
>> elect S5 because, once S3 has corrupted T (and ignores any transaction it
>> may have after T), S3's effective history is the same as S5's, which
>> doesn't contain T. If this can happen, then we lose T even though T has
>> been acknowledged by a quorum.
>> In any case, I'm interested in defining precisely what integrity
>> guarantees we provide for txn logs/snapshots. The point I was trying to
>> convey is that we can't tolerate arbitrary corruptions of the txn log. We
>> can only tolerate (and I'm not convinced there is a reason to push it
>> further) corruption of a suffix of the txn log whose txns have not been
>> acknowledged, because the server crashed before they were completely
>> flushed to disk.
> I believe the problem you are describing here is essentially the fact that
> we have more failures than we can tolerate. Ideally, if S1 or S2
> participated in the next round of leader election, S1 or S2 should be
> elected as a leader because they have the highest zxid. S3 has txnlog
> corruption at T, so it should report its zxid as T-1 during leader
> election. Because of how leader election works, a corruption in less than
> a majority should not affect correctness. However, in ZK-1413, ZK-22, and
> ZK-1416, a server uses its local txnlog to respond to a request, so
> servers are vulnerable to a single machine's disk corruption or an
> operator error. However, it won't affect correctness if we can detect the
> corruption correctly.
In the case I described above, there was a corruption in only one server and yet it caused a problematic scenario. I don't think we can claim that we tolerate corruption of a minority; a single corruption might already be problematic.
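
To illustrate why one corruption is already enough, here is a toy sketch of the election step in the scenario above. The code is hypothetical (it is not the actual FastLeaderElection logic): S3 stops at the corrupted T, and when only {S3, S4, S5} vote by highest last zxid, the winner does not have T.

import java.util.Map;

class LostTxnScenario {
    public static void main(String[] args) {
        long zxidBeforeT = 100;  // last txn before T
        long zxidOfT = 101;      // the acknowledged txn whose copy on S3 got corrupted

        // Last zxid each server reports when only {S3, S4, S5} run the election.
        // S3 hit the corruption at T, so it stops replaying its log at T-1 and
        // effectively reports the same history as S4 and S5.
        Map<String, Long> lastZxid = Map.of(
            "S3", zxidBeforeT,
            "S4", zxidBeforeT,
            "S5", zxidBeforeT
        );

        // Elect the server with the highest last zxid (ties broken arbitrarily).
        String leader = lastZxid.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .get()
            .getKey();

        boolean leaderHasT = lastZxid.get(leader) >= zxidOfT;
        System.out.println("leader=" + leader + " hasT=" + leaderHasT);
        // Prints hasT=false: T was acked by a quorum (S1, S2, S3) but is lost.
    }
}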
>>> However, in ZK-1413, the leader uses the on-disk txnlog to synchronize
>>> with a learner. It seems like we have to keep checking txnlog integrity
>>> every time we read something from disk. And I don't think the integrity
>>> check is cheap either, since we have to scan the entire history (starting
>>> from a given zxid).
>> For the average case, this might not be too bad. If I remember correctly,
>> it is possible to calibrate the number of transactions a server is
>> willing to read from disk when deciding whether to send a snapshot.
>>> If we cache the txnlog in memory, we only need to do the integrity check
>>> once, and we can also build some indexes on top of it to support more
>>> efficient lookup. However, this is going to consume a lot of memory.
>> Agreed, although I'd rather generate a few numbers before we claim the
>> cost is too high and that we need a cache.
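
Just to sketch what I'd want to measure before we commit to a cache: something like the following (hypothetical names, not a proposal for the actual patch) is roughly what an in-memory index over verified txns would look like, and its memory cost is essentially the sum of the serialized txn sizes, which we can measure directly.

import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical in-memory cache of already-verified txns, indexed by zxid.
// The checksum is paid once when the txn is appended or loaded; later reads
// (e.g., syncing a learner from a given zxid) never touch the disk log.
class VerifiedTxnCache {
    private final ConcurrentSkipListMap<Long, byte[]> byZxid = new ConcurrentSkipListMap<>();

    void put(long zxid, byte[] serializedTxn) {
        byZxid.put(zxid, serializedTxn);
    }

    // All cached txns with zxid >= fromZxid, in order.
    ConcurrentNavigableMap<Long, byte[]> tailFrom(long fromZxid) {
        return byZxid.tailMap(fromZxid, true);
    }

    // Rough memory footprint of the cached payloads, to help generate numbers.
    long approximateBytes() {
        long total = 0;
        for (byte[] txn : byZxid.values()) {
            total += txn.length;
        }
        return total;
    }
}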
> For 1413, the current implementation works fine if the parameters are