Zookeeper >> mail # dev >> sync semantics

This is an old thread (below), but it doesn't seem like any conclusion
was reached on what we want to do to address the issue.

Reminder of the problem: sync only gives you strong semantics if there
is no leader change. If there is a leader change,
then these semantics are guaranteed only if we make timing
assumptions that are not made elsewhere in ZooKeeper. It would be much
better not to rely on timing assumptions for such safety/consistency
properties, only for liveness.

The problem happens when your leader is no longer the leader but
doesn't know it yet. It responds to a sync, but that doesn't mean
your follower sees all committed state. Some other server may have
already become the leader and committed some updates, which the sync
won't flush to your follower, which is still connected to the old leader.

To prevent this we should broadcast syncs like updates, or
piggyback them on other ops, or perhaps create a new type of sync that
is broadcast.

As Ben pointed out, this problem is also mentioned in Section 4.4 of
the ZooKeeper paper (but the proposed solution there is insufficient
to solve the issue, as discussed below).
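The race described above can be shown in a toy simulation. This is not ZooKeeper code; the classes and field names below are invented for illustration, and "commit" is modeled simply as applying a write on the majority side of a partition:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical, not ZooKeeper internals): each server holds a
// copy of the data tree. The partitioned old leader and its follower never
// receive the update committed by the new leader's quorum.
public class StaleSyncDemo {
    static class Server {
        final Map<String, String> data = new HashMap<>();
    }

    public static void main(String[] args) {
        Server oldLeader = new Server();   // L1, partitioned away with F5
        Server follower  = new Server();   // F5, still connected to L1
        Server newLeader = new Server();   // L2, elected by the majority side

        // Initial state replicated everywhere.
        for (Server s : new Server[]{oldLeader, follower, newLeader}) {
            s.data.put("/config", "v1");
        }

        // The majority side elects L2 and commits an update; the minority
        // side (L1, F5) never hears about it.
        newLeader.data.put("/config", "v2");

        // Client C1 issues sync through F5. L1 still believes it is the
        // leader and acks the sync locally, without contacting a quorum,
        // so the follower serves the subsequent read from stale state.
        String readAfterSync = follower.data.get("/config");

        System.out.println(readAfterSync); // still "v1": sync guaranteed nothing
    }
}
```

The point of the sketch is that nothing in the sync path forced L1 to check it could still reach a quorum, which is exactly the gap a broadcast-style sync would close.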

On Fri, Sep 28, 2012 at 4:45 PM, John Carrino <[EMAIL PROTECTED]> wrote:
> Ben, after thinking about this more. I don't think this solution gets the
> property that I need.  Just because there are outstanding proposals that are
> committed later doesn't imply we are still the leader.  It only means that
> when the new leader does recovery it will also see these proposals as
> committed.
> Let's say we have a 5 node cluster and L1 has one pending request out. F2-F5
> are followers. We get back an ack from F2.  Now F5 and L1 are partitioned
> off from the network along with client C1.
> Recovery happens on F2-F4 and F2 becomes L2.  During recovery this proposal
> is accepted because F2 had acked it.  Now L2 does a bunch of stuff including
> deleting your ephemeral node.
> Now a sync comes in from C1 through F5. Now L1 finally gets that ack from F5
> and goes ahead and commits it and responds to the outstanding sync request
> to C1.
> We can see with this ordering there isn't a happens after relationship
> between the sync request and knowing about all commits that occurred before
> the sync request.
> Yes, I realize that this ordering is unlikely to happen in practice, but I
> hate trusting time for anything.
> -jc
> On Fri, Sep 28, 2012 at 7:31 AM, John Carrino <[EMAIL PROTECTED]>
> wrote:
>> This seems like a good compromise.  We still have to eat the latency of a
>> write, but we easily achieve smart batching in this case so many outstanding
>> syncs can all be serviced by the same lastPending request.
>> -jc
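The smart batching jc mentions can be sketched as follows. This is a hypothetical illustration, not ZooKeeper internals: every sync that arrives while one pending proposal is in flight queues behind it, and a single commit services the whole batch:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch (invented names, not ZooKeeper code): syncs arriving while a
// proposal is outstanding all piggyback on that one write.
public class SyncBatchingDemo {
    private final List<Runnable> waitingSyncs = new ArrayList<>();
    private boolean proposalInFlight = false;

    void sync(Runnable onComplete) {
        if (!proposalInFlight) {
            proposalInFlight = true;  // first sync triggers the write
        }
        waitingSyncs.add(onComplete); // later syncs reuse the same write
    }

    void commitArrived() {
        proposalInFlight = false;
        List<Runnable> batch = new ArrayList<>(waitingSyncs);
        waitingSyncs.clear();
        batch.forEach(Runnable::run); // one commit services every queued sync
    }

    public static void main(String[] args) {
        SyncBatchingDemo leader = new SyncBatchingDemo();
        int[] completed = {0};
        for (int i = 0; i < 5; i++) {
            leader.sync(() -> completed[0]++);
        }
        leader.commitArrived();
        System.out.println(completed[0]); // all 5 syncs done with a single write
    }
}
```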
>> On Thu, Sep 27, 2012 at 11:17 PM, Benjamin Reed <[EMAIL PROTECTED]>
>> wrote:
>>> there is a very easy solution to this. we only rely on clocks in the
>>> case that there are no pending transactions. (if there are pending
>>> transactions, the sync will only return if in fact the leader is still
>>> the leader, otherwise the transaction that the sync is waiting on will
>>> never commit and the sync will never return.)
>>> so, if there aren't any transactions, just submit one. make it a bogus
>>> one: create / for example. then queue the sync behind it.
>>> ben
>>> ps - we bring up this issue, the solution, and the rationale for the
>>> current implementation in section 4.4 of the zookeeper usenix paper.
>>> On Thu, Sep 27, 2012 at 9:57 AM, John Carrino <[EMAIL PROTECTED]>
>>> wrote:
>>> > So I think it's time to explain what I'm writing just so everyone has
>>> > more situational awareness.  It's just a timestamp server, nothing
>>> > fancy.
>>> >
>>> > Looks like this:
>>> >
>>> > public interface TimestampService {
>>> >     /**
>>> >      * This will get a fresh timestamp that is guaranteed to be newer
>>> >      * than any other timestamp handed out before this method was
>>> >      * called.
>>> >      */
>>> >     long getFreshTimestamp();