Zookeeper >> mail # dev >> Re: [jira] [Commented] (ZOOKEEPER-1863) Race condition in commit processor leading to out of order request completion, xid mismatch on client.


Actually, that probably accounts for both places. It sounds like you already
know this, Thawan; the question is why we process a commit that we're not
waiting on. When should we do that?
On Jan 26, 2014 11:28 AM, [EMAIL PROTECTED] wrote:

I think I've identified the problem.
In CommitProcessor line 203, we set nextPending to null. But we never
set it to null in the else case below on lines 205-210. It doesn't appear
to be cleared anywhere else in the file, so the processor thinks it is
always waiting for a commit and will never continue.
That seems to match the stack anyway. Not sure how this causes the xid
mismatch yet though.
On Jan 24, 2014 10:35 PM, "Thawan Kooburat (JIRA)" <[EMAIL PROTECTED]> wrote:
    [
https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881636#comment-13881636]

Thawan Kooburat commented on ZOOKEEPER-1863:

I have seen a CommitProcessor getting stuck in our prod (which runs our
internal branch). I spent a few days digging into the problem but couldn't
locate the root cause.

The sequence of actions that you put in the description is very unlikely to
occur in quorum mode.  First, the Follower/ObserverRequestProcessor, which
sits in front of the CommitProcessor, puts a request into queuedRequests even
before sending it out to the leader.  It takes at least a network round
trip (or a full quorum vote) before the same request comes back from the
leader and gets put into committedRequests.  This is an assumption that even
the original CommitProcessor (prior to ZOOKEEPER-1505) relies on. However, a
combination of bad thread scheduling and a long GC pause might break this
assumption.

A sync request is special: unlike other write requests, it doesn't
require quorum voting. I still don't think that matters in this case.

Again, I saw this in prod but I am unable to repro it. I did add a
background thread that detects a request stuck in nextPending for an extended
period of time and kills the server if that is the case.  I can post the patch
if we are unable to find the root cause.
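
A minimal sketch of what such a watchdog could look like, for illustration
only. This is not the patch mentioned above: the class name PendingWatchdog,
its methods, and the onStuck callback are all hypothetical, and the sketch
runs a callback instead of killing the server.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical watchdog sketch; names and structure are assumptions, not ZooKeeper code. */
final class PendingWatchdog {
    /** The pending request (a label here) and the time it became pending. */
    private static final class Entry {
        final String request;
        final long sinceMillis;
        Entry(String request, long sinceMillis) {
            this.request = request;
            this.sinceMillis = sinceMillis;
        }
    }

    private final AtomicReference<Entry> pending = new AtomicReference<>();
    private final long maxPendingMillis;
    private final Runnable onStuck; // e.g. log and shut the server down

    PendingWatchdog(long maxPendingMillis, Runnable onStuck) {
        this.maxPendingMillis = maxPendingMillis;
        this.onStuck = onStuck;
    }

    /** Called by the processor when a request starts waiting for its commit. */
    void setPending(String request, long nowMillis) {
        pending.set(new Entry(request, nowMillis));
    }

    /** Called by the processor when the awaited commit has been handled. */
    void clearPending() {
        pending.set(null);
    }

    /** Runs onStuck and returns true if a request has been pending for too long. */
    boolean check(long nowMillis) {
        Entry e = pending.get();
        if (e != null && nowMillis - e.sinceMillis > maxPendingMillis) {
            onStuck.run();
            return true;
        }
        return false;
    }

    /** Starts a daemon thread that polls check() until it fires or is interrupted. */
    Thread start(long intervalMillis) {
        Thread t = new Thread(() -> {
            try {
                while (!check(System.currentTimeMillis())) {
                    Thread.sleep(intervalMillis);
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }, "pending-watchdog");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

The check takes the current time as a parameter so that the stuck-detection
logic can be exercised without waiting on a real clock; the polling thread
simply passes System.currentTimeMillis().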

You can also capture a heap dump of the server to inspect which request got
stuck (at nextPending) and correlate it with the possible events.
completion, xid mismatch on client.
handler on line 167:
completed on another thread. If queuedRequests is empty, we then exit the
block. Next, before this thread makes any more progress, we can get 2 more
requests, one get_children(say), and a sync placed on queuedRequests for
the processor. Then, if we are very unlucky, the sync request can complete
and this object's commit() routine is called (from
FollowerZookeeperServer), which places the sync request on the previously
empty committedRequests queue. At that point, this thread continues.
completed. We haven't dequeued either the read or the sync request in this
processor. Next, the poll above will pull the sync request off the queue,
and in the following block, the sync will get forwarded to the next
processor.
requests are now out of order.
Thread.sleep(5000) between the two blocks above to make the race condition
far more likely, then in a client program.
mismatches.
can orphan outstanding requests.
be completed, which can wedge the commit processor.
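
As far as the truncated description quoted above can be read, the interleaving
can be replayed deterministically in a toy, single-threaded model. The sketch
below is not the CommitProcessor source: the class CommitRaceModel, its two
step methods, and the use of plain strings as requests are assumptions made
for illustration. Only the names queuedRequests, committedRequests and
nextPending come from the thread.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of the reported interleaving; hypothetical, not ZooKeeper code. */
final class CommitRaceModel {
    final Queue<String> queuedRequests = new ArrayDeque<>();
    final Queue<String> committedRequests = new ArrayDeque<>();
    final List<String> forwarded = new ArrayList<>(); // order seen by the next processor
    String nextPending; // request waiting for its commit, or null

    /** In this model only "sync" has to wait for a commit. */
    static boolean needsCommit(String request) {
        return request.equals("sync");
    }

    /** First block: drain queued requests until one has to wait for a commit. */
    void drainQueued() {
        while (nextPending == null && !queuedRequests.isEmpty()) {
            String r = queuedRequests.poll();
            if (needsCommit(r)) {
                nextPending = r;
            } else {
                forwarded.add(r);
            }
        }
    }

    /** Second block: handle one committed request, if there is one. */
    void handleCommit() {
        String c = committedRequests.poll();
        if (c == null) {
            return;
        }
        if (c.equals(nextPending)) {
            nextPending = null; // the commit we were waiting on
        }
        // Otherwise the commit is forwarded even though nothing was waiting on it.
        forwarded.add(c);
    }

    /** Replays the unlucky schedule step by step. */
    static CommitRaceModel replay() {
        CommitRaceModel p = new CommitRaceModel();
        p.drainQueued();                      // queue is empty, thread leaves the block
        p.queuedRequests.add("getChildren");  // arrives while the thread is paused
        p.queuedRequests.add("sync");
        p.committedRequests.add("sync");      // commit() called from another thread
        p.handleCommit();                     // thread resumes: sync is forwarded first
        p.drainQueued();                      // read is forwarded, sync becomes nextPending
        return p;
    }
}
```

In this model the replay ends with the sync forwarded ahead of the read, and
with nextPending still holding the sync whose commit has already been
consumed, so the model would wait forever. That matches both symptoms
discussed in the thread, under the assumptions stated above.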
