Multiop submitted to non-leader fails
I'm trying to debug a really obscure problem we've been seeing while
integrating multi-op into our code base.

After a lot of debugging, I've narrowed the failure down to the case where
the multi-op is submitted through any non-leader. The symptom is that the
multi-op, no matter how small (in this case it contains a single op that
creates one znode), always times out. When I submit it to a non-leader, I
see a few key messages in the log files:

2011-07-12 13:20:59,168 - WARN  [LearnerHandler-/127.0.0.7:53952:Leader@507][] - *Commiting zxid 0x100000000 from /127.0.0.7:2182 not first!*
2011-07-12 13:20:59,169 - WARN  [LearnerHandler-/127.0.0.7:53952:Leader@509][] - *First is 0*
2011-07-12 13:20:59,169 - INFO  [LearnerHandler-/127.0.0.7:53952:Leader@533][] - Have quorum of supporters; starting up and setting last processed zxid: 4294967296

2011-07-12 13:20:59,379 - INFO  [NIOServerCxn.Factory:/127.0.0.5:2181:NIOServerCnxnFactory@197][] - Accepted socket connection from /127.0.0.5:53553

2011-07-12 13:20:59,381 - WARN  [NIOServerCxn.Factory:/127.0.0.5:2181:ZooKeeperServer@807][] - Connection request from old client /127.0.0.5:53553; will be dropped if server is in r-o mode

2011-07-12 13:20:59,381 - INFO  [NIOServerCxn.Factory:/127.0.0.5:2181:ZooKeeperServer@853][] - Client attempting to establish new session at /127.0.0.5:53553

2011-07-12 13:20:59,385 - INFO  [SyncThread:3:FileTxnLog@195][] - Creating new log file: log.100000001
2011-07-12 13:20:59,387 - WARN  [QuorumPeer[myid=1]/127.0.0.5:2181:Follower@118][] - Got zxid 0x100000001 expected 0x1
2011-07-12 13:20:59,387 - INFO  [SyncThread:1:FileTxnLog@195][] - Creating new log file: log.100000001
2011-07-12 13:20:59,387 - WARN  [QuorumPeer[myid=2]/127.0.0.6:2181:Follower@118][] - Got zxid 0x100000001 expected 0x1

If I submit the exact same multi-op to the leader rather than to a
follower, it succeeds every time.
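
For anyone reading the zxid values in the log above: a ZooKeeper zxid is a
64-bit number with the epoch in the high 32 bits and a per-epoch counter in
the low 32 bits. The small sketch below only splits the values that appear in
the log into those two parts; the class and method names (ZxidParts, epoch,
counter, describe) are made up for illustration and are not ZooKeeper API.
It is meant as a reading aid, not a diagnosis of the timeout.

```java
public class ZxidParts {
    // Hypothetical helper: the epoch is the high 32 bits of the zxid.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Hypothetical helper: the counter is the low 32 bits of the zxid.
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    static String describe(long zxid) {
        return String.format("0x%x -> epoch=%d counter=%d",
                zxid, epoch(zxid), counter(zxid));
    }

    public static void main(String[] args) {
        // Values copied from the log excerpt; 4294967296 is 0x100000000.
        long[] fromLog = {0x100000000L, 4294967296L, 0x100000001L, 0x1L};
        for (long zxid : fromLog) {
            System.out.println(describe(zxid));
        }
    }
}
```

Read this way, "Got zxid 0x100000001 expected 0x1" shows a value with epoch 1
and counter 1 arriving where a value with epoch 0 and counter 1 was expected,
so the two sides appear to differ only in the epoch half.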

I suspect there is a bug in how a multi-op is forwarded between a follower
and the leader (or vice versa). I'm going to be digging into this tonight,
but if anyone has any pointers on where to look, I'd be super appreciative.

Thanks,
Marshall