Hi,
Here is what I'm doing:
NN1 (active) + ZKFC1
NN2 (standby) + ZKFC2
First I stop the ZKFC1 service =>
NN1 (standby)
NN2 (active) + ZKFC2
Then I kill the active node: kill -9 on the NN2 process.
NN1 stays in standby.
ZKFC2 log:
2012-11-22 22:23:40,073 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Checking for any old active which needs to be fenced...
2012-11-22 22:23:40,081 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old
node exists:
0a096d79636c757374657212036e6e321a106e733233363833342e6f76682e6e657420d43e28d33e
2012-11-22 22:23:40,082 INFO org.apache.hadoop.ha.ZKFailoverController:
Should fence: NameNode at /nn2:8020
2012-11-22 22:23:40,205 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at /nn2:8020 to standby state without
fencing
2012-11-22 22:23:40,205 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the
local node is the most recent active...
2012-11-22 22:23:40,233 INFO org.apache.hadoop.ha.ZKFailoverController:
Trying to make NameNode at xxxx/nn1:8020 active...
2012-11-22 22:23:40,605 INFO org.apache.hadoop.ha.ZKFailoverController:
Successfully transitioned NameNode at xxxx/nn1:8020 to active state
2012-11-22 22:24:14,073 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
xxxx/nn1:8020: Failed on local exception: java.io.IOException: Response is
null.; Host Details : local host is: "xxxx/nn1"; destination host is:
"xxxx":8020;
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.HealthMonitor: Entering
state SERVICE_NOT_RESPONDING
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.ZKFailoverController:
Local service NameNode at xxxx/nn1:8020 entered state:
SERVICE_NOT_RESPONDING
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.ZKFailoverController:
Quitting master election for NameNode at xxxx/nn1:8020 and marking that
fencing is necessary
2012-11-22 22:24:14,074 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Yielding from election
2012-11-22 22:24:14,128 INFO org.apache.zookeeper.ZooKeeper: Session:
0x23b29574aed0014 closed
2012-11-22 22:24:14,128 WARN org.apache.hadoop.ha.ActiveStandbyElector:
Ignoring stale result from old client with sessionId 0x23b29574aed0014
2012-11-22 22:24:14,128 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2012-11-22 22:24:16,129 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:16,130 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection
exception: java.net.ConnectException: Connection refused; For more details
see:
http://wiki.apache.org/hadoop/ConnectionRefused
2012-11-22 22:24:18,131 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:18,131 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection
exception: java.net.ConnectException: Connection refused; For more details
see:
http://wiki.apache.org/hadoop/ConnectionRefused
2012-11-22 22:24:20,133 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:20,133 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection
exception: java.net.ConnectException: Connection refused; For more details
see:
http://wiki.apache.org/hadoop/ConnectionRefused
2012-11-22 22:24:22,135 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: xxxx/nn1:8020. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1 SECONDS)
2012-11-22 22:24:22,136 WARN org.apache.hadoop.ha.HealthMonitor:
Transport-level exception trying to monitor health of NameNode at
xxxx/nn1:8020: Call From xxxx/nn1 to xxxx:8020 failed on connection
exception: java.net.ConnectException: Connection refused; For more details
see:
http://wiki.apache.org/hadoop/ConnectionRefused
...
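
Side note on the "Old node exists" hex string in the ZKFC log above: it looks like a protobuf-encoded record describing the previous active NameNode. Below is a minimal Python sketch that decodes it by hand, so it is easy to check which node ZKFC thinks it has to fence. The field meanings in FIELD_NAMES (nameservice, NameNode ID, hostname, RPC port, ZKFC port) are my assumption about the breadcrumb layout, not something the log itself states, and the helper names are only illustrative.

```python
# Minimal protobuf wire-format reader: handles only varint (type 0) and
# length-delimited (type 2) fields, which is all this breadcrumb contains.

def read_varint(buf, pos):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_fields(buf):
    """Return {field_number: value} for a flat protobuf message."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:            # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:          # length-delimited, read here as UTF-8 text
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        fields[number] = value
    return fields


# Hex string copied from the "Old node exists" log line.
BREADCRUMB_HEX = (
    "0a096d79636c757374657212036e6e321a106e733233363833342e6f76682e6e6574"
    "20d43e28d33e"
)

# ASSUMPTION: field numbers mapped to these meanings; not confirmed by the log.
FIELD_NAMES = {1: "nameserviceId", 2: "namenodeId", 3: "hostname",
               4: "port", 5: "zkfcPort"}

fields = decode_fields(bytes.fromhex(BREADCRUMB_HEX))
for number in sorted(fields):
    print(FIELD_NAMES.get(number, "field %d" % number), "=", fields[number])
```

If the assumed layout is right, the second field is the NameNode ID of the old active, which can be compared against the "Should fence" line that follows it in the log.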
NN1 logs:
2012-11-22 22:23:40,109 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services
started for active state
2012-11-22 22:23:40,109 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 166
2012-11-22 22:23:40,110 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2
Total time for transactions(ms): 0Number of transactions batched in Syncs:
0 Number of syncs: 1 SyncTimes(ms): 32 125
2012-11-22 22:23:40,182 INFO
org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2
Total time for transactions(ms): 0Number of transactions batched in Syncs:
0 Number of syncs: 2 SyncTimes(ms): 85 144
2012-11-22 22:23:40,196 INFO
org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits
file /home/hdfs/dfs/name/current/edits_inprogress_0000000000000000166 ->
/home/hdfs/dfs/name/current/edits_0000000000000000166-0000000000000000167
2012-11-22 22:23:40,196 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services
required for standby state
2012-11-22 22:23:40,198 INFO
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Will roll logs on
active node at /nn2:8020 every 120 seconds.
2012-11-22 22:23:40,199 INFO
org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Starting
standby checkpoint thread...
Checkpointing active NN at nn2:50070
Serving checkpoints at xxxx/nn1:50070
2012-11-22 22:25:40,235 INFO
org.apache.hadoop.h