confirm expected HA behavior
Arpit Gupta 2014-02-05, 22:28
I have a scenario where I am trying to test how HDFS HA behaves when there are network issues. To simulate that, I used iptables to block requests to the RPC port 8020. Below is some info on what I did.
NN1 - Active
NN2 - Standby
Using iptables, block port 8020 on NN1 (http://stackoverflow.com/questions/7423309/iptables-block-access-to-port-8000-except-from-ip-address)
iptables -A INPUT -p tcp --dport 8020 -j DROP
NN2 transitions to active.
Run the following command to allow requests to port 8020 (http://stackoverflow.com/questions/10197405/iptables-remove-specific-rules)
iptables -D INPUT -p tcp --dport 8020 -j DROP
After this, NN1 shut itself down with:
2014-02-05 01:00:38,030 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(354)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [IP:8485], stream=QuorumOutputStream starting at txid 568))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 1/1. 1 exceptions thrown:
126.96.36.199:8485: IPC's epoch 1 is less than the last promised epoch 2
In this case NN1 shuts down with the above exception because it still believes it is active, so its writes to the JNs are rejected. The operators would then have to restart NN1, which could take a while depending on the image size. Hence I was wondering whether there is a better way to handle this case, for example by transitioning NN1 to standby when exceptions like the above are seen.
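To make the epoch check behind that exception concrete, here is a minimal Python sketch of the fencing logic as I understand it. It is a toy model, not Hadoop code: the class and method names (JournalNode, new_epoch, journal, simulate) are made up for illustration, and it models a single JN as in the 1/1 quorum from the log.

```python
class StaleEpochError(Exception):
    """Raised when a writer presents an epoch older than the promised one."""


class JournalNode:
    """Toy model of one JournalNode's epoch check (hypothetical, not the Hadoop class)."""

    def __init__(self):
        self.last_promised_epoch = 0

    def new_epoch(self, epoch):
        # A NameNode that is becoming active must claim a strictly higher epoch.
        if epoch <= self.last_promised_epoch:
            raise StaleEpochError(
                f"Proposed epoch {epoch} <= last promise {self.last_promised_epoch}"
            )
        self.last_promised_epoch = epoch

    def journal(self, epoch, txid):
        # Writes from a writer with an older epoch are rejected (fencing).
        if epoch < self.last_promised_epoch:
            raise StaleEpochError(
                f"IPC's epoch {epoch} is less than the last promised epoch "
                f"{self.last_promised_epoch}"
            )
        return txid


def simulate():
    jn = JournalNode()
    jn.new_epoch(1)      # NN1 becomes active with epoch 1
    jn.journal(1, 567)   # NN1 writes normally
    jn.new_epoch(2)      # NN2 takes over while NN1 is unreachable on 8020
    try:
        jn.journal(1, 568)  # NN1 still thinks it is active and tries to flush
    except StaleEpochError as err:
        return str(err)
    return None


print(simulate())
```

In this model the old writer has no way to recover in place: its epoch stays stale until it goes through a fresh transition, which is why the question is whether NN1 could drop to standby instead of exiting.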
I wanted to get others' thoughts before I open an enhancement request.