i have been playing with high availability using journalnodes and 2 masters
both running namenode and hbase master.
when i kill the namenode and hbase-master processes on the active master,
the failover is perfect. hbase never stops and a running map-reduce job
keeps going. this is impressive!
however when instead of killing the processes i kill the entire active
master machine, the transition is less smooth and can take a long time,
at least it seems this way in the logs. this is because ssh fencing fails
but keeps trying. my fencing is configured as:
it is unclear to me if the transition in this case is also rapid but the
fencing takes long while the new namenode is already active, or if in this
period i am stuck without an active namenode. it is hard to accurately test
this in my setup.
is this supposed to take this long? is HDFS writable in this period? and is
hbase supposed to survive this long transition?
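
to get a rough measurement of how long there is no active namenode, a probe along these lines might help. this is only a sketch under assumptions: it polls `hdfs haadmin -getServiceState` once per interval from a third machine, the namenode IDs (for example `nn1` and `nn2`) are placeholders for whatever is listed in `dfs.ha.namenodes.<nameservice>`, and the function names are made up for illustration. it only reports the HA state, so it does not prove that writes succeed.

```python
import subprocess
import time


def get_state(nn_id):
    """Return 'active', 'standby', or 'unreachable' for one NameNode ID."""
    try:
        out = subprocess.run(
            ["hdfs", "haadmin", "-getServiceState", nn_id],
            capture_output=True, text=True, timeout=10,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A dead machine usually shows up as a timeout or connection error.
        return "unreachable"
    state = out.stdout.strip().lower()
    if out.returncode == 0 and state in ("active", "standby"):
        return state
    return "unreachable"


def probe(nn_ids, duration_s, interval_s=1.0):
    """Sample all NameNodes; return a list of (timestamp, any_active) pairs."""
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        now = time.time()
        any_active = any(get_state(n) == "active" for n in nn_ids)
        samples.append((now, any_active))
        time.sleep(interval_s)
    return samples


def longest_gap(samples):
    """Longest run, in seconds, from the first to the last consecutive
    sample in which no NameNode reported 'active' (a lower bound)."""
    longest = 0.0
    gap_start = None
    for ts, active in samples:
        if active:
            gap_start = None
            continue
        if gap_start is None:
            gap_start = ts
        longest = max(longest, ts - gap_start)
    return longest
```

the idea would be to start `probe(["nn1", "nn2"], duration_s=600)` just before powering off the active master and then pass the result to `longest_gap`. the resolution is coarse, since a call against the dead machine can block for up to the 10 second timeout, so the result is an estimate rather than an exact figure.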