Kafka 0.8 Failover Behavior
I wanted to send this out because we saw this in some testing we were doing and wanted to advise the community of something to watch for in 0.8 HA support.
We have a two machine cluster with replication factor 2. We took one machine offline and re-formatted the disk. We re-installed the Kafka software, but did not recreate any of the local disk files. The intention was to simply re-start the broker process, but due to an error in the network config that took some time to diagnose, we ended up with the both machines' brokers down.
When we fixed the network config and restarted the brokers, we happened to start the broker on the rebuilt machine first. The net result was when the healthy broker came back online, the rebuilt machine was already the leader and because of the Zookeeper state, it force the healthy broker to delete all of its topic data, thus wiping out the entire contents of the cluster.
We are instituting operations procedures to safeguard against this scenario in the future (and fortunately we only blew away a test cluster), but this was a bit of a nasty surprise for a Friday.