I've been researching Kafka for our requirements and am trying to figure
out the best way to implement multi-region failover (lowest complexity).
One requirement we have is that the offsets of the backup must match the
primary. As I understand it, MirrorMaker does not (currently) guarantee
that the target Kafka instance will have the same log offsets as the source
Kafka instance. Our message processing pipeline will be strictly relying
on topic-broker-partition-offset to avoid re-processing messages.
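To make the offset dependency concrete, here is a minimal sketch of the kind of check the pipeline would rely on. The class and method names are hypothetical (not part of any Kafka client), and it keys on topic and partition only; adding a broker id to the key would be a small change. If the backup assigned different offsets to the same messages, a check like this would either skip new messages or re-process old ones.

```python
from typing import Dict, Tuple


class OffsetTracker:
    """Hypothetical helper: remembers the highest processed offset
    for each (topic, partition) pair."""

    def __init__(self) -> None:
        self._last: Dict[Tuple[str, int], int] = {}

    def should_process(self, topic: str, partition: int, offset: int) -> bool:
        # A message is new only if its offset is above the last one recorded.
        return offset > self._last.get((topic, partition), -1)

    def mark_processed(self, topic: str, partition: int, offset: int) -> None:
        key = (topic, partition)
        # Never move the recorded offset backwards.
        self._last[key] = max(offset, self._last.get(key, -1))


tracker = OffsetTracker()
if tracker.should_process("events", 0, 42):
    tracker.mark_processed("events", 0, 42)
```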
Here's what I'm leaning towards; please share any criticism or thoughts:
- Two regions, Region1 (primary) and Region2 (backup)
- Region2 must have the same offsets as Region1 for every topic-broker-partition
- A few minutes of lost messages can be tolerated if Region1 is ever lost.
- It would be a mistake to attempt Kafka replication across regions
and maintain a Zookeeper cluster across regions (because they weren't
designed for the higher latency and link-loss issues, and there could
be operational edge-case bugs we won't catch or understand, etc.)
- Region1 has multiple topics, brokers, partitions, replicas and a
Zookeeper cluster. Only Region1 is in use operationally (gets all producer
and consumer traffic).
- Region2 has the same configuration but receives no operational
traffic (no producers, no consumers); it gets a periodic rsync from Region1
- If Region1 is lost, we will start Kafka in Region2; it should start up at
the appropriate offset (from the last rsync backup). Producers will be
instructed to use Region2.
- Region2 is now the new primary Kafka instance until we decide to switch
back to Region1.
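Before starting Kafka in Region2, it might be worth checking that the last rsync left the copy complete. Below is a rough sketch, assuming each partition directory holds segment files that share a base name with .log and .index extensions; the function name and the directory argument are made up for illustration. It only detects a missing .index file, not a partially copied one.

```python
import os
from typing import List


def logs_missing_index(log_dir: str) -> List[str]:
    """Return paths of .log segment files with no sibling .index file."""
    missing = []
    for root, _dirs, files in os.walk(log_dir):
        names = set(files)
        for name in names:
            if name.endswith(".log"):
                # Assumption: the index shares the segment's base name.
                index_name = name[: -len(".log")] + ".index"
                if index_name not in names:
                    missing.append(os.path.join(root, name))
    return sorted(missing)
```

An empty result would suggest every copied segment at least has its index present; anything listed would need a closer look before producers are pointed at Region2.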
This is quite simple and there is more data loss than I'd like, but the
loss would be acceptable for our use case, considering the loss of Region1
should be a rare event (if ever).
1. Do you see any pitfalls or better ways to proceed? It seems this Kafka
feature request (adding a MirrorMaker mode to maintain offsets,
https://issues.apache.org/jira/browse/KAFKA-658) would be a better
solution one day.
2. What if the rsync backup is interrupted when Region1 is lost? Is there
a possibility the 2nd Kafka instance could be left in an unworkable
state? For example, if a .log file is copied, but the corresponding .index
is not completed. Can the .index file be re-created? It appears it can in