Kafka, mail # user - Complex multi-datacenter setups - 2013-07-12, 00:19
Complex multi-datacenter setups
Hi all,

I was wondering if anybody here has, and would be willing to share,
experience designing and operating complex multi-datacenter/multi-cluster
Kafka deployments in which data must flow to and from several distinct
Kafka clusters, with more complex semantics than what MirrorMaker supports.

The general, very sensible consensus is that producers of data should
publish to a local Kafka cluster. But if that data is produced in
multiple datacenters, and must be consumed in multiple datacenters as well,
then you need to implement data routing and filtering to organise your

Imagine the following scenario, with three datacenters A, B and C. Data
is being produced (of the same kind, to the same topic) in all three
datacenters. Both datacenters A and B have consumers that want all the
data generated in all three datacenters, but C is only interested in a
subset of what is produced in A and B (according to some specific
filters for example).
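
As a rough illustration of the scenario above, the six directed links
between A, B and C can be written down as a small routing table. This is
only a sketch: the `ROUTES` table, the `destinations` helper and the
"severity" filter used for C are hypothetical names and rules invented
for the example, not something Kafka or MirrorMaker provides.

```python
from typing import Callable, Dict, List, Tuple

Predicate = Callable[[dict], bool]


def accept_all(message: dict) -> bool:
    """A and B want everything produced in the other datacenters."""
    return True


def c_subset(message: dict) -> bool:
    """Hypothetical filter: assume C only wants messages marked critical."""
    return message.get("severity") == "critical"


# (source datacenter, destination datacenter) -> filter applied on that link.
# There is no entry for (X, X): producers publish to their local cluster.
ROUTES: Dict[Tuple[str, str], Predicate] = {
    ("A", "B"): accept_all,
    ("B", "A"): accept_all,
    ("C", "A"): accept_all,
    ("C", "B"): accept_all,
    ("A", "C"): c_subset,
    ("B", "C"): c_subset,
}


def destinations(source: str, message: dict) -> List[str]:
    """Return the remote datacenters that should receive this message."""
    return sorted(
        dst
        for (src, dst), wanted in ROUTES.items()
        if src == source and wanted(message)
    )
```

With this table, a non-critical message produced in A is routed to B
only, while a critical one is routed to both B and C.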

This means you have data flowing in both directions between each pair of
datacenters. You need some kind of source-based filtering to prevent data
going back and forth ad vitam aeternam, as well as something to implement
the custom filtering logic where needed, which also means you'd need to
wrap all data in an envelope object that records where the data
was published from.
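
A minimal sketch of what such an envelope and source-based filter might
look like, assuming a JSON encoding and a single `origin` field. The
`Envelope` type and the `wrap`, `unwrap` and `should_forward` helpers are
hypothetical names made up for the example. The rule shown is the
simplest one: a mirror reading from a cluster only forwards messages that
were originally published in that cluster's datacenter, so mirrored
copies are never mirrored again.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Envelope:
    origin: str   # datacenter where the message was first published
    payload: str  # the application data, opaque to the mirroring layer


def wrap(origin: str, payload: str) -> bytes:
    """Serialize a payload together with its origin datacenter."""
    return json.dumps(asdict(Envelope(origin, payload))).encode("utf-8")


def unwrap(raw: bytes) -> Envelope:
    """Parse bytes produced by wrap() back into an Envelope."""
    return Envelope(**json.loads(raw.decode("utf-8")))


def should_forward(raw: bytes, local_dc: str) -> bool:
    """Forward only messages that originated in the local datacenter.

    A message published in A and mirrored into B carries origin "A", so
    the mirrors reading from B's cluster skip it and it cannot loop.
    """
    return unwrap(raw).origin == local_dc
```

Any additional per-destination filtering (such as the subset wanted by C)
would be applied after this check, on the unwrapped payload.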

Is this kind of deployment pretty common in the industry/among the users
of Kafka? I haven't found much online that would help with putting together
this type of architecture. Is it basically roll-your-own with something
similar to the MirrorMaker that has a consumer, a filtering component and
a producer, with a couple of these placed in each direction between each
pair of clusters?
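
The consumer/filter/producer idea could be sketched roughly as follows.
This is an in-memory illustration only: `relay` is a hypothetical helper,
plain Python lists stand in for the source and destination clusters, and
a real deployment would plug a Kafka consumer and a Kafka producer in
place of the iterable and the callback.

```python
from typing import Callable, Iterable


def relay(
    consume: Iterable[dict],
    keep: Callable[[dict], bool],
    produce: Callable[[dict], None],
) -> int:
    """Read messages from one side, filter them, publish to the other.

    Returns the number of messages that were forwarded.
    """
    forwarded = 0
    for message in consume:
        if keep(message):
            produce(message)
            forwarded += 1
    return forwarded


# Stand-ins for two clusters; cluster_a already holds one local message
# and one message that was mirrored in from B earlier.
cluster_a = [
    {"origin": "A", "value": 1},
    {"origin": "B", "value": 2},
]
cluster_b: list = []

# One relay for the A -> B direction: forward only what originated in A.
count = relay(cluster_a, lambda m: m["origin"] == "A", cluster_b.append)
```

One such relay would run per direction between each pair of clusters,
each with its own filter.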

It ultimately boils down to pretty simple "routing" of data, just in a
more complex manner than having all data flow to a single sink location.
Let me know what you folks think!

Maxime Petazzoni
Sr. Platform Engineer
m 408.310.0595

Jun Rao 2013-07-12, 04:11
Maxime Petazzoni 2013-07-12, 16:30
Jun Rao 2013-07-12, 16:37