We are using MirrorMaker to replicate data between two Kafka clusters. I am seeing a huge difference in the size of the logs in the data dir between a broker in the source cluster and a broker in the destination cluster:
For example, the size of ~/data/Topic-0/ is about 910 GB on the source broker, but only 25 GB on the destination broker. Segment log files (~500 MB) are created about every 2 or 3 minutes on the source brokers, but only about every 25 minutes on the destination broker.
I verified that MirrorMaker is doing fine using the consumer offset checker: not much lag, and offsets are incrementing. I also verified that topics/partitions are not under-replicated in either the source or the target cluster. What is the reason for this difference in disk usage? Thanks, Raja.
Ah, one thing to be aware of is that the effectiveness of compression is directly related to the producer batch size--more batching, more compression. So even if you use compression on both clusters the mirror may be much smaller.
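To illustrate the batching point, here is a rough sketch rather than anything Kafka-specific. It compresses the same set of messages in batches of different sizes and compares the total compressed bytes. The message layout, the `make_messages` and `compressed_size` helpers, and the use of Python's `zlib` as a stand-in for the producer's compression codec are all assumptions made for illustration; real Kafka message sets carry extra framing that is ignored here.

```python
import json
import zlib


def make_messages(n):
    """Build n small, similar-looking JSON messages (hypothetical payload)."""
    return [
        json.dumps(
            {"id": i, "host": f"web-{i % 10}", "status": "ok", "path": "/api/v1/items"}
        ).encode("utf-8")
        for i in range(n)
    ]


def compressed_size(messages, batch_size):
    """Total bytes after compressing the messages batch_size at a time.

    Each batch is concatenated and compressed as one unit, roughly the way
    a producer compresses a whole batch rather than individual messages.
    """
    total = 0
    for start in range(0, len(messages), batch_size):
        batch = b"".join(messages[start:start + batch_size])
        total += len(zlib.compress(batch))
    return total


messages = make_messages(2000)
raw_size = sum(len(m) for m in messages)

print(f"raw bytes: {raw_size}")
for batch_size in (1, 10, 200):
    size = compressed_size(messages, batch_size)
    print(f"batch size {batch_size:>3}: {size} compressed bytes")
```

Larger batches give the compressor more repeated content to work with, so the total shrinks as the batch size grows. If the mirror's producer happens to batch more aggressively than the original producers, the same data can take far less disk on the destination.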
On Friday, August 23, 2013, Rajasekar Elango wrote:
We are currently working on the following JIRA to avoid decompressing and re-compressing messages at MirrorMaker; once this is done, the size of the logs on the source and target clusters should be the same, as long as the batch size of the MirrorMaker producer is the same as the batch size of the source producer: