I'm running a Kafka 0.8.0 cluster of 3 servers. The service it backs is not in full production yet, so the data written to the cluster is minimal (it seems to average between 100 kb/s and 300 kb/s per server). I have configured Kafka with 3 replicas. I am noticing that each Kafka server is talking to each of the others at a data rate of 40MB/s (so a total of 80MB/s per server). This communication is constant.
Is this normal? This seems like very strange behavior, and I'm not exactly sure how to debug it.
For each partition, the follower will keep fetching data from the leader replica, even if there is no new data on the leader.
One thing you can try is to increase replica.fetch.wait.max.ms (default value 500 ms) so that the followers' fetch request frequency to the leader is reduced, and see if that has some effect on the traffic.
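For illustration, a minimal server.properties sketch of that tweak (the values below are placeholders, not recommendations from this thread):

```properties
# Follower fetch tuning (Kafka 0.8 broker config).
# How long a follower's fetch request may wait on the leader for new data
# before the leader returns a (possibly empty) response. Default is 500 ms;
# raising it reduces how often followers poll an idle leader.
replica.fetch.wait.max.ms=2000

# Minimum bytes the leader should have available before answering a follower
# fetch request (default 1, i.e. respond as soon as any data exists).
replica.fetch.min.bytes=1
```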
Guozhang
On Tue, Feb 4, 2014 at 8:46 PM, Carl Lerche <[EMAIL PROTECTED]> wrote:
Checking twice a second (with the 500 ms default) for new data shouldn't cause high network traffic (that should be less than 1 KB of overhead). I don't think that explains things. Is it possible that setting has been overridden?
-Jay
On Tue, Feb 4, 2014 at 9:25 PM, Guozhang Wang <[EMAIL PROTECTED]> wrote:
So, I tried enabling debug logging, and I also made some tweaks to the config (which I probably shouldn't have), and craziness happened.
First, some more context. Besides the very high network traffic, we were seeing some other issues that we had not focused on yet.
* Even though log retention was set to 50GB & 24 hours, data logs were getting cleaned up far quicker. I'm not entirely sure how much quicker, but there was definitely far less than 12 hours and 1GB of data on disk (the retention settings in question are sketched after this list).
* Kafka was not properly balanced. We had 3 servers, and only 2 of them were partition leaders; the third server was only a follower replica for its partitions. We tried to run a rebalance command (also sketched below), but it did not work. We were going to investigate later.
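For reference, a rough sketch of the retention settings and rebalance step being described; the values and the ZooKeeper address are placeholders, and the exact command used on this cluster isn't stated in the thread:

```properties
# server.properties: retention limits (0.8 broker config, applied per partition).
log.retention.hours=24
log.retention.bytes=53687091200   # ~50 GB
```

```sh
# Move leadership back to the "preferred" replicas, assuming this is the
# rebalance command that was attempted (ZooKeeper address is a placeholder).
bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181
```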
So, after restarting all the Kafka nodes, something happened with the offsets. The offsets that our consumers had no longer existed. It looks like somehow all the log contents were lost? The logs show many exceptions like:
`Request for offset 770354 but we only have log segments in the range 759234 to 759838.`
So, I reset all the consumer offsets to the head of the queue, as I did not know of anything better to do. Once the dust settled, all the issues we were seeing vanished. Communication between Kafka nodes appears to be normal, Kafka was able to rebalance, and hopefully log retention will be normal too.
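For anyone in a similar spot, a rough sketch of how such a reset can be done with the 0.8 high-level consumer, whose offsets live in ZooKeeper; the group, topic, partition, offset, and host names below are placeholders, and consumers should be stopped first:

```sh
# Find the valid offset range the broker actually has for a partition
# (--time -2 = earliest available, -1 = latest).
bin/kafka-run-class.sh kafka.tools.GetOffsetShell \
  --broker-list broker1:9092 --topic my-topic --time -2

# Inspect and overwrite the stored offset for a consumer group in ZooKeeper.
zkCli.sh -server zk1:2181 get /consumers/my-group/offsets/my-topic/0
zkCli.sh -server zk1:2181 set /consumers/my-group/offsets/my-topic/0 759234
```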
I am unsure what happened or how to get more debug information.
On Wed, Feb 5, 2014 at 12:31 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
On Wed, Feb 05, 2014 at 04:51:16PM -0800, Carl Lerche wrote:
Were any of the brokers down for an extended period? If the preferred replica election command failed, it could be because the preferred replica was catching up (which could explain the higher-than-expected network traffic). Do you monitor the under-replicated partitions count on your cluster? If you have that data, it could help confirm this.
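One way to watch that count, as a sketch: each broker exposes an UnderReplicatedPartitions gauge over JMX, which can be polled with the JmxTool that ships with Kafka. The host, JMX port, and exact MBean name (which varies a bit between versions) are assumptions here:

```sh
# Poll the broker's under-replicated-partitions gauge over JMX.
# Requires the broker to have been started with JMX enabled, e.g. JMX_PORT=9999.
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi \
  --object-name 'kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions'
```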
So, the "good news" is that the problem came back again. The bad news is that I disabled debug logs as it was filling disk (and I had other fires to put out). I will re-enable debug logs and wait for it to happen again.
On Thu, Feb 6, 2014 at 4:05 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
So, if you start from scratch (new environment and download of the Kafka release), could you post the list of steps to reproduce this issue? On Thu, Feb 6, 2014 at 7:48 PM, Carl Lerche <[EMAIL PROTECTED]> wrote:
Re: While I have not explicitly checked this, I would say that the answer is no. The network is over 1Gbps, and I have set up monitoring for disk space; nothing out of the norm is happening there. The expected data rate is on the order of 500 kbits per sec.
On Thu, Feb 6, 2014 at 9:06 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
One last thing: I have collected a snippet of the network traffic between the Kafka instances using tcpdump. However, it contains some customer data, and less than a minute's worth was over 1 GB, so I can't really post it here, but I could possibly share it offline if it can help debug the issue.
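For completeness, a capture along these lines could look something like the following; the interface, peer address, and the default broker port 9092 are assumptions, not details from the thread:

```sh
# Capture inter-broker traffic on the broker port to a pcap file (full packets).
sudo tcpdump -i eth0 -s 0 -w kafka-peer.pcap 'port 9092 and host 10.0.0.2'
```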
On Thu, Feb 6, 2014 at 11:44 PM, Carl Lerche <[EMAIL PROTECTED]> wrote:
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
On Fri, Feb 7, 2014 at 2:47 AM, Carl Lerche <[EMAIL PROTECTED]> wrote:
Those periods with "no traffic" actually are periods of expected traffic between nodes; it's just that the traffic during the abnormal periods is so high that the normal traffic is not visible. Also, once traffic goes crazy, the only way to reset it is to stop all Kafka nodes (as opposed to doing a rolling restart).
I have been running the Kafka nodes in different AWS availability zones, so the bandwidth is costing me. For now, I have temporarily moved to a single Kafka node. Once I can start collecting metrics, I will attempt to reproduce the issue.
On Fri, Feb 7, 2014 at 5:42 AM, Joe Stein <[EMAIL PROTECTED]> wrote:
From: "Jun Rao"; Date: 2014年2月14日(星期五) 晚上11:54 To: "[EMAIL PROTECTED]"; Subject: Re: Surprisingly high network traffic between kafka servers
Thanks for sharing this. We probably should add a sanity check in the broker to make sure that replica.fetch.max.bytes >= message.max.bytes. Could you file a JIRA for that?
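A sketch of the settings involved in that check (the byte values below are only illustrative): the follower's per-partition fetch size should be at least as large as the largest message the broker will accept, otherwise a follower may be unable to fully fetch an oversized message.

```properties
# server.properties: keep these consistent across the cluster.
# Largest message the broker will accept from producers.
message.max.bytes=1000000

# Largest amount of data a follower fetches from the leader per partition;
# should be >= message.max.bytes so replication of a maximum-size message
# cannot stall.
replica.fetch.max.bytes=1048576
```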
Jun
On Thu, Feb 13, 2014 at 8:01 PM, zhong dong <[EMAIL PROTECTED]> wrote: