We have a cluster of three Kafka 0.8 brokers, and this morning one of the brokers crashed. It is a test cluster, and we stored the logs in /tmp/kafka-logs. All topics in use are replicated across the three brokers.
You can guess the problem: when the broker rebooted, it wiped all the data in the logs.
The producers and consumers are fine, but the broker with the wiped data keeps generating a lot of exceptions, and I don't really know what to do to recover. Here is one of the exceptions:
[2014-01-03 10:09:47,755] ERROR [KafkaApi-1] Error when processing fetch request for partition [topic,0] offset 814798 from consumer with correlation id 0 (kafka.server.KafkaApis)
kafka.common.OffsetOutOfRangeException: Request for offset 814798 but we only have log segments in the range 0 to 19372.
There are a lot of them, something like 10+ per second. I (maybe wrongly) assumed that the broker would catch up; if that's the case, how can I see the progress?
In general, what is the recommended way to bring back a broker with wiped data in a cluster?
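Losing data on reboot this way is usually avoided by pointing the broker's log directories at persistent storage instead of /tmp, which many distributions clear on boot. A minimal server.properties sketch, where the path is only an example:

    # server.properties: keep Kafka data out of /tmp so an OS reboot
    # doesn't wipe the partition logs
    log.dirs=/var/kafka-logs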
If a broker crashes and restarts, it will catch up on the missing data from the leader replicas. Normally, while this broker is catching up, it won't serve any client requests. Are you seeing those errors on the crashed broker? Also, you shouldn't see OffsetOutOfRangeException after just one broker failure when topics have 3 replicas. Do you see the following in the controller log?
"No broker in ISR is alive for ... There's potential data loss."
On Fri, Jan 3, 2014 at 1:23 AM, Vincent Rischmann <[EMAIL PROTECTED]> wrote:
My controller.log file only contains the following:
[2014-01-03 09:41:01,794] INFO [ControllerEpochListener on 1]: Initialized controller epoch to 11 and zk version 10 (kafka.controller.ControllerEpochListener)
[2014-01-03 09:41:01,812] INFO [Controller 1]: Controller starting up (kafka.controller.KafkaController)
[2014-01-03 09:41:02,082] INFO [Controller 1]: Controller startup complete (kafka.controller.KafkaController)
Since Friday nothing has changed, and the broker has generated multiple gigabytes of traces in server.log. One of the last exceptions looks like this:
Request for offset 787449 but we only have log segments in the range 0 to 163110.
The range has increased since Friday (it was "0 to 19372"); does this mean the broker is actually catching up? Thanks for your help.
2014/1/3 Jun Rao <[EMAIL PROTECTED]>
The topic causing problems is "clicks", and the partitions requested on the crashed broker are 0 and 2. Given the output of list topic, this means that those two partitions are permanently lost right now, right?
I thought all partitions were replicated, just like for the topic 'stats.live.test', but apparently I screwed up when creating the topics. I should have checked that first.
Thanks for your help.
2014/1/6 Jun Rao <[EMAIL PROTECTED]>
If you want replication, you need to specify the replication factor, either in default.replication.factor for auto-created topics or when creating topics manually.

Jun
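A sketch of both options with the 0.8 tools; the host, topic name, and counts are examples only:

    # Option 1: server.properties -- replication factor for auto-created topics
    default.replication.factor=3

    # Option 2: pass the factor explicitly when creating a topic
    # (kafka-create-topic.sh is the 0.8.0 tool; later releases use kafka-topics.sh)
    bin/kafka-create-topic.sh --zookeeper localhost:2181 \
        --topic clicks --partition 3 --replica 3

Note that neither option changes the replication factor of an already-created topic; in 0.8 that requires a partition reassignment.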
On Tue, Jan 7, 2014 at 1:17 AM, Vincent Rischmann <[EMAIL PROTECTED]> wrote: