We recently increased the number of partitions from 4 to 16, and since then the console producer mostly fails with LeaderNotAvailableException and exits after 3 tries. Here are the last few lines of the console producer log:
No partition metadata for topic test-41 due to kafka.common.LeaderNotAvailableException}] for topic [test-41]: class kafka.common.LeaderNotAvailableException (kafka.producer.BrokerPartitionInfo)
[2013-08-27 08:29:30,271] ERROR Failed to collate messages by topic, partition due to: Failed to fetch topic metadata for topic: test-41 (kafka.producer.async.DefaultEventHandler)
[2013-08-27 08:29:30,271] INFO Back off for 100 ms before retrying send. Remaining retries = 0 (kafka.producer.async.DefaultEventHandler)
[2013-08-27 08:29:30,372] INFO Secure sockets for data transfer is enabled (kafka.producer.SyncProducerConfig)
[2013-08-27 08:29:30,372] INFO Fetching metadata from broker id:0,host:localhost,port:6667,secure:true with correlation id 8 for 1 topic(s) Set(test-41) (kafka.client.ClientUtils$)
[2013-08-27 08:29:30,373] INFO begin ssl handshake for localhost/127.0.0.1:6667//127.0.0.1:36640 (kafka.security.SSLSocketChannel)
[2013-08-27 08:29:30,375] INFO finished ssl handshake for localhost/127.0.0.1:6667//127.0.0.1:36640 (kafka.security.SSLSocketChannel)
[2013-08-27 08:29:30,375] INFO Connected to localhost:6667:true for producing (kafka.producer.SyncProducer)
[2013-08-27 08:29:30,380] INFO Disconnecting from localhost:6667:true (kafka.producer.SyncProducer)
[2013-08-27 08:29:30,381] INFO Secure sockets for data transfer is enabled (kafka.producer.SyncProducerConfig)
[2013-08-27 08:29:30,381] ERROR Failed to send requests for topics test-41 with correlation ids in [0,8] (kafka.producer.async.DefaultEventHandler)
kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
        at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
        at kafka.producer.Producer.send(Producer.scala:74)
        at kafka.producer.ConsoleProducer$.main(ConsoleProducer.scala:168)
        at kafka.producer.ConsoleProducer.main(ConsoleProducer.scala)
[2013-08-27 08:29:30,383] INFO Shutting down producer (kafka.producer.Producer)
[2013-08-27 08:29:30,384] INFO Closing all sync producers (kafka.producer.ProducerPool)

Also, this happens only for new topics (we have auto.create.topics.enable set to true). If I retry sending the message to an existing topic, it works fine. Is there any tweaking I need to do on the broker or the producer to scale with the number of partitions?
In 0.8, producers keep a cache of the partition -> leader_broker_id map, which is used to determine which brokers the messages should be sent to. After new partitions are added, the producer's cache has not been populated yet, hence it throws this exception. The producer then tries to refresh its cache by asking the brokers "who are the leaders of these new partitions that I did not know about before". At first the brokers do not know this information either; they get it from the controller, which only propagates the leader information after the leader elections have all finished.
If you set num.retries to 3, it is possible that the producer gives up too soon, before the leader info has been propagated to the brokers, and hence to the producers. Could you try increasing producer.num.retries and see if the producer eventually succeeds on retry?
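For reference, here is a minimal sketch of a standalone 0.8 producer configured with a larger retry count and a longer back-off, so that the retries can outlast the leader election. The broker address is taken from the log above; the retry values are only illustrative, and the SSL settings from the secure build are omitted:

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

val props = new Properties()
props.put("metadata.broker.list", "localhost:6667")
props.put("serializer.class", "kafka.serializer.StringEncoder")
// default is 3; raise it so the producer keeps refreshing metadata until the new leaders are known
props.put("message.send.max.retries", "10")
// default is 100 ms; wait a little longer between metadata refreshes
props.put("retry.backoff.ms", "200")

val producer = new Producer[String, String](new ProducerConfig(props))
producer.send(new KeyedMessage[String, String]("test-41", "hello"))
producer.close()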
Guozhang On Tue, Aug 27, 2013 at 8:53 AM, Rajasekar Elango <[EMAIL PROTECTED]> wrote:
As Guozhang said, your producer might be giving up sooner than the leader election completes for the new topic. To confirm whether your producer gave up too soon, you can run the state change log merge tool for this topic and see when the leader election finished for all partitions:
./bin/kafka-run-class.sh kafka.tools.StateChangeLogMerger --logs <location to all state change logs> --topic <topic>
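For example, with the state change logs copied locally from each broker (the file paths below are just placeholders; check the tool's help output for the exact --logs syntax):

./bin/kafka-run-class.sh kafka.tools.StateChangeLogMerger --logs broker1/state-change.log,broker2/state-change.log,broker3/state-change.log --topic test-41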
Note that this tool requires you to give the state change logs for all brokers in the cluster. Thanks, Neha On Tue, Aug 27, 2013 at 9:45 AM, Guozhang Wang <[EMAIL PROTECTED]> wrote:
When I ran StateChangeLogMerger, I saw this message repeated 16 times, once for each partition:
[2013-08-27 12:30:02,535] INFO [ReplicaFetcherManager on broker 1] Removing fetcher for partition [test-60,13] (kafka.server.ReplicaFetcherManager)
[2013-08-27 12:30:02,536] INFO [Log Manager on Broker 1] Created log for partition [test-60,13] in /home/relango/dev/mandm/kafka/main/target/dist/mandm-kafka/kafka-data. (kafka.log.LogManager)
I am also seeing .log and .index files created for this topic in the data dir. Also, the list topic command shows leaders, replicas, and ISRs for all partitions. Do you still think increasing the number of retries would help, or is it some other issue? Also, the console producer doesn't seem to have an option to set the number of retries. Is there a way to configure the number of retries for the console producer?
Thanks, Raja. On Tue, Aug 27, 2013 at 12:52 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
The "Removing fetcher" log entries are normal when partitions are added; they indicate that some leader changes have happened, so the brokers are closing the fetchers to the old leaders.
I just realized that the console producer does not have the message.send.max.retries option yet. Could you file a JIRA for this, and I will follow up to add the option? For now, you can modify the hard-coded default value from 3 to a larger number.
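As an illustration of that stop-gap (the exact place in ConsoleProducer.scala where the producer Properties are assembled is an assumption, and the values are arbitrary):

// in ConsoleProducer.scala, where the producer Properties are built before creating the Producer
props.put("message.send.max.retries", "10")  // temporarily override the default of 3
props.put("retry.backoff.ms", "200")         // optionally lengthen the back-off between retries as well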
Guozhang On Tue, Aug 27, 2013 at 12:37 PM, Rajasekar Elango <[EMAIL PROTECTED]> wrote:
Thanks Guozhang, changing the max retries to 5 worked. Since I am changing the console producer code, I can also submit a patch adding both message.send.max.retries and retry.backoff.ms to the console producer. Can you let me know the process for submitting a patch?
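A rough sketch of what such a patch might look like, following the joptsimple style the console producer already uses for its other options (option names, descriptions, and defaults here are assumptions, not the actual committed change; parser, options, and props refer to the existing values in ConsoleProducer.scala):

val messageSendMaxRetriesOpt = parser.accepts("message-send-max-retries",
    "Number of times to retry a failed send before giving up.")
  .withRequiredArg
  .describedAs("retries")
  .ofType(classOf[java.lang.Integer])
  .defaultsTo(3)

val retryBackoffMsOpt = parser.accepts("retry-backoff-ms",
    "Time in ms the producer waits before refreshing metadata and retrying a failed send.")
  .withRequiredArg
  .describedAs("ms")
  .ofType(classOf[java.lang.Long])
  .defaultsTo(100L)

// later, when the producer Properties are assembled:
props.put("message.send.max.retries", options.valueOf(messageSendMaxRetriesOpt).toString)
props.put("retry.backoff.ms", options.valueOf(retryBackoffMsOpt).toString)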
Thanks, Raja. On Tue, Aug 27, 2013 at 4:03 PM, Guozhang Wang <[EMAIL PROTECTED]> wrote:
The documentation says I need to work off of trunk. Can you confirm if I should be working in trunk or a different branch? Thanks, Raja. On Tue, Aug 27, 2013 at 8:33 PM, Guozhang Wang <[EMAIL PROTECTED]> wrote:
We are trying to limit patches on 0.8 to critical bug fixes or broken tooling. If a patch involves significant code changes, we would encourage taking it on trunk. If you just want to fix the console producer to take the retry argument, I would think it is small enough to consider taking it on the 0.8 branch, since it affects the usability of the console producer.
Thanks, Neha On Wed, Aug 28, 2013 at 8:36 AM, Rajasekar Elango <[EMAIL PROTECTED]> wrote: