Good afternoon. We are running Kafka on CentOS Linux. I enabled controlled shutdown in the properties file. We start and stop Kafka using an init script. The init script issues a TERM signal first, followed 3 seconds later by a KILL signal. Is that the right process to shut down Kafka? Which startup/shutdown/restart scripts do you use? What shutdown process does LinkedIn use? What side effects could there be after the Kafka service is killed uncleanly with kill -9?
Controlled shutdown takes 2 parameters - number of retries and shutdown timeout. In every retry, controlled shutdown attempts to move leaders off of the broker that needs to be shut down. If the controlled shutdown runs out of retries, it proceeds to shut down the broker even if it still hosts a few leaders. At LinkedIn, the script to bounce Kafka brokers waits for the under replicated partition count to drop to 0 before invoking controlled shutdown on the next broker. The aim is to avoid the data loss that can occur if you shut down a broker that still has some leaders. If the under replicated count never drops to 0, it indicates a bug in Kafka code, and the script does not proceed to bounce any more brokers in the cluster. We measure the time it takes to move "n" leaders off of a broker, and configure the shutdown timeout accordingly. We also configure the retries to a small number (2 or 3). If the controlled shutdown exhausts its retries, the broker shuts itself down anyway. In general, you want to avoid hard killing (kill -9) a broker, since that means the broker will run a long log recovery process on startup. That significantly delays the time the broker takes to rejoin the cluster.
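The bounce procedure described above can be sketched roughly as follows. This is only an illustration, not the actual LinkedIn script: `rolling_bounce`, `bounce_broker` and `under_replicated_count` are hypothetical names, and the caller is assumed to supply its own way to restart one broker and to read the cluster's under replicated partition count. The poll interval and maximum wait are placeholder values.

```python
import time
from typing import Callable, Iterable, List


def wait_for_full_replication(
    under_replicated_count: Callable[[], int],
    poll_seconds: float,
    max_wait_seconds: float,
    sleep: Callable[[float], None],
) -> bool:
    """Poll until the under replicated partition count is 0.

    Returns False if the count is still above 0 after max_wait_seconds.
    """
    waited = 0.0
    while under_replicated_count() > 0:
        if waited >= max_wait_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True


def rolling_bounce(
    brokers: Iterable[str],
    bounce_broker: Callable[[str], None],        # hypothetical: restarts one broker
    under_replicated_count: Callable[[], int],   # hypothetical: reads the cluster metric
    poll_seconds: float = 10.0,                  # placeholder value
    max_wait_seconds: float = 1800.0,            # placeholder value
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Bounce brokers one at a time and return the ones that were bounced.

    Before each broker, wait for the under replicated partition count
    to reach 0. If it never does, stop without touching the rest.
    """
    bounced: List[str] = []
    for broker in brokers:
        if not wait_for_full_replication(
            under_replicated_count, poll_seconds, max_wait_seconds, sleep
        ):
            break
        bounce_broker(broker)
        bounced.append(broker)
    return bounced
```

If the returned list is shorter than the input list, the script gave up waiting and the remaining brokers were left alone, which is the point at which someone should look at the cluster by hand.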
Thanks, Neha
On Sun, Aug 18, 2013 at 3:33 PM, Vadim Keylis <[EMAIL PROTECTED]> wrote:
It depends on how much flexibility you need during the controlled shutdown and whether you have remote JMX operations enabled in your production Kafka cluster. The JMX controlled shutdown method offers more flexibility because your script holds the retry logic, so you don't need to make config changes to the Kafka brokers to change the timeout or the number of retries for controlled shutdown. On the other hand, the JMX controlled shutdown method requires access to remote JMX on the broker. At LinkedIn, we do not have the ability to invoke JMX operations remotely on Kafka brokers in production, so we prefer using the controlled.shutdown.enable method.
Thanks, Neha
On Mon, Aug 19, 2013 at 12:34 PM, Vadim Keylis <[EMAIL PROTECTED]> wrote:
Neha, thanks so much for explaining. That leaves only one open question. How do you validate that the shutdown was successful if you do not have remote JMX access, other than by setting the timeout reasonably high?
Thanks so much again, Vadim
On Mon, Aug 19, 2013 at 9:11 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
The controlled shutdown command proceeds to shut down the broker after it runs out of controlled shutdown retries. Since the shutdown call is blocking, its return indicates that the broker has successfully shut down. If the under replicated partition count then drops to 0, that is a good enough indication of a successful broker bounce.
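For an init script that cannot use JMX, one way to apply this is to send TERM and then wait for the broker process to exit, instead of following up with KILL after a fixed 3 seconds. The sketch below is an assumption-laden illustration, not a script anyone in this thread uses: `stop_broker` and `process_alive` are hypothetical helpers, the 300 second grace period is an arbitrary placeholder to be replaced by a measured value, and it assumes the broker runs its controlled shutdown when it receives TERM.

```python
import os
import signal
import time
from typing import Callable


def process_alive(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def stop_broker(
    pid: int,
    grace_seconds: float = 300.0,   # placeholder; size it from measured shutdown times
    poll_seconds: float = 1.0,
    send_signal: Callable[[int, int], None] = os.kill,
    is_alive: Callable[[int], bool] = process_alive,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Send TERM, wait for the process to exit, and use KILL only as a last resort.

    Returns "clean", "killed", or "not running".
    """
    try:
        send_signal(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "not running"
    waited = 0.0
    while is_alive(pid):
        if waited >= grace_seconds:
            # Hard kill: expect a long log recovery on the next startup.
            send_signal(pid, signal.SIGKILL)
            return "killed"
        sleep(poll_seconds)
        waited += poll_seconds
    return "clean"
```

A "clean" result only says the process exited on its own within the grace period. Checking that the under replicated partition count returns to 0 after the restart is still the stronger signal, as described above.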
Thanks, Neha
On Mon, Aug 19, 2013 at 10:55 PM, Vadim Keylis <[EMAIL PROTECTED]> wrote: