It's not ideal. Right now we use the JMX operation, which returns an
empty set on a successful controlled shutdown; otherwise it returns a
set containing the partitions still led by the broker. We retry (with
appropriate intervals) until it succeeds, and then do a regular broker
shutdown (SIGTERM). In other words, it is currently not automated, and
it takes a while (an hour or so) to do a rolling bounce across a
16-node cluster with a few hundred topics. We could also use the
broker's built-in controlled shutdown feature to do the same thing,
which is also better because the JMX port is not always open to
remote access in production environments.
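
To make the retry step concrete, here is a rough Python sketch of the
loop described above. It is only an illustration, not what we actually
run: attempt_shutdown is a hypothetical callable standing in for the
JMX call (it returns the set of partitions still led by the broker),
and the retry count and interval are made-up defaults.

```python
import time
from typing import Callable, Set


def controlled_shutdown_with_retries(
    attempt_shutdown: Callable[[], Set[str]],
    max_retries: int = 3,
    retry_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Retry a controlled shutdown until no partitions remain on the broker.

    attempt_shutdown is a hypothetical stand-in for the JMX operation: it
    returns the set of partitions the broker still leads (empty on success).
    Returns the partitions still led after the last attempt.
    """
    remaining: Set[str] = set()
    for attempt in range(1, max_retries + 1):
        remaining = attempt_shutdown()
        if not remaining:
            # Empty set: leadership has moved, safe to send SIGTERM.
            return remaining
        if attempt < max_retries:
            # Give leadership moves some time before asking again.
            sleep(retry_interval_s)
    return remaining
```

An empty return value means the broker can be stopped normally; a
non-empty one means the caller has to decide what to do next.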

It is possible to automate it to some degree: if controlled shutdown
still fails after 'n' retries, the automation policy could be either
to proceed with an unclean shutdown or to abort and wait for manual
intervention. Another issue is that when a broker is taken down there
will be under-replicated partitions in the cluster. When the broker
comes back up we should (ideally) wait until the under-replicated
partition count goes back down to zero before proceeding to the next
broker. Otherwise the next broker could take longer to do its
controlled shutdown, since it needs to move leadership of the
partitions it leads to other replicas, which would not be possible if
the other replica is on the broker that just came up and has not yet
caught up. We currently don't have an easy way to integrate this
seamlessly with the deployment system at LinkedIn.
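
For what it's worth, the automation policy could look roughly like the
Python sketch below. Everything here is an assumption for illustration:
the callables (controlled_shutdown, stop_broker, start_broker,
under_replicated_count) are hypothetical hooks that a deployment system
would have to provide, and the "abort"/"proceed" policy names are made
up.

```python
import time
from typing import Callable, Iterable, List


def rolling_bounce(
    brokers: Iterable[str],
    controlled_shutdown: Callable[[str], bool],
    stop_broker: Callable[[str], None],
    start_broker: Callable[[str], None],
    under_replicated_count: Callable[[], int],
    on_failure: str = "abort",
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Bounce brokers one at a time; return the brokers actually bounced.

    All callables are hypothetical hooks. controlled_shutdown returns True
    if leadership was moved off the broker (after its own retries).
    """
    if on_failure not in ("abort", "proceed"):
        raise ValueError("on_failure must be 'abort' or 'proceed'")
    bounced: List[str] = []
    for broker in brokers:
        if not controlled_shutdown(broker) and on_failure == "abort":
            # Stop here and wait for manual intervention.
            break
        # Either the controlled shutdown worked, or policy says proceed
        # with an unclean shutdown.
        stop_broker(broker)
        start_broker(broker)
        # Wait for the restarted broker to catch up before moving on.
        while under_replicated_count() > 0:
            sleep(poll_interval_s)
        bounced.append(broker)
    return bounced
```

The wait on the under-replicated count is the part that keeps the next
broker's controlled shutdown from stalling.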


On Wed, Jul 10, 2013 at 9:48 PM, Vadim Keylis <[EMAIL PROTECTED]> wrote:
