Kafka, mail # user - Controlled shutdown failure, retry settings

Re: Controlled shutdown failure, retry settings
Jason Rosenberg 2013-10-25, 22:22
It looks like when the controlled shutdown fails with an IOException, the
exception is swallowed, and we see nothing in the logs:

            catch {
              case ioe: java.io.IOException =>
                channel.disconnect()
                channel = null
                // ignore and try again
            }
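
For illustration, here is a minimal sketch of how a retry loop like the one above could log the IOException before retrying instead of dropping it. This is not the actual KafkaServer code: the `Channel` trait, the `retry` function, and the `warn` callback are hypothetical stand-ins introduced only for this example.

```scala
import java.io.IOException

object ControlledShutdownRetrySketch {
  // Hypothetical stand-in for the broker's channel to the controller.
  trait Channel {
    def send(): Boolean      // true if the controlled shutdown succeeded; may throw IOException
    def disconnect(): Unit
  }

  // Tries up to maxRetries times, sleeping backoffMs between failed attempts.
  // Unlike the snippet above, the IOException is passed to `warn` before retrying.
  def retry(maxRetries: Int,
            backoffMs: Long,
            connect: () => Channel,
            warn: (String, Throwable) => Unit): Boolean = {
    var channel: Channel = null
    var remaining = maxRetries
    var succeeded = false
    while (!succeeded && remaining > 0) {
      remaining -= 1
      try {
        if (channel == null) channel = connect()
        succeeded = channel.send()
      } catch {
        case ioe: IOException =>
          // Surface the cause so the failure shows up in the logs.
          warn("Controlled shutdown attempt failed, will retry", ioe)
          if (channel != null) channel.disconnect()
          channel = null
      }
      if (!succeeded && remaining > 0) Thread.sleep(backoffMs)
    }
    succeeded
  }
}
```

With something along these lines, each failed attempt would leave a warning and a stack trace in the log rather than nothing at all.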

So, I don't really have visibility into why controlled shutdown fails when
it does.....Below are some logging examples.

The first one eventually succeeds, after a mysterious error.  The second
one exhibits the issue suggested earlier ("no other replicas in ISR").

So, I'll try adding more retries and more back off delay.....
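
Concretely, the broker settings I have in mind are, as far as I know, the ones below in server.properties. The values are only illustrative, not recommendations, and the defaults noted in the comments are from memory and may differ by version.

```properties
# Enable controlled shutdown on this broker
controlled.shutdown.enable=true
# Number of controlled shutdown attempts before falling back to an
# unclean shutdown (default is believed to be 3)
controlled.shutdown.max.retries=6
# Delay between attempts, in milliseconds (default is believed to be 5000)
controlled.shutdown.retry.backoff.ms=10000
```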

Question.....what is the ramification of an 'unclean shutdown'?  Is it no
different than a shutdown with no controlled shutdown ever attempted?  Or
is it something more difficult to recover from?

I am still not clear on how to generate the state transition logs.....Does
the StateChangeLogManagerTool run against the main logs for the server (and
just collates entries there)?

This one eventually succeeds, after a mysterious failure:

2013-10-25 00:11:53,891  INFO [Thread-13] server.KafkaServer - [Kafka
Server 10], Starting controlled shutdown
....<no exceptions between these log lines><no "Remaining partitions to
move....">....
2013-10-25 00:12:28,965  WARN [Thread-13] server.KafkaServer - [Kafka
Server 10], Retrying controlled shutdown after the previous attempt
failed...
....
2013-10-25 00:12:56,623  INFO [Thread-13] server.KafkaServer - [Kafka
Server 10], Controlled shutdown succeeded

This one ultimately fails, and proceeds with an unclean shutdown:

2013-10-25 20:39:10,350  INFO [Thread-12] server.KafkaServer - [Kafka
Server 11], Starting controlled shutdown
...
<lots of exceptions like this><no "Remaining partitions to move....">....
2013-10-25 20:39:40,735 ERROR [kafka-request-handler-4] change.logger -
Controller 11 epoch 187 encountered error while electing leader for
partition [topicX,0] due to: No other replicas in ISR 11 for [topicX,0]
besides current leader 11 and shutting down brokers 11.
2013-10-25 20:39:40,735 ERROR [kafka-request-handler-4] change.logger -
Controller 11 epoch 187 initiated state change for partition [topicX,0]
from OnlinePartition to OnlinePartition failed
kafka.common.StateChangeFailedException: encountered error while electing
leader for partition [topicX,0] due to: No other replicas in ISR 11 for
[topicX,0] besides current leader 11 and shutting down brokers 11.
        at
kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:328)
        at
kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:155)
        at
kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:111)
        at
kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:110)
        at scala.collection.immutable.Set$Set1.foreach(Set.scala:81)
        at
kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:110)
        at
kafka.controller.KafkaController$$anonfun$shutdownBroker$4$$anonfun$apply$2.apply(KafkaController.scala:188)
        at
kafka.controller.KafkaController$$anonfun$shutdownBroker$4$$anonfun$apply$2.apply(KafkaController.scala:184)
        at scala.Option.foreach(Option.scala:121)
        at
kafka.controller.KafkaController$$anonfun$shutdownBroker$4.apply(KafkaController.scala:184)
        at
kafka.controller.KafkaController$$anonfun$shutdownBroker$4.apply(KafkaController.scala:180)
        at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:57)
        at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:43)
        at
kafka.controller.KafkaController.shutdownBroker(KafkaController.scala:180)
        at
kafka.server.KafkaApis.handleControlledShutdownRequest(KafkaApis.scala:133)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:72)
        at
kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
        at java.lang.Thread.run(Thread.java:662)
Caused by: kafka.common.StateChangeFailedException: No other replicas in
ISR 11 for [topicX,0] besides current leader 11 and shutting down brokers 11
        at
kafka.controller.ControlledShutdownLeaderSelector.selectLeader(PartitionLeaderSelector.scala:177)
        at
kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:304)
        ... 17 more
...
2013-10-25 20:39:45,404  WARN [Thread-12] server.KafkaServer - [Kafka
Server 11], Retrying controlled shutdown after the previous attempt
failed...
<lots more of the StateChangeFailedExceptions><no "Remaining partitions to
move....">....
2013-10-25 20:40:20,499  WARN [Thread-12] server.KafkaServer - [Kafka
Server 11], Retrying controlled shutdown after the previous attempt
failed...
<lots more of the StateChangeFailedExceptions><no "Remaining partitions to
move....">....
2013-10-25 20:40:55,598  WARN [Thread-12] server.KafkaServer - [Kafka
Server 11], Retrying controlled shutdown after the previous attempt
failed...
2013-10-25 20:40:55,598  WARN [Thread-12] server.KafkaServer - [Kafka
Server 11], Proceeding to do an unclean shutdown as all the controlled
shutdown attempts failed

So, this would seem to indicate the issue described previously (no leader
for partition so unclean shutdown).....
On Fri, Oct 25, 2013 at 12:51 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote: