Re: Controlled shutdown failure, retry settings
On Fri, Oct 25, 2013 at 3:22 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
> It looks like when the controlled shutdown fails with an IOException, the
> exception is swallowed, and we see nothing in the logs:
>
>             catch {
>               case ioe: java.io.IOException =>
>                 channel.disconnect()
>                 channel = null
>                 // ignore and try again
>             }

> Question: what are the ramifications of an 'unclean shutdown'?  Is it no
> different from a shutdown with no controlled shutdown ever attempted?  Or
> is it something more difficult to recover from?

An unclean shutdown can result in data loss, since you are moving
leadership to a replica that has fallen out of ISR, i.e., its log end
offset is behind the last committed message for this partition. For
example, if the last committed offset is 100 and the new leader's log
ends at offset 95, messages 96 through 100 are lost.

>
> I am still not clear on how to generate the state transition logs. Does
> the StateChangeLogMergerTool run against the main logs for the server (and
> just collate the entries there)?

Take a look at the packaged log4j.properties file. The controller's
partition/replica state machines and its channel manager (which
sends/receives LeaderAndIsr requests/responses to brokers) use a
stateChangeLogger. The replica managers on all brokers also use this
logger.
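For reference, the state-change logger section of the packaged
log4j.properties looks roughly like the excerpt below (appender name, file
location, and log level may vary by version); it routes each broker's state
change events to a separate state-change.log file:

    # illustrative excerpt - check the log4j.properties shipped with your version
    log4j.appender.stateChangeAppender=org.apache.log4j.DailyRollingFileAppender
    log4j.appender.stateChangeAppender.DatePattern='.'yyyy-MM-dd-HH
    log4j.appender.stateChangeAppender.File=state-change.log
    log4j.appender.stateChangeAppender.layout=org.apache.log4j.PatternLayout
    log4j.appender.stateChangeAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

    log4j.logger.state.change.logger=TRACE, stateChangeAppender
    log4j.additivity.state.change.logger=false

Those per-broker state-change.log files are the input to the merge tool; an
invocation would look something like this (paths and topic are placeholders;
see the replication tools wiki page linked below for the full option list):

    bin/kafka-run-class.sh kafka.tools.StateChangeLogMerger \
        --logs broker1/state-change.log,broker2/state-change.log \
        --topic mytopic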

>
>
> This one eventually succeeds, after a mysterious failure:
>
> 2013-10-25 00:11:53,891  INFO [Thread-13] server.KafkaServer - [Kafka
> Server 10], Starting controlled shutdown
> ....<no exceptions between these log lines><no "Remaining partitions to
> move....">....
> 2013-10-25 00:12:28,965  WARN [Thread-13] server.KafkaServer - [Kafka
> Server 10], Retrying controlled shutdown after the previous attempt
> failed...

Our logging can improve here - e.g., it looks like on controller movement
we can end up retrying without saying why.
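For illustration, a minimal sketch of how the catch block quoted above could
surface the failure before retrying - this is not the actual Kafka code, and
it assumes the surrounding class mixes in Kafka's Logging trait so warn() is
in scope:

              catch {
                case ioe: java.io.IOException =>
                  // log why the controlled shutdown attempt failed before retrying
                  warn("Error during controlled shutdown, will retry: " + ioe.getMessage)
                  channel.disconnect()
                  channel = null
              }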
Thanks,

Joel

>
>
>
> On Fri, Oct 25, 2013 at 12:51 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
>
>> Neha,
>>
>> It looks like the StateChangeLogMergerTool takes state change logs as
>> input.  I'm not sure where those live.  (Maybe update the doc on that
>> wiki page to describe this!)
>>
>> Thanks,
>>
>> Jason
>>
>>
>> On Fri, Oct 25, 2013 at 12:38 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
>>
>>> Jason,
>>>
>>> The state change log tool is described here -
>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool
>>>
>>> I'm curious what the IOException is and if we can improve error reporting.
>>> Can you send around the stack trace?
>>>
>>> Thanks,
>>> Neha
>>>
>>>
>>> On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>> > Ok,
>>> >
>>> > Looking at the controlled shutdown code, it appears that it can fail
>>> > with an IOException too, in which case it won't report the remaining
>>> > partitions to replicate, etc.  (I think that might be what I'm seeing,
>>> > since I never saw the log line for "controlled shutdown failed, X
>>> > remaining partitions", etc.).  In my case, that may be the issue (it's
>>> > happening during a rolling restart, and the second of 3 nodes might be
>>> > trying to shut down before the first one has completely come back up).
>>> >
>>> > I've heard you guys mention controller and state change logs several
>>> > times now, but I don't know where those live (or how to configure
>>> > them).  Please advise!
>>> >
>>> > Thanks,
>>> >
>>> > Jason
>>> >
>>> >
>>> > On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
>>> >
>>> > > Controlled shutdown can fail if the cluster has a non-zero
>>> > > under-replicated partition count, since that means the leaders may
>>> > > not move off the broker being shut down, causing controlled shutdown
>>> > > to fail. The backoff might help if the under-replication is just
>>> > > temporary due to a spike in traffic. This is the most common reason
>>> > > it might fail besides bugs, but you can check the logs to see why
>>> > > the shutdown failed.
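On the retry settings mentioned in the subject, the broker-side properties
that govern this behavior are along these lines (values shown are
illustrative; check the broker configuration documentation for your
version's defaults):

    # server.properties - controlled shutdown behavior
    controlled.shutdown.enable=true
    # number of times the broker retries the controlled shutdown request
    controlled.shutdown.max.retries=3
    # backoff between retries, in milliseconds
    controlled.shutdown.retry.backoff.ms=5000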