Kafka >> mail # user >> Controlled shutdown failure, retry settings


Re: Controlled shutdown failure, retry settings
On Fri, Oct 25, 2013 at 3:22 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
> It looks like when the controlled shutdown fails with an IOException, the
> exception is swallowed, and we see nothing in the logs:
>
>             catch {
>               case ioe: java.io.IOException =>
>                 channel.disconnect()
>                 channel = null
>                 // ignore and try again
>             }
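A minimal sketch of the pattern Jason is describing (in Java rather than the broker's Scala, and not Kafka's actual code): the same retry-on-IOException loop, but recording why each attempt failed instead of silently swallowing the exception.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Kafka's actual code): retry on IOException,
// but log the cause before retrying instead of dropping it.
public class ControlledShutdownSketch {

    /** One controlled-shutdown attempt; may fail with an IOException. */
    interface ShutdownAttempt {
        void run(int attemptNo) throws IOException;
    }

    /** Stand-in for a real logger, so the failures are visible somewhere. */
    static final List<String> messages = new ArrayList<>();

    static boolean attemptShutdown(int maxRetries, ShutdownAttempt attempt) {
        for (int i = 0; i < maxRetries; i++) {
            try {
                attempt.run(i);
                return true;                 // controlled shutdown succeeded
            } catch (IOException ioe) {
                // Surface the cause before retrying instead of swallowing it.
                messages.add("Controlled shutdown attempt " + i + " failed: "
                        + ioe.getMessage() + "; retrying");
            }
        }
        return false;                        // caller falls back to unclean shutdown
    }

    public static void main(String[] args) {
        // Simulate an attempt that fails once, then succeeds on the retry.
        boolean ok = attemptShutdown(3, attemptNo -> {
            if (attemptNo == 0) throw new IOException("connection reset by peer");
        });
        System.out.println("succeeded=" + ok + ", logged=" + messages);
    }
}
```

With logging like this, the mysterious silent retry below would at least leave a trace of the underlying IOException.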

> Question.....what is the ramification of an 'unclean shutdown'?  Is it no
> different than a shutdown with no controlled shutdown ever attempted?  Or
> is it something more difficult to recover from?

Unclean shutdown could result in data loss, since you are moving
leadership to a replica that has fallen out of ISR, i.e., its log end
offset is behind the last committed message to this partition.
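Joel's point can be illustrated with made-up offsets (a toy model, not Kafka code): everything between the lagging replica's log end offset and the last committed offset disappears if that replica becomes leader.

```java
// Toy illustration with made-up offsets (not Kafka code): a replica that fell
// out of ISR has a log end offset behind the partition's last committed
// offset; promoting it to leader discards every message in between.
public class UncleanElectionSketch {

    /** Messages acked to producers that the new leader's log does not contain. */
    static long messagesLost(long lastCommittedOffset, long replicaLogEndOffset) {
        return Math.max(0, lastCommittedOffset - replicaLogEndOffset);
    }

    public static void main(String[] args) {
        long lastCommitted = 1500;   // last offset acknowledged to producers
        long laggingReplica = 1420;  // log end offset of the out-of-ISR replica

        System.out.println("Unclean election would lose "
                + messagesLost(lastCommitted, laggingReplica) + " messages");
        // → Unclean election would lose 80 messages
    }
}
```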

>
> I am still not clear on how to generate the state transition logs. Does
> the StateChangeLogMergerTool run against the main logs for the server (and
> just collate entries there)?

Take a look at the packaged log4j.properties file. The controller's
partition/replica state machines and its channel manager, which
sends/receives LeaderAndIsr requests/responses to brokers, use a
stateChangeLogger. The replica managers on all brokers also use this
logger.
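For reference, the relevant stanza in the packaged log4j.properties looks roughly like this (recalled from Kafka 0.8; verify file paths and appender names against your distribution):

```properties
# Route state-change logging (controller state machines, LeaderAndIsr
# requests/responses, replica managers) to a dedicated file at TRACE level.
log4j.appender.stateChangeAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.stateChangeAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.stateChangeAppender.File=logs/state-change.log
log4j.appender.stateChangeAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.stateChangeAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

log4j.logger.state.change.logger=TRACE, stateChangeAppender
log4j.additivity.state.change.logger=false
```

The resulting state-change.log files from each broker are what the StateChangeLogMergerTool takes as input.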

>
>
> This one eventually succeeds, after a mysterious failure:
>
> 2013-10-25 00:11:53,891  INFO [Thread-13] server.KafkaServer - [Kafka
> Server 10], Starting controlled shutdown
> ....<no exceptions between these log lines><no "Remaining partitions to
> move....">....
> 2013-10-25 00:12:28,965  WARN [Thread-13] server.KafkaServer - [Kafka
> Server 10], Retrying controlled shutdown after the previous attempt
> failed...

Our logging can improve - e.g., it looks like on controller movement we
retry without saying why.
Thanks,

Joel

>
>
>
> On Fri, Oct 25, 2013 at 12:51 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
>
>> Neha,
>>
>> It looks like the StateChangeLogMergerTool takes state change logs as
>> input.  I'm not sure I know where those live?  (Maybe update the doc on
>> that wiki page to describe!).
>>
>> Thanks,
>>
>> Jason
>>
>>
>> On Fri, Oct 25, 2013 at 12:38 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
>>
>>> Jason,
>>>
>>> The state change log tool is described here -
>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool
>>>
>>> I'm curious what the IOException is and if we can improve error reporting.
>>> Can you send around the stack trace?
>>>
>>> Thanks,
>>> Neha
>>>
>>>
>>> On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>> > Ok,
>>> >
>>> > Looking at the controlled shutdown code, it appears that it can fail
>>> > with an IOException too, in which case it won't report the remaining
>>> > partitions to replicate, etc.  (I think that might be what I'm seeing,
>>> > since I never saw the log line for "controlled shutdown failed, X
>>> > remaining partitions", etc.).  In my case, that may be the issue (it's
>>> > happening during a rolling restart, and the second of 3 nodes might be
>>> > trying to shut down before the first one has completely come back up).
>>> >
>>> > I've heard you guys mention several times now about controller and state
>>> > change logs.  But I don't know where those live (or how to configure).
>>> >  Please advise!
>>> >
>>> > Thanks,
>>> >
>>> > Jason
>>> >
>>> >
>>> > On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
>>> >
>>> > > Controlled shutdown can fail if the cluster has a non-zero
>>> > > under-replicated partition count, since that means the leaders may not
>>> > > move off of the broker being shut down, causing controlled shutdown to
>>> > > fail. The backoff might help if the under-replication is just temporary
>>> > > due to a spike in traffic. This is the most common reason it might fail
>>> > > besides bugs, but you can check the logs to see why the shutdown failed.
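The retry settings from the thread's subject line map to broker configs along these lines (names and defaults as of Kafka 0.8.x; worth verifying against your version's documentation):

```properties
# Ask the controller to move leadership off this broker before it stops.
controlled.shutdown.enable=true
# How many times to retry the controlled shutdown before giving up and
# shutting down uncleanly.
controlled.shutdown.max.retries=3
# Back off between attempts, e.g. to ride out a temporary spike in
# under-replicated partitions.
controlled.shutdown.retry.backoff.ms=5000
```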

 