Kafka, mail # user - Controlled shutdown failure, retry settings


Re: Controlled shutdown failure, retry settings
Jason Rosenberg 2013-10-25, 15:26
Ok,

Looking at the controlled shutdown code, it appears that it can also fail
with an IOException, in which case it won't report the remaining partitions.
I think that might be what I'm seeing, since I never saw the log line for
"controlled shutdown failed, X remaining partitions". It's happening during
a rolling restart, and the second of 3 nodes might be trying to shut down
before the first one has completely come back up.
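
To make that reading concrete, here is a rough Python model of how I understand the retry loop. It is only a sketch, not the actual broker code: the function name, the attempt_once callback, the log wording for the I/O case, and the 5000 ms backoff value are all placeholders of mine, and I am assuming there is no sleep after the final attempt.

```python
import time


def controlled_shutdown(attempt_once, max_retries=3, backoff_ms=5000,
                        sleep=time.sleep, log=print):
    """Hypothetical model of the retry loop; not Kafka's actual code.

    attempt_once() returns the number of partitions this broker still
    leads (0 means the shutdown completed), or raises IOError if the
    controller cannot be reached.
    """
    for attempt in range(1, max_retries + 1):
        try:
            remaining = attempt_once()
        except IOError as exc:
            # I/O failure path: no remaining-partition count is available,
            # so the "X remaining partitions" line is never logged.
            log(f"attempt {attempt}: I/O error talking to controller: {exc}")
        else:
            if remaining == 0:
                return True
            log(f"attempt {attempt}: controlled shutdown failed, "
                f"{remaining} remaining partitions")
        if attempt < max_retries:
            # Back off before the next attempt (assumed: none after the last).
            sleep(backoff_ms / 1000.0)
    return False
```

If this model is right, the time spent waiting between attempts is bounded by roughly (max_retries - 1) * backoff_ms, plus however long each attempt itself takes, so the retry count and backoff together act as an indirect timeout.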

I've seen you guys mention the controller and state change logs several
times now, but I don't know where those live (or how to configure them).
Please advise!

Thanks,

Jason
On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote:

> Controlled shutdown can fail if the cluster has a nonzero under-replicated
> partition count, since that means the leaders may not be able to move off
> of the broker being shut down. The backoff might help if the under-replication
> is just temporary, e.g. due to a spike in traffic. This is the most common
> reason it might fail, besides bugs. But you can check the logs to see why
> the shutdown failed.
>
> Thanks,
> Neha
> On Oct 25, 2013 1:18 AM, "Jason Rosenberg" <[EMAIL PROTECTED]> wrote:
>
> > I'm running into an issue where sometimes the controlled shutdown fails
> > to complete after the default 3 retry attempts.  In one case, this ended
> > with a broker undergoing an unclean shutdown, and it was in a rather bad
> > state after restart.  Producers would connect to the metadata vip, still
> > think that this broker was the leader, fail on that leader, then
> > reconnect to the metadata vip and get sent back to that same failed
> > broker!  Does that make sense?
> >
> > I'm trying to understand the conditions which cause the controlled
> > shutdown to fail.  There doesn't seem to be a setting for the max amount
> > of time to wait, etc.
> >
> > It would be nice to specify how long to try before giving up (hopefully
> > giving up in a graceful way).
> >
> > Instead, we have a retry count, but it's not clear what this retry count
> > is really specifying, in terms of how long to keep trying, etc.
> >
> > Also, what are the ramifications of different settings for
> > controlled.shutdown.retry.backoff.ms?  Is there a reason we want to wait
> > before retrying?  (Again, it would be helpful to understand the reasons
> > for a controlled shutdown failure.)
> >
> > Thanks,
> >
> > Jason
> >
>