Kafka >> mail # user >> Controlled shutdown failure, retry settings


Re: Controlled shutdown failure, retry settings
Jason,

The state change log tool is described here -
https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool

I'm curious what the IOException is and whether we can improve error reporting.
Can you send around the stack trace?

Thanks,
Neha
On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:

> Ok,
>
> Looking at the controlled shutdown code, it appears that it can also fail
> with an IOException, in which case it won't report the remaining partitions
> to replicate, etc. I think that might be what I'm seeing, since I never
> saw the log line for "controlled shutdown failed, X remaining partitions".
> It's happening during a rolling restart, and the second of 3 nodes might be
> trying to shut down before the first one has completely come back up.
>
> I've heard you guys mention the controller and state change logs several
> times now, but I don't know where those live (or how to configure them).
> Please advise!
>
> Thanks,
>
> Jason
>
>
> On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
>
> > Controlled shutdown can fail if the cluster has a non-zero under-replicated
> > partition count, since that means the leaders may not move off of the
> > broker being shut down. The backoff might help if the under-replication is
> > just temporary due to a spike in traffic. This is the most common reason
> > it might fail, besides bugs. But you can check the logs to see why the
> > shutdown failed.
> >
> > Thanks,
> > Neha
> > On Oct 25, 2013 1:18 AM, "Jason Rosenberg" <[EMAIL PROTECTED]> wrote:
> >
> > > I'm running into an issue where the controlled shutdown sometimes fails
> > > to complete after the default 3 retry attempts. In one case this ended
> > > with a broker undergoing an unclean shutdown, and it was in a rather bad
> > > state after restart. Producers would connect to the metadata vip, still
> > > think that this broker was the leader, fail on that leader, reconnect to
> > > the metadata vip, and get sent back to that same failed broker! Does
> > > that make sense?
> > >
> > > I'm trying to understand the conditions that cause the controlled
> > > shutdown to fail. There doesn't seem to be a setting for the maximum
> > > amount of time to wait, etc.
> > >
> > > It would be nice to specify how long to try before giving up (hopefully
> > > giving up in a graceful way).
> > >
> > > Instead, we have a retry count, but it's not clear what this retry count
> > > is really specifying, in terms of how long to keep trying, etc.
> > >
> > > Also, what are the ramifications of different settings for
> > > controlled.shutdown.retry.backoff.ms? Is there a reason we want to wait
> > > before retrying (again, it would be helpful to understand the reasons
> > > for a controlled shutdown failure)?
> > >
> > > Thanks,
> > >
> > > Jason
> > >
> >
>
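
For readers following the thread: the question of how a retry count and a backoff translate into "how long to keep trying" can be illustrated with a small model. This is only a sketch of a generic retry-with-backoff loop, not Kafka's actual shutdown code. The names `retry_with_backoff` and `backoff_budget_ms` are hypothetical, the 3 retries come from the thread, the 5000 ms backoff is an assumed value, and sleeping after every failed attempt (including the last) is also an assumption.

```python
import time
from typing import Callable


def retry_with_backoff(attempt: Callable[[], bool],
                       max_retries: int = 3,      # "default 3 retry attempts" per the thread
                       backoff_ms: int = 5000,    # assumed value, for illustration only
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `attempt` up to `max_retries` times.

    After each failed attempt, wait `backoff_ms` milliseconds so that a
    temporary condition (e.g. under-replication from a traffic spike)
    has a chance to clear before the next try.
    """
    for _ in range(max_retries):
        if attempt():
            return True
        sleep(backoff_ms / 1000.0)
    return False


def backoff_budget_ms(max_retries: int, backoff_ms: int) -> int:
    """Total time spent waiting between attempts if every attempt fails.

    This is a lower bound on the overall time: each attempt also takes
    however long the request itself needs before it succeeds or errors out.
    """
    return max_retries * backoff_ms
```

Under these assumptions, 3 retries with a 5000 ms backoff give the cluster roughly 15 seconds of waiting time to recover, plus the duration of the attempts themselves. Raising either value lengthens that window; neither one is a hard timeout.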

 