Kafka, mail # user - Kafka 0.8 Failover Behavior


Re: Kafka 0.8 Failover Behavior
Jun Rao 2013-06-28, 06:21
A broker that is still in the ISR recorded in ZooKeeper has all committed data.

Thanks,

Jun
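The ISR membership Jun refers to can be read straight out of ZooKeeper. A minimal sketch, assuming Kafka 0.8's znode layout (`/brokers/topics/<topic>/partitions/<n>/state`, e.g. via `zkCli.sh get ...`) and a hypothetical sample payload:

```python
import json

# Each partition's leader and ISR are recorded in ZooKeeper at
#   /brokers/topics/<topic>/partitions/<partition>/state
# A sample payload (hypothetical values) looks like:
state_json = '{"controller_epoch":3,"leader":1,"version":1,"leader_epoch":7,"isr":[1,2]}'

def isr_brokers(payload: str) -> list[int]:
    """Return the broker ids currently in the in-sync replica set."""
    return json.loads(payload)["isr"]

print(isr_brokers(state_json))  # brokers 1 and 2 hold all committed data
```

Any broker id listed under `isr` in that znode is guaranteed to have all committed messages for the partition.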
On Thu, Jun 27, 2013 at 5:04 PM, Vadim Keylis <[EMAIL PROTECTED]> wrote:

> Jun,
> Does Kafka provide a way to require that a broker be back in the in-sync
> replica set before it becomes available?
> If all brokers crash, is it possible to find out which node has the most
> recent data, so that a proper startup order can be followed?
>
> Thanks,
> Vadim
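When every broker is down, the on-disk logs themselves give a rough heuristic for Vadim's second question. A minimal sketch, assuming 0.8's segment naming (each segment file is named after the offset of its first message) and hypothetical per-broker directory listings; this only bounds how much data each broker holds, it is not an exact last-offset check:

```python
# Kafka names each log segment after its base offset, e.g.
# 00000000000000068524.log. For one partition, the broker whose newest
# segment has the highest base offset (ties broken by segment size) has
# at least as much data as the others. Hypothetical listings:
broker_logs = {
    "broker1": {"00000000000000068524.log": 1_048_576},  # name -> size in bytes
    "broker2": {"00000000000000068524.log": 524_288},    # shorter last segment
}

def freshness(segments: dict[str, int]) -> tuple[int, int]:
    """(highest base offset, size of that segment) for one partition dir."""
    name = max(segments, key=lambda n: int(n.removesuffix(".log")))
    return int(name.removesuffix(".log")), segments[name]

most_recent = max(broker_logs, key=lambda b: freshness(broker_logs[b]))
print(most_recent)  # broker1
```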
>
>
> On Fri, Jun 21, 2013 at 8:24 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
>
> > Hi, Bob,
> >
> > Thanks for reporting this. Yes, this is the current behavior when all
> > brokers fail: whichever broker comes back first becomes the new leader
> > and is the source of truth. This increases availability, but previously
> > committed data can be lost. This is what we call an unclean leader
> > election. The other option is to wait for a broker in the in-sync
> > replica set to come back before electing a new leader. This preserves
> > all committed data at the expense of availability. The application can
> > configure the system with the appropriate option based on its needs.
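The tradeoff described above can be sketched as a toy election rule (not Kafka's actual implementation; in later Kafka releases the choice is exposed as the broker setting `unclean.leader.election.enable`):

```python
# Toy model of the leader-election tradeoff: prefer a live in-sync replica;
# if none is alive, either elect any live replica (unclean: available but
# may lose committed data) or wait (consistent but unavailable).
def elect_leader(live, isr, allow_unclean):
    """Return a broker id, or None to mean 'wait for an ISR member'."""
    in_sync_live = [b for b in live if b in isr]
    if in_sync_live:
        return in_sync_live[0]          # safe: has all committed data
    if allow_unclean and live:
        return live[0]                  # available, but may lose data
    return None                         # preserve data: stay unavailable

print(elect_leader(live=[3], isr=[1, 2], allow_unclean=True))   # 3 (unclean)
print(elect_leader(live=[3], isr=[1, 2], allow_unclean=False))  # None (wait)
```

In the incident reported below, the rebuilt (empty) broker played the role of broker 3 here: it was electable even though it was not in the ISR.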
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Fri, Jun 21, 2013 at 4:08 PM, Bob Jervis <
> > [EMAIL PROTECTED]
> > > wrote:
> >
> > > I wanted to send this out because we saw this in some testing we were
> > > doing, to advise the community of something to watch for in 0.8 HA
> > > support.
> > >
> > > We have a two-machine cluster with replication factor 2.  We took one
> > > machine offline and re-formatted the disk.  We re-installed the Kafka
> > > software, but did not recreate any of the local disk files.  The
> > > intention was simply to restart the broker process, but due to an
> > > error in the network config that took some time to diagnose, we ended
> > > up with both machines' brokers down.
> > >
> > > When we fixed the network config and restarted the brokers, we
> > > happened to start the broker on the rebuilt machine first.  The net
> > > result was that when the healthy broker came back online, the rebuilt
> > > machine was already the leader, and because of the ZooKeeper state, it
> > > forced the healthy broker to delete all of its topic data, thus wiping
> > > out the entire contents of the cluster.
> > >
> > > We are instituting operations procedures to safeguard against this
> > > scenario in the future (and fortunately we only blew away a test
> > > cluster), but this was a bit of a nasty surprise for a Friday.
> > >
> > > Bob Jervis
> > > Visibletechnologies
> > >
> > >
> >
>