Kafka >> mail # user >> produce request failed: due to Leader not local for partition


Re: produce request failed: due to Leader not local for partition
Jun,

To be clear, this whole discussion started because I am clearly seeing
"failed due to Leader not local" on the last broker restarted, after all
the controlled shutdowns have completed and all brokers have been restarted.

This leads me to believe that a client made a metadata request and found
out that server A was the leader for its partition, then server A was
restarted, and then the client kept making produce requests to server A
without encountering a broken socket.  Thus, I'm not sure it's correct
that the socket is invalidated in that case after a restart.

Alternatively, could it be that the client (which sends messages to
multiple topics) gets metadata updates for multiple topics, but doesn't
attempt to send a message to topicX until after the leader has changed and
server A has been restarted?  In that case, if it's the first time the
producer sends to topicX, does it only then create a new socket?

Jason
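
[For reference, a minimal sketch, not code from this thread, of an 0.8-era Java
producer configured the way the scenario above assumes: request.required.acks=0
and the periodic metadata refresh.  Broker addresses, the topic name, and the
interval value are placeholders.]

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class Ack0ProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "serverA:9092,serverB:9092"); // placeholder brokers
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("request.required.acks", "0");                   // fire-and-forget: no broker ack
            props.put("topic.metadata.refresh.interval.ms", "600000"); // refresh roughly every 10 min

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));

            // With ack=0 there is no response to check; as long as the cached socket to
            // the old leader still works after a restart, sends keep using stale metadata.
            producer.send(new KeyedMessage<String, String>("topicX", "some message"));
            producer.close();
        }
    }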
On Mon, Jun 24, 2013 at 10:00 PM, Jun Rao <[EMAIL PROTECTED]> wrote:

> That should be fine since the old socket in the producer will no longer be
> usable after a broker is restarted.
>
> Thanks,
>
> Jun
>
>
> On Mon, Jun 24, 2013 at 9:50 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
>
> > What about a non-controlled shutdown and a restart, but where the producer
> > never attempts to send anything during the time the broker was down?  That
> > could have caused a leader change, but without the producer knowing to
> > refresh its metadata, no?
> >
> >
> > On Mon, Jun 24, 2013 at 9:05 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
> >
> > > Other than controlled shutdown, the only other case that can cause the
> > > leader to change when the underlying broker is alive is when the broker
> > > expires its ZK session (likely due to GC), which should be rare. That
> > > being said, forwarding in the broker may not be a bad idea. Could you
> > > file a jira to track this?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Mon, Jun 24, 2013 at 2:50 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
> > >
> > > > Yeah,
> > > >
> > > > I see that with ack=0, the producer will be in a bad state any time the
> > > > leader for its partition has changed while the broker that it thinks is
> > > > the leader is still up.  So this is a problem in general, not only for
> > > > controlled shutdown, but even for the case where you've restarted a
> > > > server (without controlled shutdown), which in and of itself can force
> > > > a leader change.  If the producer doesn't attempt to send a message
> > > > during the time the broker was down, it will never get a connection
> > > > failure, never get fresh metadata, and will subsequently start sending
> > > > messages to the non-leader.
> > > >
> > > > Thus, I'd say this is a problem with ack=0, regardless of controlled
> > > > shutdown.  Any time there's a leader change, the producer will send
> > > > messages into the ether.  I think this is actually a severe condition
> > > > that could be considered a bug.  How hard would it be to have the
> > > > receiving broker forward on to the leader in this case?
> > > >
> > > > Jason
> > > >
> > > >
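
[To make the failure mode described in the message above concrete, a sketch, an
editor's illustration rather than code from the thread, of a send that fails
silently under ack=0.  Broker address and topic name are placeholders.]

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class SilentLossSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "serverA:9092");   // placeholder broker
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("request.required.acks", "0");              // no acknowledgment requested

            Producer<String, String> producer =
                    new Producer<String, String>(new ProducerConfig(props));
            try {
                // Under ack=0 this never throws for a "leader not local" condition, so a
                // message sent to a stale leader simply disappears.  With
                // request.required.acks=1 the same call would instead fail after its
                // retries (refreshing metadata along the way) rather than losing data
                // silently.
                producer.send(new KeyedMessage<String, String>("topicX", "lost if the leader has moved"));
            } finally {
                producer.close();
            }
        }
    }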
> > > > On Mon, Jun 24, 2013 at 8:44 AM, Joel Koshy <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > I think Jason was suggesting quiescent time as a possibility only if
> > > > > the broker did request forwarding if it is not the leader.
> > > > >
> > > > > On Monday, June 24, 2013, Jun Rao wrote:
> > > > >
> > > > > > Jason,
> > > > > >
> > > > > > The quiescence time that you proposed won't work. The reason is
> > > > > > that with ack=0, the producer starts losing data silently from the
> > > > > > moment the leader is moved (by controlled shutdown) until the broker
> > > > > > is shut down. So, the sooner that you can shut down the broker, the
> > > > > > better. What we realized is that if you can use a larger batch size,
> > > > > > ack=1 can still deliver
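
[As a rough illustration of the alternative described above, ack=1 combined with
a larger batch via the async producer, here is a hedged configuration sketch.
The property names are the 0.8 producer's; the values and broker addresses are
only examples.]

    import java.util.Properties;

    import kafka.producer.ProducerConfig;

    public class Ack1BatchingSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "serverA:9092,serverB:9092"); // placeholder brokers
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("request.required.acks", "1");     // leader acknowledges each produce request
            props.put("producer.type", "async");         // buffer sends in a background thread
            props.put("batch.num.messages", "200");      // larger batches amortize the ack round trip
            props.put("queue.buffering.max.ms", "100");  // flush the buffer at least every 100 ms

            ProducerConfig config = new ProducerConfig(props); // pass to a Producer as usual
        }
    }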

 