Kafka, mail # user - produce request failed: due to Leader not local for partition


Re: produce request failed: due to Leader not local for partition
Jason Rosenberg 2013-06-25, 05:25
Filed https://issues.apache.org/jira/browse/KAFKA-955
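For readers skimming the quoted thread below, here is a minimal sketch of the failure mode it discusses. It is a toy in-memory model, not Kafka code: `Cluster`, `Broker`, `Producer`, `NotLeaderError`, and `simulate` are all hypothetical names, and it assumes the producer's connection to broker A survives the restart (the case described in the thread). It also assumes that with ack=1 the producer sees the error, refreshes its metadata, and retries once.

```python
from dataclasses import dataclass, field


class NotLeaderError(Exception):
    """Stand-in for the broker-side 'Leader not local for partition' error."""


@dataclass
class Broker:
    name: str
    log: list = field(default_factory=list)


class Cluster:
    def __init__(self, names):
        self.brokers = {n: Broker(n) for n in names}
        self.leader = names[0]

    def produce(self, broker_name, msg):
        # Only the current leader appends; any other broker rejects the request.
        if broker_name != self.leader:
            raise NotLeaderError(f"{broker_name}: Leader not local for partition")
        self.brokers[broker_name].log.append(msg)


class Producer:
    def __init__(self, cluster, acks):
        self.cluster = cluster
        self.acks = acks
        self.cached_leader = cluster.leader  # metadata fetched once, up front
        self.dropped = 0  # demo bookkeeping only; an ack=0 producer cannot see this

    def send(self, msg):
        try:
            self.cluster.produce(self.cached_leader, msg)
        except NotLeaderError:
            if self.acks == 0:
                # Fire-and-forget: the error never reaches the producer,
                # so the cached leader stays stale and the message is lost.
                self.dropped += 1
                return
            # With an ack, the error is visible: refresh metadata and retry once.
            self.cached_leader = self.cluster.leader
            self.cluster.produce(self.cached_leader, msg)


def simulate(acks, n_messages=3):
    cluster = Cluster(["A", "B"])
    producer = Producer(cluster, acks)
    # Broker A restarts while the producer is idle: leadership moves to B,
    # A comes back as a follower, and the producer never sees a broken socket.
    cluster.leader = "B"
    for i in range(n_messages):
        producer.send(f"m{i}")
    delivered = sum(len(b.log) for b in cluster.brokers.values())
    return delivered, producer.dropped
```

In this model, `simulate(acks=0)` delivers nothing and silently drops every message, while `simulate(acks=1)` recovers after the first rejected request. Broker-side forwarding, as proposed in the thread, would amount to having `Cluster.produce` hand the message to the current leader instead of raising.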
On Mon, Jun 24, 2013 at 10:14 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:

> Jun,
>
> To be clear, this whole discussion was started, because I am clearly
> seeing "failed due to Leader not local" on the last broker restarted,
> after all the controlled shutting down has completed and all brokers
> restarted.
>
> This leads me to believe that a client made a metadata request and found
> out that server A was the leader for its partition, and then server A was
> restarted, and then the client made repeated producer requests to server
> A, without encountering a broken socket.  Thus, I'm not sure it's correct
> that the socket is invalidated in that case after a restart.
>
> Alternatively, could it be that the client (which sends messages to
> multiple topics) gets metadata updates for multiple topics, but doesn't
> attempt to send a message to topicX until after the leader has changed and
> server A has been restarted?  In this case, if it's the first time the
> producer sends to topicX, does it only then create a new socket?
>
> Jason
>
>
> On Mon, Jun 24, 2013 at 10:00 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
>
>> That should be fine since the old socket in the producer will no longer be
>> usable after a broker is restarted.
>>
>> Thanks,
>>
>> Jun
>>
>>
>> On Mon, Jun 24, 2013 at 9:50 PM, Jason Rosenberg <[EMAIL PROTECTED]>
>> wrote:
>>
>> > What about a non-controlled shutdown, and a restart, but the producer
>> > never attempts to send anything during the time the broker was down?
>> > That could have caused a leader change, but without the producer
>> > knowing to refresh its metadata, no?
>> >
>> >
>> > On Mon, Jun 24, 2013 at 9:05 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
>> >
>> > > Other than controlled shutdown, the only other case that can cause the
>> > > leader to change when the underlying broker is alive is when the
>> > > broker expires its ZK session (likely due to GC), which should be
>> > > rare. That being said, forwarding in the broker may not be a bad
>> > > idea. Could you file a jira to track this?
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > >
>> > > On Mon, Jun 24, 2013 at 2:50 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > Yeah,
>> > > >
>> > > > I see that with ack=0, the producer will be in a bad state any time
>> > > > the leader for its partition has changed while the broker that it
>> > > > thinks is the leader is still up.  So this is a problem in general,
>> > > > not only for controlled shutdown, but even for the case where you've
>> > > > restarted a server (without controlled shutdown), which in and of
>> > > > itself can force a leader change.  If the producer doesn't attempt to
>> > > > send a message during the time the broker was down, it will never get
>> > > > a connection failure, and never get fresh metadata, and will
>> > > > subsequently start sending messages to the non-leader.
>> > > >
>> > > > Thus, I'd say this is a problem with ack=0, regardless of controlled
>> > > > shutdown.  Any time there's a leader change, the producer will send
>> > > > messages into the ether.  I think this is actually a severe condition
>> > > > that could be considered a bug.  How hard would it be to have the
>> > > > receiving broker forward on to the leader, in this case?
>> > > >
>> > > > Jason
>> > > >
>> > > >
>> > > > On Mon, Jun 24, 2013 at 8:44 AM, Joel Koshy <[EMAIL PROTECTED]> wrote:
>> > > >
>> > > > > I think Jason was suggesting quiescent time as a possibility only
>> > > > > if the broker did request forwarding if it is not the leader.
>> > > > >
>> > > > > On Monday, June 24, 2013, Jun Rao wrote:
>> > > > >
>> > > > > > Jason,
>> > > > > >
>> > > > > > The quiescence time that you proposed won't work. The reason is
>> > > > > > that with ack=0, the producer starts losing data silently from
>> > > > > > the moment