Kafka, mail # user - produce request failed: due to Leader not local for partition


Re: produce request failed: due to Leader not local for partition
Jun Rao 2013-07-01, 04:33
Commented on the jira.

Thanks,

Jun
On Sat, Jun 29, 2013 at 6:21 AM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:

> I added this scenario to KAFKA-955.
>
> I'm thinking that this scenario could be a problem for ack=0 in general
> (even without controlled shutdown).  If we do an "uncontrolled" shutdown,
> it seems the producer won't ever learn, for some topics, that there could
> have been a leader change.  Would it make sense to force a metadata
> refresh for all topics on a broker any time an IOException happens on a
> socket (e.g. "connection reset")?  Currently, it looks like only the topic
> that experiences the failure will have a metadata refresh issued for it.
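>
> (For concreteness, a minimal sketch of the kind of producer setup I have
> in mind, using the 0.8 Java producer API; the broker names and topic are
> made up, and ack=0 plus the 10-minute metadata refresh are the settings
> assumed above:)
>
> import java.util.Properties;
> import kafka.javaapi.producer.Producer;
> import kafka.producer.KeyedMessage;
> import kafka.producer.ProducerConfig;
>
> public class AckZeroProducerSketch {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put("metadata.broker.list", "serverA:9092,serverB:9092");
>         props.put("serializer.class", "kafka.serializer.StringEncoder");
>         // ack=0: fire-and-forget, so a stale leader is only noticed
>         // when a later write fails at the socket level
>         props.put("request.required.acks", "0");
>         // metadata for all topics is refreshed on this interval
>         // (default is 10 minutes)
>         props.put("topic.metadata.refresh.interval.ms", "600000");
>
>         Producer<String, String> producer =
>             new Producer<String, String>(new ProducerConfig(props));
>         producer.send(new KeyedMessage<String, String>("topicX", "hello"));
>         producer.close();
>     }
> }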
>
> Maybe this should be a separate jira issue, now that I think about it.
>
> Jason
>
>
> On Mon, Jun 24, 2013 at 10:52 PM, Jason Rosenberg <[EMAIL PROTECTED]>
> wrote:
>
> > Also, looking back at my logs, I'm wondering if a producer will reuse the
> > same socket to send data to the same broker for multiple topics (I'm
> > guessing yes).  If so, it looks like I'm seeing this scenario:
> >
> > 1. producer1 is happily sending messages for topicX and topicY to serverA
> > (serverA is the leader for both topics, only 1 partition for each topic
> > for simplicity).
> > 2. serverA is restarted, and in the process, serverB becomes the new
> > leader for both topicX and topicY.
> > 3. producer1 decides to send a new message to topicX to serverA.
> > 3a. this results in an exception ("Connection reset by peer").
> >  producer1's connection to serverA is invalidated.
> > 3b. producer1 makes a new metadata request for topicX, and learns that
> > serverB is now the leader for topicX.
> > 3c. producer1 resends the message to topicX, on serverB.
> > 4. producer1 decides to send a new message to topicY to serverA.
> > 4a. producer1 notes that its socket to serverA is invalid, so it creates
> > a new connection to serverA.
> > 4b. producer1 successfully sends its message to serverA (without
> > realizing that serverA is no longer the leader for topicY).
> > 4c. serverA logs to its console:
> > 2013-06-23 08:28:46,770  WARN [kafka-request-handler-2] server.KafkaApis -
> > [KafkaApi-508818741] Produce request with correlation id 7136261 from
> > client  on partition [mytopic,0] failed due to Leader not local for
> > partition [mytopic,0] on broker 508818741
> > 5. producer1 continues to send messages for topicY to serverA, and
> > serverA continues to log the same messages.
> > 6. 10 minutes later, producer1 decides to update its metadata for
> > topicY, and learns that serverB is now the leader for topicY.
> > 7. the warning messages finally stop in the console for serverA.
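> >
> > (A way to observe the same leadership information that the periodic
> > refresh in step 6 eventually fetches: ask any live broker directly with
> > a TopicMetadataRequest.  The sketch below uses the 0.8 SimpleConsumer
> > API; the host, port, and client id are placeholders:)
> >
> > import java.util.Collections;
> > import kafka.javaapi.PartitionMetadata;
> > import kafka.javaapi.TopicMetadata;
> > import kafka.javaapi.TopicMetadataRequest;
> > import kafka.javaapi.TopicMetadataResponse;
> > import kafka.javaapi.consumer.SimpleConsumer;
> >
> > public class LeaderCheckSketch {
> >     public static void main(String[] args) {
> >         // ask serverB who currently leads topicY's partitions
> >         SimpleConsumer consumer = new SimpleConsumer(
> >             "serverB", 9092, 100000, 64 * 1024, "leader-check");
> >         TopicMetadataResponse resp = consumer.send(
> >             new TopicMetadataRequest(Collections.singletonList("topicY")));
> >         for (TopicMetadata tm : resp.topicsMetadata()) {
> >             for (PartitionMetadata pm : tm.partitionsMetadata()) {
> >                 System.out.println("partition " + pm.partitionId()
> >                     + " leader: "
> >                     + (pm.leader() != null ? pm.leader().host() : "none"));
> >             }
> >         }
> >         consumer.close();
> >     }
> > }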
> >
> > I am pretty sure this scenario, or one very close to it, is what I'm
> > seeing in my logs, after doing a rolling restart, with controlled
> > shutdown.
> >
> > Does this scenario make sense?
> >
> > One thing I notice is that in the steady state, every 10 minutes the
> > producer refreshes its metadata for all topics.  However, when sending a
> > message to a specific topic fails, only the metadata for that topic is
> > refreshed, even though the ramifications should be that all topics which
> > have the same leader might need to be refreshed, especially in response
> > to a "connection reset by peer".
> >
> > Jason
> >
> >
> >
> > On Mon, Jun 24, 2013 at 10:14 PM, Jason Rosenberg <[EMAIL PROTECTED]>
> > wrote:
> >
> >> Jun,
> >>
> >> To be clear, this whole discussion was started because I am clearly
> >> seeing "failed due to Leader not local" on the last broker restarted,
> >> after all the controlled shutting down has completed and all brokers
> >> restarted.
> >>
> >> This leads me to believe that a client made a metadata request and
> >> found out that server A was the leader for its partition, and then
> >> server A was restarted, and then the client makes repeated producer
> >> requests to server A, without encountering a broken socket.  Thus, I'm
> >> not sure it's correct that the socket is invalidated in that case after
> >> a restart.