Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)
Ok. So, it seems that the issue is that there are lots of rebalances in the
consumer. What did you set the ZK session expiration time to? A typical
reason for many rebalances is GC on the consumer side. If so, you will see
ZooKeeper session expirations in the consumer log (grep for Expired).
Occasional rebalances are fine. Too many rebalances can slow down
consumption, and one will need to tune the Java GC settings.
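
For reference, a minimal Java sketch of the two checks above: raising the ZooKeeper session timeout in the consumer properties, and counting session expirations in a consumer log. The property keys are the 0.8 high-level consumer settings; the values (30000, 5000, 8), the ZooKeeper hosts, the group name, and the sample log lines are illustrative assumptions, not recommendations made in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class ConsumerTuningSketch {

    // Builds properties for the 0.8 high-level consumer with a longer
    // ZooKeeper session timeout, so a GC pause is less likely to expire
    // the session. All numeric values here are assumptions.
    static Properties consumerProps(String zkConnect, String groupId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", zkConnect);
        props.put("group.id", groupId);
        props.put("zookeeper.session.timeout.ms", "30000"); // assumed value
        props.put("rebalance.backoff.ms", "5000");          // assumed value
        props.put("rebalance.max.retries", "8");            // assumed value
        return props;
    }

    // Equivalent of "grep -c Expired": counts the log lines that mention
    // a session expiration.
    static long countExpirations(List<String> logLines) {
        long count = 0;
        for (String line : logLines) {
            if (line.contains("Expired")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical hosts and group name.
        Properties props = consumerProps("zk1:2181,zk2:2181,zk3:2181", "log-processors");
        System.out.println("session timeout: "
                + props.getProperty("zookeeper.session.timeout.ms"));

        // Hypothetical log lines, only to exercise the counter.
        List<String> sample = Arrays.asList(
                "zookeeper state changed (Expired)",
                "begin rebalancing consumer",
                "zookeeper state changed (SyncConnected)");
        System.out.println("expirations: " + countExpirations(sample));
    }
}
```

If the expiration count tracks the rebalance count, GC pauses longer than the session timeout are the likely cause; raising the timeout only hides them, so GC tuning is still needed.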

Thanks,

Jun
On Sat, Jul 27, 2013 at 9:33 AM, Hargett, Phil <
[EMAIL PROTECTED]> wrote:

> All bugs are relative, aren't they? :)
>
> Well, since this thread attempts to rebalance every 200 milliseconds,
> these messages REALLY fill up a log and fast.
>
> Because this error results in so much log output, it makes it difficult to
> find other actionable error messages in the log.
>
> Yes, I could suppress messages from that class (we use log4j after all)
> but I am uncomfortable 1) hiding a thread leak, 2) hiding other possible
> errors from the same class.
>
> I filed this as KAFKA 989 (IIRC), as I did not see an obvious bug that
> covers it.
>
> This error also happens in less than 1 day of use: most of our systems in
> this category are up for 2-3 months before a software upgrade or other
> event causes us to cycle the process.
>
> I'm sure you have uptime and scaling requirements far beyond ours. So I
> hope these reasons don't seem too petty. :)
>
>
> On Jul 27, 2013, at 12:24 AM, "Jun Rao" <[EMAIL PROTECTED]> wrote:
>
> Other than those exceptions, what issues do you see in your consumer?
>
> Thanks,
>
> Jun
>
>
> On Fri, Jul 26, 2013 at 9:24 AM, Hargett, Phil <[EMAIL PROTECTED]> wrote:
> This is NOT a harmless race.
>
> Now my QA teammate is encountering this issue under load. The result of it
> is a background thread that is spinning in a loop that always hits a
> NullPointerException.
>
> I have implemented a variety of safeguards in my application code to
> ensure that the high-level consumer I'm spinning up in Java stays alive for
> at least 10 seconds before being asked to shut down.  Yet the issue still
> persists.
>
> Suggestions?
> ________________________________________
> From: Jun Rao [[EMAIL PROTECTED]]
> Sent: Tuesday, June 25, 2013 11:58 PM
> To: [EMAIL PROTECTED]
> Subject: Re: 0.8 throwing exception "Failed to find leader" and high-level
> consumer fails to make progress
>
> The exception is likely due to a race condition between the logic in the ZK
> watcher and the closing of the ZK connection. It's harmless, except for the
> weird exception.
>
> Thanks,
>
> Jun
>
>
> On Tue, Jun 25, 2013 at 10:07 AM, Hargett, Phil <[EMAIL PROTECTED]> wrote:
>
> > Possibly.
> >
> > I see evidence that it's being stopped / started every 30 seconds in some
> > cases (due to my code). It's entirely possible that I have a race, too, in
> > that 2 separate pieces of code could be triggering such a stop / start.
> >
> > Gives me something to track down. Thank you!!
> >
> > On Jun 25, 2013, at 12:45 PM, "Jun Rao" <[EMAIL PROTECTED]> wrote:
> >
> > > This typically only happens when the consumerConnector is shut down.
> > > Are you restarting the consumerConnector often?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Tue, Jun 25, 2013 at 9:40 AM, Hargett, Phil <[EMAIL PROTECTED]> wrote:
> > >
> > >> Seeing this exception a LOT (3-4 times per second, same log topic).
> > >>
> > >> I'm using external code to feed data to about 50 different log topics
> > >> over a cluster of 3 Kafka 0.8 brokers.  There are 3 ZooKeeper instances
> > >> as well; all of this is running on EC2.  My application creates a
> > >> high-level consumer (1 per topic) to consume data from each and do
> > >> further processing.
> > >>
> > >> The problem is this exception is in the high-level consumer, so my

 