Kafka, mail # user - Re: Fatal issue (was RE: 0.8 throwing exception "Failed to find leader" and high-level consumer fails to make progress)


"Hargett, Phil" 2013-07-30, 16:09
Oh, we're building from source multiple times per week, either until 0.8 comes out of beta or we ourselves slide towards production.  :)

Depending on where the builds were done (Dev vs official), we have commits 76d3905 or b1891e7. Both are more recent than beta 1, I believe.

:)

On Jul 30, 2013, at 12:01 PM, "Jun Rao" <[EMAIL PROTECTED]> wrote:

What's the revision of the 0.8 branch that you used? If that's older than the beta1 release, I recommend that you upgrade.

Thanks,

Jun
On Tue, Jul 30, 2013 at 3:09 AM, Hargett, Phil <[EMAIL PROTECTED]> wrote:
No, sorry, it didn't take 90 seconds to connect to ZK (at least I hope not). I had my consumer open for 90 secs in this case before shutting it down and disposing of it—hence any races caused by fast startup/shutdown should not have been relevant.

I build from source off of the 0.8 branch, so isn't that pretty close to beta 1?

:)

On Jul 30, 2013, at 12:22 AM, "Jun Rao" <[EMAIL PROTECTED]> wrote:

Hmm, it takes 90 secs to connect to ZK? That seems way too long. Is your ZK healthy?

Also, are you on the 0.8 beta1 release? If not, could you try that one? It may not be related, but we did fix some consumer side deadlock issues there.

Thanks,

Jun
On Mon, Jul 29, 2013 at 9:02 AM, Hargett, Phil <[EMAIL PROTECTED]> wrote:
I think we have 3 different classes in play here:

 * kafka.consumer.ZookeeperConsumerConnector
 * kafka.utils.ShutdownableThread
 * kafka.consumer.ConsumerFetcherManager

The actual consumer is the first one, and it does a fair amount of work *before* stopping the fetcher, which in turn shuts down the leader thread.

If the initial connectZk method in ZookeeperConsumerConnector takes a long time (more than 90 seconds in 1 case this morning), then I could see the fetcher's stopConnections method not getting called, because there isn't a ConsumerFetcherManager instance yet.

In other words, we could be shutting down the consumer before it is fully initialized—but that doesn't seem correct, as the code in ZookeeperConsumerConnector is synchronous—my application wouldn't have a reference to a partially initialized consumer.
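
To make the ordering concern concrete, here is a minimal Java sketch of the pattern under discussion: a background thread keeps dereferencing a shared client, and shutdown first stops and joins that thread before clearing the reference. Every name in it (ShutdownOrderingSketch, FakeClient, LeaderFinder) is a hypothetical stand-in and not Kafka code; it only illustrates why clearing the reference before the thread has exited could produce a NullPointerException.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-ins for illustration; none of these are Kafka classes.
final class ShutdownOrderingSketch {

    /** Stand-in for a shared resource such as a ZooKeeper client. */
    static final class FakeClient {
        String readLeader() { return "broker-0"; }
    }

    /** Background worker that keeps using the shared client until stopped. */
    static final class LeaderFinder extends Thread {
        private final AtomicReference<FakeClient> client;
        private final AtomicBoolean running = new AtomicBoolean(true);
        volatile Throwable failure; // set if the worker ever saw a null client

        LeaderFinder(AtomicReference<FakeClient> client) {
            super("leader-finder-sketch");
            this.client = client;
        }

        @Override
        public void run() {
            try {
                while (running.get()) {
                    client.get().readLeader(); // would throw NPE if the client were cleared too early
                    Thread.sleep(1);           // stand-in for a retry backoff
                }
            } catch (InterruptedException e) {
                // interrupted during backoff: treat as a stop request
            } catch (NullPointerException e) {
                failure = e;
            }
        }

        /** Signal the loop to stop and wait until the thread has actually exited. */
        void shutdownAndAwait() {
            running.set(false);
            interrupt();
            try {
                join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Runs one start/stop cycle; returns true if the worker never saw a null client. */
    static boolean runOnce() {
        AtomicReference<FakeClient> client = new AtomicReference<>(new FakeClient());
        LeaderFinder finder = new LeaderFinder(client);
        finder.start();
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        finder.shutdownAndAwait(); // 1. stop the worker and join it
        client.set(null);          // 2. only then release the shared client
        return finder.failure == null && !finder.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("clean shutdown: " + runOnce());
    }
}
```

Swapping steps 1 and 2 in runOnce would let the worker observe a null client, which is the kind of race that would have to exist somewhere for the reported exception to appear.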

Odd.

:)

On Jul 29, 2013, at 11:22 AM, "Jun Rao" <[EMAIL PROTECTED]> wrote:

There seem to be two separate issues.

1. Why do you see NullPointerException in the leaderFinder thread? I am not sure what's causing this. In the normal path, when a consumer connector is shut down (this is when the pointer is set to null), it first waits for the leaderFinder thread to shut down. Do you think that you can provide a test case that reproduces this and attach it to the jira?

2. It seems that you have lots of consumer rebalances. These are worth avoiding, since they can slow down consumption.

Thanks,

Jun
On Mon, Jul 29, 2013 at 6:21 AM, Hargett, Phil <[EMAIL PROTECTED]> wrote:
Why would a consumer that has been shut down still be rebalancing?

Zookeeper session timeout (zookeeper.session.timeout.ms) is 1000 and sync time (zookeeper.sync.timeout.ms) is 500.

Also, the timeout for the thread that looks for the leader is left at the default 200 milliseconds (refresh.leader.backoff.ms). That's why we see these messages so often in our logs.

I can imagine I need to tune some of these settings for load... but the issue, I think, is that the consumer has been shut down, so the ZkClient for the leader finder thread no longer has a connection, and never will.
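
For reference, the settings quoted above can be collected in a java.util.Properties object, and the 200 ms backoff gives a rough upper bound on how often a failing leader lookup can repeat. This is only a sketch: the class and method names are hypothetical, the property keys are copied from this message as written and should be checked against the consumer configuration reference for the version in use, and the bound ignores the time each attempt itself takes.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of Kafka.
final class ConsumerTimeoutsSketch {

    /** The values reported in this thread, keyed exactly as written in the message. */
    static Properties reportedSettings() {
        Properties props = new Properties();
        props.setProperty("zookeeper.session.timeout.ms", "1000");
        props.setProperty("zookeeper.sync.timeout.ms", "500"); // key name as written in the message
        props.setProperty("refresh.leader.backoff.ms", "200");
        return props;
    }

    /** Upper bound on retry attempts in a window, assuming each attempt fails instantly. */
    static long maxAttempts(long backoffMs, long windowMs) {
        if (backoffMs <= 0) {
            throw new IllegalArgumentException("backoffMs must be positive");
        }
        return windowMs / backoffMs;
    }

    public static void main(String[] args) {
        Properties props = reportedSettings();
        long backoffMs = Long.parseLong(props.getProperty("refresh.leader.backoff.ms"));
        System.out.println("max attempts per second: " + maxAttempts(backoffMs, 1_000));
        System.out.println("max attempts in 90 seconds: " + maxAttempts(backoffMs, 90_000));
    }
}
```

With a 200 ms backoff that is at most 5 attempts per second, which is consistent with the log messages appearing very frequently once the lookup starts failing.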
On Jul 28, 2013, at 11:21 PM, "Jun Rao" <[EMAIL PROTECTED]> wrote:

Ok. So, it seems that the issue is that there are lots of rebalances in the consumer. What did you set the ZK session expiration time to? A typical reason for many rebalances is consumer-side GC. If so, you will see Zookeeper session expirations in the consumer log (grep for Expired). Occasional rebalances are fine. Too many rebalances can slow down consumption, and one will need to tune the Java GC settings.
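
The "grep for Expired" check can also be done from Java if the logs are already being processed there. The following is a minimal sketch under that assumption: the class name is hypothetical, the log path is supplied by the caller, and it matches only the literal substring "Expired" without assuming anything about the log line format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper for illustration; not part of Kafka.
final class ExpiredSessionCounter {

    /** Counts lines containing the literal substring "Expired" (case-sensitive, like plain grep). */
    static long countExpired(List<String> lines) {
        return lines.stream().filter(line -> line.contains("Expired")).count();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ExpiredSessionCounter <consumer-log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("lines mentioning Expired: " + countExpired(lines));
    }
}
```

A count that grows steadily over time would point towards session expirations as the cause of the rebalances; a count of zero would suggest looking elsewhere.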

Thanks,

Jun
On Sat, Jul 27, 2013 at 9:33 AM, Hargett, Phil <[EMAIL PROTECTED]> wrote:
All bugs are relative, aren't th