In our broker logs, we occasionally see errors like this:
2013-12-23 05:02:08,456 ERROR [kafka-request-handler-2] server.KafkaApis - [KafkaApi-45] Error when processing fetch request for partition [mytopic,0] offset 204243601 from consumer with correlation id 130341 kafka.common.OffsetOutOfRangeException: Request for offset 204243601 but we only have log segments in the range 204343397 to 207423640.
I assume this means a consumer has fallen behind, and the log retention policy has removed messages before they could be consumed.
However, I'm not 100% sure which consumer it is, and it looks like the only info we have is the correlation id of the consumer, e.g.:
"from consumer with correlation id 130341"
Is there a way to know which consumer this refers to? It seems there are far more correlation ids than there are consumers. Would it be possible to provide a more descriptive error message here, so we can immediately know which consumer is falling behind?
We do see a corresponding entry in the consumer logs too:
2013-12-23 05:02:08,797 WARN [ConsumerFetcherThread-myconsumergroup-1387353494862-7aa0c61d-0-45] consumer.ConsumerFetcherThread - [ConsumerFetcherThread-myconsumergroup-1387353494862-7aa0c61d-0-45], Current offset 204243601 for partition [mytopic,0] out of range; reset offset to 204343397
But it would be nice to be able to also use the broker log to quickly find consumers with issues.
Also, I'm not sure why this event is logged as an ERROR in the broker but only a WARN in the consumer?
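For context on the reset behavior above, here is a minimal sketch, assuming the 0.8-era high-level consumer: the broker deletes old segments per its retention settings (log.retention.hours / log.retention.bytes), and when a fetch then asks for an offset that is gone, the fetcher thread resets according to the consumer's auto.offset.reset. The ZooKeeper address and group name below are placeholders.

import java.util.Properties;
import kafka.consumer.ConsumerConfig;

public class OffsetResetConfigSketch {
    public static ConsumerConfig build() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");   // placeholder ZK address
        props.put("group.id", "myconsumergroup");
        // "smallest" jumps back to the oldest retained offset, which matches the
        // "reset offset to 204343397" line above; "largest" would skip to the end
        // of the log instead. Either way, the deleted messages are gone.
        props.put("auto.offset.reset", "smallest");
        return new ConsumerConfig(props);
    }
}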
So, sometimes I just get the WARN from the ConsumerFetcherThread (as noted above), e.g.:
2014-01-08 02:31:47,394 WARN [ConsumerFetcherThread-myconsumerapp-11] consumer.ConsumerFetcherThread - [ConsumerFetcherThread-myconsumerapp-11], Current offset 16163904970 for partition [mypartition,0] out of range; reset offset to 16175326044
More recently, I also see the following ERROR log line (not sure why I didn't see it previously), coming from the ConsumerIterator:
2014-01-08 02:31:47,681 ERROR [myconsumerthread-0] consumer.ConsumerIterator - consumed offset: 16163904970 doesn't match fetch offset: 16175326044 for mytopic:0: fetched offset = 16175330598: consumed offset = 16163904970; Consumer may lose data
Why would I not see this second ERROR every time there's a corresponding WARN on the FetcherThread for an offset reset?
Should I only be concerned about possible lost data if I see the second ERROR log line?
On Tue, Dec 24, 2013 at 3:49 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
I'm not sure I understand your question, wrt produced data?
But yes, in general, I believe the consumer is not keeping up with the broker deleting the data. So it's trying to fetch the next batch of data, but its last offset is no longer there, etc. So that's the reason for the WARN message in the fetcher thread.
I'm just not sure I understand why we don't always see the ConsumerIterator error as well; won't missing data always be detected there? Why sometimes and not always? What's the difference?
Jason
On Wed, Jan 8, 2014 at 12:07 AM, Jun Rao <[EMAIL PROTECTED]> wrote:
Normally, if the consumer can't keep up, you should just see the OffsetOutOfRangeException warning. The offset mismatch error should never happen. It could be that OffsetOutOfRangeException exposed a bug. Do you think you can reproduce this easily?
Jun
On Tue, Jan 7, 2014 at 9:29 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
Yes, it's happening continuously, at the moment (although I'm expecting the consumer to catch up soon)....
It seemed to start happening after I refactored the consumer app to use multiple consumer connectors in the same process (each one has a separate topic filter, so there should be no overlap between them), all using the same consumer group.
Could it be a thread-safety issue in the ZookeeperConsumerConnector (seems unlikely)?
Jason
On Wed, Jan 8, 2014 at 1:04 AM, Jun Rao <[EMAIL PROTECTED]> wrote:
Thinking/typing out loud here; not sure if this is the problem, but it could be, so I figure I'll throw it out there.
The ZookeeperConsumerConnector has a messageStreamCreated atomic boolean that stops message streams by filter from being created more than once on the same connector...
Do you have separately instantiated ConsumerConnector objects (i.e., calling ConsumerConnector.create for each one), or do you have one ConsumerConnector.create and then call createMessageStreams multiple times on it (which would cause a RuntimeException, so probably not, but maybe your code is swallowing that)? If it's the latter, that could explain why it is lagging behind unexpectedly, since really only one ConsumerConnector would be running...
Again, you might not be doing this, but I figure I'll throw it out there in case you were.
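For what it's worth, a minimal sketch of the two patterns being described, assuming the 0.8 high-level consumer Java API; the topic filters, client ids, and ZooKeeper address are made up:

import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.Whitelist;
import kafka.javaapi.consumer.ConsumerConnector;

public class MultiConnectorSketch {

    // One config per connector; setting client.id per connector lets the
    // per-consumer metrics be told apart (all values are placeholders).
    static ConsumerConfig config(String clientId) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");
        props.put("group.id", "myconsumergroup");
        props.put("client.id", clientId);
        return new ConsumerConfig(props);
    }

    public static void main(String[] args) {
        // Pattern 1: a separate connector per topic filter, each from its own
        // create call. The returned streams would normally be handed off to
        // consumer threads.
        ConsumerConnector largeTopics =
                Consumer.createJavaConsumerConnector(config("large-topic-consumer"));
        largeTopics.createMessageStreamsByFilter(new Whitelist("mylargetopic"), 1);

        ConsumerConnector otherTopics =
                Consumer.createJavaConsumerConnector(config("other-topic-consumer"));
        otherTopics.createMessageStreamsByFilter(new Whitelist("myothertopic.*"), 1);

        // Pattern 2 (the case being warned about): calling
        // createMessageStreamsByFilter a second time on the SAME connector,
        // which the messageStreamCreated guard turns into a RuntimeException,
        // leaving only the first filter actually consuming.
        // largeTopics.createMessageStreamsByFilter(new Whitelist("myothertopic.*"), 1);
    }
}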
/*******************************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/
On Wed, Jan 8, 2014 at 1:44 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
The consumption rate is a little better after the refactoring. The main issue, though, was that we had a mismatch between large and small topics. A large topic can lag and adversely affect consumption of other topics, so this is an attempt to isolate topic filtering and better balance the consumers for the different topics.
So, it's definitely working on that score.
The topic that was lagging (and getting OffsetOutOfRangeExceptions) was doing that before and after the refactor (and after we started also seeing the ERROR logging). But consumption of all other topics is working better now (almost no lag at all).
I'm also setting the client.id for each consumer in the process, so that I can see the individual metrics per consumer.
Jason
On Thu, Jan 9, 2014 at 1:00 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
well, not currently, as we don't have multiple partitions for the topics.....but yes, I understand that would help too....
But we are using this multiple-consumers-within-a-process approach in general with much success so far... just was curious about this ERROR I was seeing :)
On Fri, Jan 10, 2014 at 11:06 AM, Jun Rao <[EMAIL PROTECTED]> wrote:
Not sure, but I'll try (it's a bit difficult to create a test-case, because it requires a good bit of integration testing, etc.).
Jason
On Sat, Jan 11, 2014 at 12:06 AM, Jun Rao <[EMAIL PROTECTED]> wrote: