Kafka >> mail # dev >> [jira] [Commented] (KAFKA-1106) HighwaterMarkCheckpoint failure puting broker into a bad state

[jira] [Commented] (KAFKA-1106) HighwaterMarkCheckpoint failure puting broker into a bad state

    [ https://issues.apache.org/jira/browse/KAFKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808761#comment-13808761 ]

Jay Kreps commented on KAFKA-1106:

Yeah, a corrupted offset file would lead to this (though it could also be some other bug). We do shut down the broker on any I/O error, since that means we don't know the state of the data on disk and need to run recovery. Do you have the log from that previous shutdown?

If the offset checkpoint is corrupt, I think the desired behavior is for the node to crash. So in that case I think the problem is that we throw a NumberFormatException, which we probably don't handle correctly, instead of an IOException, which would cause the broker to shoot itself in the head.

Let's do this: I'll fix the parsing logic on trunk so that any unparsable file throws IOException. This will let us gracefully handle corruption in the file. I'm still not convinced that this is file corruption and not just some bug in our code, but without the actual file it's a little hard to know. If you can reproduce it on another machine, that proves it is a bug. If so, grab the file; I suspect it will give a clue as to what is going on.
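To make the proposed fix concrete, here is a minimal sketch of a parser that reports any unparsable content as an IOException rather than letting a NumberFormatException escape. It is not the actual trunk change: the class name CheckpointParser, the "topic-partition" map key, and the file layout (an entry count on the first line, then one "topic partition offset" triple per line) are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not Kafka's real class. Assumed layout:
//   line 1:  number of entries
//   line 2+: "<topic> <partition> <offset>"
final class CheckpointParser {

    static Map<String, Long> parse(BufferedReader reader) throws IOException {
        try {
            String countLine = reader.readLine();
            if (countLine == null) {
                throw new IOException("Checkpoint file is empty");
            }
            int expected = Integer.parseInt(countLine.trim());

            Map<String, Long> offsets = new LinkedHashMap<>();
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // tolerate blank lines
                }
                String[] parts = line.trim().split("\\s+");
                if (parts.length != 3) {
                    throw new IOException("Malformed checkpoint line: " + line);
                }
                int partition = Integer.parseInt(parts[1]);
                long offset = Long.parseLong(parts[2]);
                offsets.put(parts[0] + "-" + partition, offset);
            }

            if (offsets.size() != expected) {
                throw new IOException("Expected " + expected
                        + " entries but found " + offsets.size());
            }
            return offsets;
        } catch (NumberFormatException e) {
            // Surface bad numbers as an I/O failure so the caller's existing
            // I/O error handling (for example, shutting down) applies.
            throw new IOException("Unparsable checkpoint file", e);
        }
    }
}
```

With this shape, a caller only needs to handle IOException: a truncated file, a malformed line, and a garbled number all take the same failure path as a disk read error.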

> HighwaterMarkCheckpoint failure puting broker into a bad state
> --------------------------------------------------------------
>                 Key: KAFKA-1106
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1106
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8
>            Reporter: David Lao
>         Attachments: KAFKA-1106-patch, kafka.log
> I'm encountering a case where a broker gets stuck because HighwaterMarkCheckpoint fails to recover after reading what appear to be corrupted ISR entries. Once in this state, leader election can never succeed, which stalls the entire cluster.
> Please see the detailed stack trace in the attached log. Perhaps failing fast when HighwaterMarkCheckpoint fails to read would force the broker to restart and recover.
