Kafka, mail # dev - [jira] [Commented] (KAFKA-881) Kafka broker not respecting log.roll.hours


Dan F 2013-05-01, 06:08

Jun Rao 2013-05-03, 15:58

    [ https://issues.apache.org/jira/browse/KAFKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648519#comment-13648519 ]

Jun Rao commented on KAFKA-881:
-------------------------------

If we "don't reuse files after a restart", it may fragment the log files, since it affects all logs regardless of whether their retention is time-based or how long the retention time is. The second option seems better. Since the major development is on 0.8 now, I suggest that we patch this in trunk instead of 0.7.
                
> Kafka broker not respecting log.roll.hours
> ------------------------------------------
>
>                 Key: KAFKA-881
>                 URL: https://issues.apache.org/jira/browse/KAFKA-881
>             Project: Kafka
>          Issue Type: Bug
>          Components: log
>    Affects Versions: 0.7.2
>            Reporter: Dan F
>            Assignee: Jay Kreps
>
> We are running Kafka 0.7.2. We set log.roll.hours=1. I hoped that meant logs would be rolled at least every hour. However, sometimes logs that are many hours (sometimes days) old have more data added to them. This perturbs our systems for reasons I won't get into.
> I don't know Scala or Kafka well, but I have a proposal for why this might happen: upon restart, a broker forgets when its log files were appended to ("firstAppendTime"). Then, a potentially infinite amount of time later, the restarted broker receives another message for the particular (topic, partition) and starts the clock again. It will then roll over that log after an hour.
> https://svn.apache.org/repos/asf/kafka/branches/0.7/core/src/main/scala/kafka/server/KafkaConfig.scala says:
>   /* the maximum time before a new log segment is rolled out */
>   val logRollHours = Utils.getIntInRange(props, "log.roll.hours", 24*7, (1, Int.MaxValue))
> https://svn.apache.org/repos/asf/kafka/branches/0.7/core/src/main/scala/kafka/log/Log.scala has maybeRoll, which needs segment.firstAppendTime defined. It also has updateFirstAppendTime(), which sets firstAppendTime only if it is empty.
> If my hypothesis is correct about why it is happening, here is a case where rolling is longer than an hour, even on a high volume topic:
> - write to a topic for 20 minutes
> - restart the broker
> - wait for 5 days
> - write to a topic for 20 minutes
> - restart the broker
> - write to a topic for an hour
> The rollover time was now 5 days, 1 hour, 40 minutes. You can make it as long as you want.
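The scenario above can be sketched as a toy simulation. This is a hypothetical sketch in Java for runnability (the Kafka 0.7 source itself is Scala); the class and method names are invented, not actual Kafka code, and it assumes only the behavior described in this report: firstAppendTime is held in memory and lost on restart.

```java
// Sketch of the reported behavior: firstAppendTime lives only in memory,
// so a broker restart clears it and the roll clock restarts on the next append.
public class RollClockSketch {
    static final long ROLL_MS = 60 * 60 * 1000L; // log.roll.hours = 1

    // In-memory only, like segment.firstAppendTime in 0.7's Log.scala.
    static Long firstAppendTime = null;

    static void append(long nowMs) {
        if (firstAppendTime == null) firstAppendTime = nowMs;
    }

    // Roll once at least log.roll.hours has elapsed since the first append.
    static boolean shouldRoll(long nowMs) {
        return firstAppendTime != null && nowMs - firstAppendTime >= ROLL_MS;
    }

    static void restart() { firstAppendTime = null; } // state is forgotten

    public static void main(String[] args) {
        long minute = 60 * 1000L;
        long now = 0L;
        append(now); now += 20 * minute;          // write for 20 minutes
        restart();   now += 5 * 24 * 60 * minute; // restart, wait 5 days
        append(now); now += 20 * minute;          // write for 20 minutes
        restart();                                // restart again
        append(now); now += 60 * minute;          // write for an hour
        // The segment is 5 days, 1 hour, 40 minutes old but only rolls now.
        System.out.println(now / minute);         // prints 7300
        System.out.println(shouldRoll(now));      // prints true
    }
}
```

Each restart resets the clock, so the sum of the idle gaps ends up inside one segment: 20 min + 5 days + 20 min + 1 h = 7300 minutes.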
> Proposed solution:
> The very easiest thing to do would be to have Kafka re-initialize firstAppendTime from the file creation time. Unfortunately, UNIX does not record a file creation time. There is only ctime, the change time, which is updated whenever a file's inode information changes.
> One solution is to embed the firstAppendTime in the filename (say, seconds since epoch). Then when you open it you could reset firstAppendTime to exactly what it really was. This ignores clock drift or resetting. One could set firstAppendTime to min(filename-based time, current time).
> A second solution is to make the Kafka log roll over at specific times, regardless of when the file was created. Conceptually, time can be divided into windows of size log.rollover.hours since the epoch (UNIX time 0, 1970). So, when firstAppendTime is empty, compute the next rollover time (say, next = (hours since epoch) - ((hours since epoch) % log.rollover.hours) + log.rollover.hours). If the file mtime (last modified) is before the current rollover window ( (next - log.rollover.hours) .. next ), roll it over right away. Otherwise, roll over when you cross next, and reset next.
> A third solution (not perfect, but an approximation at least) would be not to write to a segment if firstAppendTime is not defined and the timestamp on the file is more than log.roll.hours old.
> There are probably other solutions.
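The second proposal's window arithmetic can be sketched as follows. This is a minimal illustration under the assumptions stated in the report, again in Java; the helper names are hypothetical and not part of Kafka.

```java
// Sketch of the second proposal: divide time since the epoch into fixed
// windows of the roll interval and always roll at window boundaries,
// regardless of broker restarts.
public class WindowRollSketch {
    // End of the window containing nowMs, i.e. the next rollover time.
    static long nextRollover(long nowMs, long rollMs) {
        return (nowMs / rollMs + 1) * rollMs;
    }

    // Roll immediately if the file was last modified before the current window.
    static boolean shouldRollNow(long fileMtimeMs, long nowMs, long rollMs) {
        return fileMtimeMs < nextRollover(nowMs, rollMs) - rollMs;
    }

    public static void main(String[] args) {
        long hour = 3600000L;
        // 90 minutes after the epoch, the next hourly boundary is hour 2.
        System.out.println(nextRollover(90 * 60 * 1000L, hour) / hour); // prints 2
        // A file last touched at minute 30 belongs to a past window: roll now.
        System.out.println(shouldRollNow(30 * 60 * 1000L, 90 * 60 * 1000L, hour)); // prints true
    }
}
```

Because the boundary depends only on the clock and the interval, a restarted broker recomputes the same rollover time it would have used before the restart.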

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

 
Dan F 2013-05-08, 00:27
Dan F 2013-05-08, 22:41
Jun Rao 2013-05-30, 04:05
Sam Meder 2013-08-01, 19:47