[
https://issues.apache.org/jira/browse/KAFKA-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17846853#comment-17846853
]
Lin Siyuan commented on KAFKA-16779:
------------------------------------
I'm very sorry, Nicholas Feinberg. I misinterpreted the description. I've rolled
it back.
> Kafka retains logs past specified retention
> -------------------------------------------
>
> Key: KAFKA-16779
> URL: https://issues.apache.org/jira/browse/KAFKA-16779
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.7.0
> Reporter: Nicholas Feinberg
> Priority: Major
> Labels: expiration, retention
> Attachments: OOM.txt, kafka-20240512.log.gz, kafka-20240514.log.gz,
> kafka-ooms.png, server.log.2024-05-12.gz, server.log.2024-05-14.gz,
> state-change.log.2024-05-12.gz, state-change.log.2024-05-14.gz
>
>
> In a Kafka cluster with all topics set to four days of retention or longer
> (345600000 ms), most brokers seem to be retaining six days of data.
> This is true even for topics which have high throughput (500MB/s, 50k msgs/s)
> and thus are regularly rolling new log segments. We observe this unexpectedly
> high retention both via disk usage statistics and by requesting the oldest
> available messages from Kafka.
> Some of these brokers crashed with an 'mmap failed' error (attached). When
> those brokers started up again, they returned to the expected four days of
> retention.
> Manually restarting brokers also seems to return them to four days of
> retention. Demoting and promoting brokers has this effect only for a small
> portion of the data hosted on a broker.
> These hosts had ~170GiB of free memory available. We saw no signs of pressure
> on either system or JVM heap memory before or after they reported this error.
> Committed memory seems to be around 10%, so this doesn't seem to be an
> overcommit issue.
> This Kafka cluster was upgraded to Kafka 3.7 two weeks ago (April 29th).
> Prior to the upgrade, it was running on Kafka 2.4.
> We last reduced retention for operational reasons on May 7th, after which we
> restored retention to our default of four days. This was the second time we've
> temporarily reduced and restored retention since the upgrade. This problem
> did not manifest the previous time we did so, nor did it manifest on our
> other Kafka 3.7 clusters.
> We are running on AWS
> [d3en.12xlarge|https://instances.vantage.sh/aws/ec2/d3en.12xlarge] hosts. We
> have 23 brokers, each with 24 disks. We're running in a JBOD configuration
> (i.e. unraided).
> Because this cluster was upgraded from Kafka 2.4 and because we're using JBOD,
> we're still using ZooKeeper.
> Sample broker logs are attached. The 05-12 and 05-14 logs are from separate
> hosts. Please let me know if I can provide any further information.
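As a quick sanity check on the figures quoted above (this arithmetic is editorial, not part of the original report), the configured retention.ms value of 345600000 does correspond to exactly four days:

```python
# Sanity check: a retention.ms of 345600000 equals four days.
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms per day

retention_ms = 345_600_000  # value reported in the issue
print(retention_ms / MS_PER_DAY)  # → 4.0
```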
--
This message was sent by Atlassian Jira
(v8.20.10#820010)