[
https://issues.apache.org/jira/browse/KAFKA-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luke Chen updated KAFKA-16709:
------------------------------
Fix Version/s: 3.8.0
> alter logDir within broker might cause log cleanup hanging
> ----------------------------------------------------------
>
> Key: KAFKA-16709
> URL: https://issues.apache.org/jira/browse/KAFKA-16709
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.7.0
> Reporter: Luke Chen
> Assignee: Luke Chen
> Priority: Major
> Fix For: 3.8.0
>
>
> When altering replica logDirs, we create a future log and pause log
> cleaning for the partition
> ([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/server/ReplicaManager.scala#L1200]).
> Log cleaning is resumed after the alter replica logDirs operation
> completes
> ([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/log/LogManager.scala#L1254]).
> Resuming log cleaning decrements the LogCleaningPaused count by 1, and
> cleaning only actually resumes once the count reaches 0
> ([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/log/LogCleanerManager.scala#L310]).
> For more explanation of the LogCleaningPaused state, see
> [here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/log/LogCleanerManager.scala#L55].
>
> However, there is another factor that can increase the LogCleaningPaused
> count: leadership change
> ([here|https://github.com/apache/kafka/blob/643db430a707479c9e87eec1ad67e1d4f43c9268/core/src/main/scala/kafka/server/ReplicaManager.scala#L2126]).
> On a leadership change, we check whether the partition has a future log;
> if so, we create the future log and call pauseCleaning
> (LogCleaningPaused count + 1). So the following sequence can occur while
> an alter replica logDirs is in progress:
> # alter replica logDirs for tp0 is triggered (LogCleaningPaused count = 1)
> # tp0 leadership changes (LogCleaningPaused count = 2)
> # alter replica logDirs completes and resumes log cleaning
> (LogCleaningPaused count = 1)
> # Log cleaning stays paused because the count never returns to 0
>
> Log cleaning is not only responsible for compacting logs; it also drives
> normal log retention processing, which means that log retention for these
> paused partitions will stay pending. Restarting the broker works around
> the issue.
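
The sequence described above can be reproduced with a small, simplified model of a per-partition pause counter. This is only a sketch: the class PauseCounterSketch and its methods are hypothetical names made up for illustration, not Kafka's LogCleanerManager API, and the only behavior assumed is what the description states (each pause adds 1, each resume subtracts 1, cleaning resumes at 0).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the pause counting described in this
// issue. It is NOT Kafka's LogCleanerManager; names are for illustration only.
class PauseCounterSketch {
    // Partition name -> number of outstanding pause requests.
    private final Map<String, Integer> pausedCount = new HashMap<>();

    // Each pause request increments the partition's count.
    void pauseCleaning(String partition) {
        pausedCount.merge(partition, 1, Integer::sum);
    }

    // Each resume request decrements the count; the entry is dropped at 0.
    void resumeCleaning(String partition) {
        Integer count = pausedCount.get(partition);
        if (count == null) {
            throw new IllegalStateException(partition + " is not paused");
        }
        if (count == 1) {
            pausedCount.remove(partition);
        } else {
            pausedCount.put(partition, count - 1);
        }
    }

    // Cleaning is considered paused while any pause request is outstanding.
    boolean isPaused(String partition) {
        return pausedCount.containsKey(partition);
    }

    int count(String partition) {
        return pausedCount.getOrDefault(partition, 0);
    }

    public static void main(String[] args) {
        PauseCounterSketch cleaner = new PauseCounterSketch();
        cleaner.pauseCleaning("tp0");   // step 1: alter replica logDirs triggered
        cleaner.pauseCleaning("tp0");   // step 2: leadership change pauses again
        cleaner.resumeCleaning("tp0");  // step 3: alter replica logDirs completes
        // step 4: one pause is still outstanding, so cleaning never resumes
        System.out.println("count=" + cleaner.count("tp0")
                + " paused=" + cleaner.isPaused("tp0"));
        // prints "count=1 paused=true"
    }
}
```

In this model the second pause has no matching resume, so tp0 stays paused until the in-memory state is discarded, which is consistent with a broker restart clearing the problem.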
--
This message was sent by Atlassian Jira
(v8.20.10#820010)