[
https://issues.apache.org/jira/browse/KAFKA-19548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Haozhong Ma updated KAFKA-19548:
--------------------------------
Description: In our production environment, we encountered a scenario where
a broker failed to start due to checkpoint creation failure on a single disk
(caused by disk corruption or filesystem errors). According to Kafka's design,
such disk-level failures should be isolated via {{logDirFailureChannel}},
allowing other healthy disks to continue serving traffic. However, upon
reviewing the {{CheckpointFileWithFailureHandler}} implementation, we observed
that while methods like {{write}}, {{read}}, and {{writeIfDirExists}}
handle {{IOException}} by routing the affected {{log.dir}} to
{{logDirFailureChannel}}, the checkpoint initialization process lacks this
fault-tolerant behavior. Is this an oversight? Should checkpoint creation adopt
the same failure-handling logic? (was: the same text, followed by the
attachment !image-2025-07-25-15-07-18-919.png!)
> Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel
> ---------------------------------------------------------------------------
>
> Key: KAFKA-19548
> URL: https://issues.apache.org/jira/browse/KAFKA-19548
> Project: Kafka
> Issue Type: Improvement
> Components: core
> Reporter: Haozhong Ma
> Assignee: Haozhong Ma
> Priority: Major
>
> In our production environment, we encountered a scenario where a broker
> failed to start due to checkpoint creation failure on a single disk (caused
> by disk corruption or filesystem errors). According to Kafka's design, such
> disk-level failures should be isolated via {{logDirFailureChannel}},
> allowing other healthy disks to continue serving traffic. However, upon
> reviewing the {{CheckpointFileWithFailureHandler}} implementation, we
> observed that while methods like {{write}}, {{read}}, and
> {{writeIfDirExists}} handle {{IOException}} by routing the affected
> {{log.dir}} to {{logDirFailureChannel}}, the checkpoint initialization
> process lacks this fault-tolerant behavior. Is this an oversight? Should
> checkpoint creation adopt the same failure-handling logic?
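
To make the question concrete, here is a minimal, self-contained sketch of what routing a checkpoint creation failure to a failure channel could look like. It is not Kafka's code: {{FailureChannel}} is a hypothetical stand-in loosely modeled on {{LogDirFailureChannel}}, and {{createCheckpoint}}, the class name, and the file name passed in {{main}} are assumptions made purely for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CheckpointInitSketch {

    /** Hypothetical stand-in for a log dir failure channel: it only records offline dirs. */
    static final class FailureChannel {
        private final Set<String> offlineDirs = ConcurrentHashMap.newKeySet();

        void maybeAddOfflineLogDir(String logDir, String msg, IOException cause) {
            // Report each directory once, even if several operations fail on it.
            if (offlineDirs.add(logDir)) {
                System.err.println(msg + ": " + cause);
            }
        }

        Set<String> offlineDirs() {
            return offlineDirs;
        }
    }

    /**
     * Hypothetical helper: create the checkpoint file if it is missing.
     * On IOException the directory is marked offline instead of the error
     * propagating and aborting startup for every directory.
     */
    static Optional<Path> createCheckpoint(File logDir, String fileName, FailureChannel channel) {
        Path checkpoint = logDir.toPath().resolve(fileName);
        try {
            if (!Files.exists(checkpoint)) {
                Files.createFile(checkpoint); // throws IOException if the directory is unusable
            }
            return Optional.of(checkpoint);
        } catch (IOException e) {
            channel.maybeAddOfflineLogDir(logDir.getAbsolutePath(),
                "Error while creating checkpoint file " + checkpoint, e);
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws IOException {
        FailureChannel channel = new FailureChannel();
        File healthy = Files.createTempDirectory("healthy-log-dir").toFile();
        // This directory is never created, so creating a file inside it fails.
        File broken = new File(healthy, "missing-log-dir");

        for (File dir : List.of(healthy, broken)) {
            createCheckpoint(dir, "recovery-point-offset-checkpoint", channel);
        }
        System.out.println("offline dirs: " + channel.offlineDirs());
    }
}
```

In this sketch the unusable directory ends up in the offline set while the healthy one still gets its checkpoint file, which is the isolation behavior the description asks about. Whether the real initialization path can safely do the same is exactly the open question here.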