[
https://issues.apache.org/jira/browse/KAFKA-19548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Haozhong Ma updated KAFKA-19548:
--------------------------------
Description: In our production environment, we encountered a scenario where
a broker failed to start due to checkpoint creation failure on a single disk
(caused by disk corruption or filesystem errors). According to Kafka's design,
such disk-level failures should be isolated via {{logDirFailureChannel}},
allowing other healthy disks to continue serving traffic. However, upon
reviewing the {{CheckpointFileWithFailureHandler}} implementation, we observed
that while methods like {{write}}, {{read}}, and {{writeIfDirExists}}
handle {{IOException}} by routing the affected {{log.dir}} to
{{logDirFailureChannel}}, the checkpoint initialization process lacks this
fault-tolerant behavior. Should checkpoint creation adopt the same
failure-handling logic? If the current behavior is not intentional, I will submit a
PR to fix this issue. (was: In our production environment, we encountered a
scenario where a broker failed to start due to checkpoint creation failure on a
single disk (caused by disk corruption or filesystem errors). According to
Kafka's design, such disk-level failures should be isolated via
{{logDirFailureChannel}}, allowing other healthy disks to continue serving
traffic. However, upon reviewing the {{CheckpointFileWithFailureHandler}}
implementation, we observed that while methods like {{write}},
{{read}}, and {{writeIfDirExists}} handle {{IOException}} by routing the
affected {{log.dir}} to {{logDirFailureChannel}}, the checkpoint
initialization process lacks this fault-tolerant behavior. Is this an
oversight? Should checkpoint creation adopt the same failure-handling logic?)
> Broker Startup: Handle Checkpoint Creation Failure via logDirFailureChannel
> ---------------------------------------------------------------------------
>
> Key: KAFKA-19548
> URL: https://issues.apache.org/jira/browse/KAFKA-19548
> Project: Kafka
> Issue Type: Improvement
> Components: core
> Reporter: Haozhong Ma
> Assignee: Haozhong Ma
> Priority: Major
>
> In our production environment, we encountered a scenario where a broker
> failed to start due to checkpoint creation failure on a single disk (caused
> by disk corruption or filesystem errors). According to Kafka's design, such
> disk-level failures should be isolated via {{logDirFailureChannel}},
> allowing other healthy disks to continue serving traffic. However, upon
> reviewing the {{CheckpointFileWithFailureHandler}} implementation, we
> observed that while methods like {{write}}, {{read}}, and
> {{writeIfDirExists}} handle {{IOException}} by routing the affected
> {{log.dir}} to {{logDirFailureChannel}}, the checkpoint initialization
> process lacks this fault-tolerant behavior. Should checkpoint creation adopt
> the same failure-handling logic? If the current behavior is not intentional, I
> will submit a PR to fix this issue.
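
To make the proposed behavior concrete, here is a minimal, self-contained sketch of isolating a checkpoint creation failure to a single log directory. It is not Kafka's actual code: {{FailureChannel}}, {{markOffline}}, {{createCheckpoint}}, and {{initCheckpoints}} are hypothetical names introduced only for illustration, and the real {{CheckpointFileWithFailureHandler}} and {{logDirFailureChannel}} APIs may differ. The sketch assumes that creating the checkpoint file can throw an {{IOException}} and that the caller should report the affected directory and carry on with the healthy ones.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class CheckpointInitSketch {

    /** Hypothetical stand-in for logDirFailureChannel: remembers which log dirs went offline. */
    static final class FailureChannel {
        private final Set<String> offlineDirs = ConcurrentHashMap.newKeySet();

        void markOffline(String logDir, String message, IOException cause) {
            // Report each directory only once, even if several operations fail on it.
            if (offlineDirs.add(logDir)) {
                System.err.println(message + ": " + cause);
            }
        }

        Set<String> offlineDirs() {
            return Set.copyOf(offlineDirs);
        }
    }

    /** Creates the checkpoint file if it is missing; any other I/O error propagates to the caller. */
    static Path createCheckpoint(Path logDir, String fileName) throws IOException {
        Path file = logDir.resolve(fileName);
        try {
            Files.createFile(file);
        } catch (FileAlreadyExistsException e) {
            // An existing checkpoint file is fine; reuse it.
        }
        return file;
    }

    /**
     * Initializes one checkpoint per log dir. A failing dir is reported to the
     * channel and skipped, so the healthy dirs are still initialized.
     */
    static List<Path> initCheckpoints(List<Path> logDirs, String fileName, FailureChannel channel) {
        List<Path> created = new ArrayList<>();
        for (Path dir : logDirs) {
            try {
                created.add(createCheckpoint(dir, fileName));
            } catch (IOException e) {
                channel.markOffline(dir.toString(), "Error while creating checkpoint in " + dir, e);
            }
        }
        return created;
    }

    public static void main(String[] args) throws IOException {
        Path healthy = Files.createTempDirectory("logdir-ok");
        // This directory is never created, so file creation inside it fails.
        Path broken = healthy.resolve("missing-disk");
        FailureChannel channel = new FailureChannel();
        List<Path> created = initCheckpoints(
                List.of(healthy, broken), "recovery-point-offset-checkpoint", channel);
        System.out.println("checkpoints created: " + created.size());
        System.out.println("offline dirs: " + channel.offlineDirs());
    }
}
```

In this sketch the failure is caught at the point where the checkpoint is created, so one bad directory never aborts the loop. Whether the real fix belongs in the constructor or in its callers is a design question for the PR.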
--
This message was sent by Atlassian Jira
(v8.20.10#820010)