Martin Dickson created KAFKA-17562:
--------------------------------------
Summary: Failure detection for degraded brokers
Key: KAFKA-17562
URL: https://issues.apache.org/jira/browse/KAFKA-17562
Project: Kafka
Issue Type: Improvement
Components: core, replication
Reporter: Martin Dickson
Follow-on from [this mailing list
discussion|https://lists.apache.org/thread/z8xn2dm1zm3clymhh60hf7rzgw286k8q].
When the leader for a partition becomes degraded but does not fully fail, it
can remove all follower replicas from the ISR. This can happen solely due to a
problem with the leader (slow disk, degraded network, ...), and hence a single
failure can make the partition unavailable for writes (assuming
min.insync.replicas=2). If the leader then fully fails, the partition goes
offline, which introduces data loss risks during recovery.
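The failure sequence above can be sketched with a toy model. This is only an illustration, not Kafka's actual data structures: the Partition class and its methods are hypothetical, and it assumes producers use acks=all and that unclean leader election is disabled.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Toy model of one partition; names here are hypothetical."""
    leader: int
    isr: set
    min_insync_replicas: int = 2

    def shrink_isr(self, lagging):
        # The leader drops followers it judges to be lagging; it never drops itself.
        self.isr = (set(self.isr) - set(lagging)) | {self.leader}

    def accepts_acks_all_writes(self):
        # acks=all writes are rejected once the ISR is smaller than min.insync.replicas.
        return len(self.isr) >= self.min_insync_replicas

    def is_online(self, live_brokers):
        # Assumption: with unclean election disabled, a leader must come from the ISR.
        return any(b in live_brokers for b in self.isr)


# Broker 1 leads a partition replicated on brokers 1, 2 and 3.
p = Partition(leader=1, isr={1, 2, 3})

# The degraded leader blames its healthy followers and shrinks the ISR to itself.
p.shrink_isr(lagging=[2, 3])
after_shrink = (sorted(p.isr), p.accepts_acks_all_writes())

# The leader then fails completely; brokers 2 and 3 are alive but not in the ISR.
after_leader_failure = p.is_online(live_brokers={2, 3})

print(after_shrink, after_leader_failure)
```

In this model a single unhealthy broker is enough to block acks=all writes and then take the partition offline, even though two healthy replicas exist throughout.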
The recovery options will improve substantially with KIP-966 (again assuming
min.insync.replicas=2), but there is still a gap around failure detection.
In particular, KIP-966 alone doesn't help with the case where the broker stays
degraded for a long period of time without fully failing.
Currently, Kafka's failure detection is based on whether the broker can
maintain its connection with the metadata quorum. The suggestion here is to
consider more comprehensive failure detection, where a broker detected as
degraded could be demoted from leadership rather than fully fenced.
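As a rough sketch of what such a policy might look like: the signals, the threshold and the decide function below are all hypothetical and do not correspond to existing Kafka metrics, configs or APIs. The only behaviour taken from the description above is that losing the metadata quorum connection leads to fencing; the demotion branch is an assumption about how degradation could be handled.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    DEMOTE_LEADERSHIP = "demote_leadership"
    FENCE = "fence"


@dataclass
class BrokerHealth:
    # All fields are hypothetical signals, not existing Kafka metrics.
    quorum_heartbeat_ok: bool          # broker still talks to the metadata quorum
    leader_only_isr_fraction: float    # share of led partitions whose ISR is just the leader
    followers_healthy_elsewhere: bool  # the removed followers stay in sync under other leaders


def decide(health, isr_fraction_threshold=0.5):
    """Pick an action for one broker; the threshold is an arbitrary illustrative value."""
    if not health.quorum_heartbeat_ok:
        # Matches today's detection: no quorum connection means the broker is fenced.
        return Action.FENCE
    suspicious = health.leader_only_isr_fraction >= isr_fraction_threshold
    if suspicious and health.followers_healthy_elsewhere:
        # The followers look fine elsewhere, so the leader is the likely culprit:
        # move leadership away but keep the broker in the cluster as a follower.
        return Action.DEMOTE_LEADERSHIP
    return Action.NONE


degraded = BrokerHealth(True, 0.9, True)
healthy = BrokerHealth(True, 0.0, True)
disconnected = BrokerHealth(False, 0.0, True)
print(decide(degraded), decide(healthy), decide(disconnected))
```

Demotion is the milder of the two actions: the broker can keep replicating as a follower, so a false positive costs a leadership move rather than a replica.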
--
This message was sent by Atlassian Jira
(v8.20.10#820010)