Gaurav Narula created KAFKA-16234:
-------------------------------------
Summary: Log directory failure re-creates partitions in another
logdir automatically
Key: KAFKA-16234
URL: https://issues.apache.org/jira/browse/KAFKA-16234
Project: Kafka
Issue Type: Bug
Components: jbod
Affects Versions: 3.7.0
Reporter: Gaurav Narula
With [KAFKA-16157|https://github.com/apache/kafka/pull/15263] we changed the
{{HostedPartition.Offline}} enum variant to embed a {{Partition}} object.
Further, {{ReplicaManager::getOrCreatePartition}} tries to compare the old and
new topicIds to decide if it needs to create a new log.
The getter for {{Partition::topicId}} relies on retrieving the topicId from the
{{log}} field or {{logManager.currentLogs}}. The former is set to {{None}}
when a partition is marked offline, and the key for the partition is removed
from the latter by {{LogManager::handleLogDirFailure}}. Therefore, the topicId
of a partition marked offline is always {{None}}, and new logs for all
partitions in a failed log directory are always created on another disk.
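The lookup chain above can be illustrated with a small stand-alone model. This is only a sketch in Java (the real code is Scala), and every class and method body here is a hypothetical stand-in that borrows the names from the description; it is not Kafka's actual implementation. It shows why both sources of the topicId are empty once the partition is offline and the log directory failure has been handled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified model of the lookup described above; not Kafka's actual code.
class TopicIdLookupSketch {

    // Stand-in for a log that carries its topic ID.
    static final class Log {
        final String topicId;
        Log(String topicId) { this.topicId = topicId; }
    }

    // Stand-in for LogManager: maps a partition name to its current log.
    static final class LogManager {
        final Map<String, Log> currentLogs = new HashMap<>();

        // Models the described effect of handleLogDirFailure: the entry is dropped.
        void handleLogDirFailure(String partitionName) {
            currentLogs.remove(partitionName);
        }
    }

    static final class Partition {
        final String name;
        final LogManager logManager;
        Optional<Log> log;

        Partition(String name, LogManager logManager, Log log) {
            this.name = name;
            this.logManager = logManager;
            this.log = Optional.of(log);
            logManager.currentLogs.put(name, log);
        }

        // Models marking the partition offline: the log reference is cleared.
        void markOffline() {
            log = Optional.empty();
        }

        // The topic ID comes from `log`, else from logManager.currentLogs.
        Optional<String> topicId() {
            return log.map(l -> l.topicId)
                .or(() -> Optional.ofNullable(logManager.currentLogs.get(name)).map(l -> l.topicId));
        }
    }

    public static void main(String[] args) {
        LogManager logManager = new LogManager();
        Partition partition = new Partition("topic-0", logManager, new Log("id-1"));
        System.out.println(partition.topicId().isPresent()); // true while online

        partition.markOffline();
        logManager.handleLogDirFailure("topic-0");
        System.out.println(partition.topicId().isPresent()); // false: both sources are gone
    }
}
```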
The broker will fail to restart after the failed disk is repaired because the
same partitions will exist in two different directories. The error does,
however, inform the operator to remove the partitions from the disk that
failed, which should help with broker startup.
We can avoid this with
[KAFKA-16212|https://issues.apache.org/jira/browse/KAFKA-16212] but in the
short term, an immediate solution can be to have the {{Partition}} object accept
an {{Option[TopicId]}} in its constructor and have it fall back to {{log}} or
{{logManager}} if it's unset.
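A minimal sketch of that short-term idea follows, again as a Java model rather than the Scala source. The class layout, the {{explicitTopicId}} field, and the {{matchesTopicId}} helper are assumptions made for illustration only; the actual constructor signature and the comparison in {{ReplicaManager::getOrCreatePartition}} may look different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of the proposed short-term fix; names and shapes are assumptions.
class PartitionTopicIdFixSketch {

    // Stand-in for a log that carries its topic ID.
    static final class Log {
        final String topicId;
        Log(String topicId) { this.topicId = topicId; }
    }

    static final class Partition {
        final String name;
        final Optional<String> explicitTopicId;   // the Option[TopicId] passed at construction
        final Map<String, Log> currentLogs;       // stand-in for logManager.currentLogs
        Optional<Log> log = Optional.empty();

        Partition(String name, Optional<String> explicitTopicId, Map<String, Log> currentLogs) {
            this.name = name;
            this.explicitTopicId = explicitTopicId;
            this.currentLogs = currentLogs;
        }

        // Prefer the ID given to the constructor; fall back to log, then currentLogs.
        Optional<String> topicId() {
            if (explicitTopicId.isPresent()) {
                return explicitTopicId;
            }
            return log.map(l -> l.topicId)
                .or(() -> Optional.ofNullable(currentLogs.get(name)).map(l -> l.topicId));
        }
    }

    // Hypothetical helper: true when the offline partition is known to belong to
    // the requested topic ID, so no new log should be created for it elsewhere.
    static boolean matchesTopicId(Partition offline, String requestedTopicId) {
        return offline.topicId().map(requestedTopicId::equals).orElse(false);
    }

    public static void main(String[] args) {
        Map<String, Log> currentLogs = new HashMap<>(); // empty, as after a log dir failure

        Partition withId = new Partition("topic-0", Optional.of("id-1"), currentLogs);
        Partition withoutId = new Partition("topic-0", Optional.empty(), currentLogs);

        System.out.println(matchesTopicId(withId, "id-1"));    // ID survives the failure
        System.out.println(matchesTopicId(withoutId, "id-1")); // ID is lost, as in the bug
    }
}
```

With the ID carried by the {{Partition}} itself, an offline partition can still be compared against the incoming topicId even though {{log}} is unset and the entry is gone from {{logManager}}.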
--
This message was sent by Atlassian Jira
(v8.20.10#820010)