[jira] [Updated] (KAFKA-16234) Log directory failure re-creates partitions in another logdir automatically

Gaurav Narula (Jira) Wed, 07 Feb 2024 07:04:10 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gaurav Narula updated KAFKA-16234:
----------------------------------
    Description: 
With [KAFKA-16157|https://github.com/apache/kafka/pull/15263] we changed the 
{{HostedPartition.Offline}} enum variant to embed a {{Partition}} object. 
Further, {{ReplicaManager::getOrCreatePartition}} compares the old and 
new topicIds to decide whether it needs to create a new log.

The getter for {{Partition::topicId}} retrieves the topicId from the 
{{log}} field or from {{logManager.currentLogs}}. The former is set to {{None}} 
when a partition is marked offline, and the key for the partition is removed 
from the latter by {{LogManager::handleLogDirFailure}}. Therefore, the topicId 
of a partition marked offline is always {{None}}, and new logs for all 
partitions in a failed log directory are always created on another disk.
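
The lookup chain described above can be illustrated with a small, self-contained Scala model. This is only a sketch under stated assumptions, not Kafka's code: the classes below are hypothetical stand-ins that reuse the names from this issue, the topicId is a plain {{String}} rather than Kafka's own type, and {{markOffline}} is an invented helper that models the {{log}} field being reset.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for Kafka's types; only the lookup logic is modeled.
final case class TopicPartition(topic: String, partition: Int)
final case class Log(topicId: Option[String], dir: String)

class LogManager {
  val currentLogs = mutable.Map.empty[TopicPartition, Log]

  // Models the effect described in this issue: entries for logs in the
  // failed directory are removed from currentLogs.
  def handleLogDirFailure(dir: String): Unit = {
    val failed = currentLogs.filter { case (_, log) => log.dir == dir }.keys.toList
    currentLogs --= failed
  }
}

class Partition(val tp: TopicPartition, logManager: LogManager) {
  var log: Option[Log] = None

  // topicId is derived from `log` first, then from logManager.currentLogs.
  def topicId: Option[String] =
    log.flatMap(_.topicId)
      .orElse(logManager.currentLogs.get(tp).flatMap(_.topicId))

  // Invented helper: models the log field being set to None when offline.
  def markOffline(): Unit = { log = None }
}

object TopicIdLossDemo {
  def main(args: Array[String]): Unit = {
    val logManager = new LogManager
    val tp = TopicPartition("orders", 0)
    val partition = new Partition(tp, logManager)
    val log = Log(Some("topic-id-1"), "/data/d1")
    logManager.currentLogs(tp) = log
    partition.log = Some(log)
    println(partition.topicId) // defined while the log is online

    // Both sources of the topicId disappear once the directory fails.
    partition.markOffline()
    logManager.handleLogDirFailure("/data/d1")
    println(partition.topicId) // no source left to derive the topicId from
  }
}
```

In this model, once both sources are gone, a comparison of old and new topicIds can no longer match, which corresponds to the re-creation behaviour reported here.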

The broker will fail to restart after the failed disk is repaired because the 
same partitions will exist in two different directories. The error does, 
however, tell the operator to remove the partitions from the disk that failed, 
which should allow the broker to start.
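
As a rough aid for spotting this condition before a restart, the following Scala sketch finds partitions that appear in more than one log directory. It is an assumption-laden illustration, not the broker's startup check: {{DuplicatePartitionCheck}} and its input shape (log directory mapped to the partition directory names found in it) are hypothetical.

```scala
object DuplicatePartitionCheck {
  // Hypothetical input: log directory -> partition directory names found in it.
  // Returns each partition that occurs in more than one log directory,
  // together with the directories that contain it.
  def duplicates(dirs: Map[String, Set[String]]): Map[String, Set[String]] =
    dirs.toSeq
      .flatMap { case (dir, partitions) => partitions.map(p => p -> dir) }
      .groupBy(_._1)
      .map { case (partition, pairs) => partition -> pairs.map(_._2).toSet }
      .filter { case (_, containingDirs) => containingDirs.size > 1 }

  def main(args: Array[String]): Unit = {
    val found = duplicates(Map(
      "/data/d1" -> Set("orders-0", "orders-1"), // repaired disk
      "/data/d2" -> Set("orders-0")              // re-created copy
    ))
    found.foreach { case (partition, containingDirs) =>
      println(s"$partition is present in: ${containingDirs.mkString(", ")}")
    }
  }
}
```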

We can avoid this with KAFKA-16212, but in the short term an immediate solution 
is to have the {{Partition}} object accept an {{Option[TopicId]}} in its 
constructor and fall back to {{log}} or {{logManager}} if it is unset.
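
A minimal sketch of what the short-term proposal could look like, again using hypothetical stand-in classes rather than Kafka's real ones. The parameter name {{initialTopicId}}, the {{String}} topicId, and the {{TopicIdFix.sameTopic}} helper (a simplified stand-in for the comparison in {{ReplicaManager::getOrCreatePartition}}) are assumptions made for illustration.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for Kafka's types; only the lookup logic is modeled.
final case class TopicPartition(topic: String, partition: Int)
final case class Log(topicId: Option[String], dir: String)

class LogManager {
  val currentLogs = mutable.Map.empty[TopicPartition, Log]
}

class Partition(val tp: TopicPartition,
                logManager: LogManager,
                initialTopicId: Option[String] = None) {
  var log: Option[Log] = None

  // Prefer the topicId supplied at construction; fall back to `log` and
  // then to logManager.currentLogs only if it is unset.
  def topicId: Option[String] =
    initialTopicId
      .orElse(log.flatMap(_.topicId))
      .orElse(logManager.currentLogs.get(tp).flatMap(_.topicId))
}

object TopicIdFix {
  // Simplified stand-in for the old-versus-new topicId comparison.
  def sameTopic(offline: Partition, requestedTopicId: String): Boolean =
    offline.topicId.contains(requestedTopicId)

  def main(args: Array[String]): Unit = {
    val logManager = new LogManager
    val offline = new Partition(TopicPartition("orders", 0), logManager, Some("topic-id-1"))
    // The topicId survives even though log is None and currentLogs is empty.
    println(sameTopic(offline, "topic-id-1"))
  }
}
```

Because the constructor argument does not depend on {{log}} or {{logManager.currentLogs}}, an offline partition in this model still reports its topicId after a log directory failure.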

  was:
With [KAFKA-16157|https://github.com/apache/kafka/pull/15263] we made changes 
in {{HostedPartition.Offline}} enum variant to embed {{Partition}} object. 
Further, {{ReplicaManager::getOrCreatePartition}} tries to compare the old and 
new topicIds to decide if it needs to create a new log.

The getter for `Partition::topicId` relies on retrieving the topicId from 
{{log}} field or {{logManager.currentLogs}}. The former is set to {{None}} when 
a partition is marked offline and the key for the partition is removed from the 
latter by {{LogManager::handleLogDirFailure}}. Therefore, topicId for a 
partitioned marked offline always returns {{None}} and new logs for all 
partitions in a failed log directory are always created on another disk.

The broker will fail to restart after the failed disk is repaired because same 
partitions will occur in two different directories. The error does however 
inform the operator to remove the partitions from the disk that failed which 
should help with broker startup.

We can avoid this with 
[KAFKA-16212|https://issues.apache.org/jira/browse/KAFKA-16212] but in the 
short-term, an immediate solution can be to have {{Partition}} object accept 
{{Option[TopicId]}} in it's constructor and have it fallback to {{log}} or 
{{logManager}} if it's unset.



> Log directory failure re-creates partitions in another logdir automatically
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-16234
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16234
>             Project: Kafka
>          Issue Type: Bug
>          Components: jbod
>    Affects Versions: 3.7.0
>            Reporter: Gaurav Narula
>            Assignee: Omnia Ibrahim
>            Priority: Major
>
> With [KAFKA-16157|https://github.com/apache/kafka/pull/15263] we changed the 
> {{HostedPartition.Offline}} enum variant to embed a {{Partition}} object. 
> Further, {{ReplicaManager::getOrCreatePartition}} compares the old and new 
> topicIds to decide whether it needs to create a new log.
> The getter for {{Partition::topicId}} retrieves the topicId from the 
> {{log}} field or from {{logManager.currentLogs}}. The former is set to 
> {{None}} when a partition is marked offline, and the key for the partition is 
> removed from the latter by {{LogManager::handleLogDirFailure}}. 
> Therefore, the topicId of a partition marked offline is always {{None}}, 
> and new logs for all partitions in a failed log directory are always created 
> on another disk.
> The broker will fail to restart after the failed disk is repaired because 
> the same partitions will exist in two different directories. The error does, 
> however, tell the operator to remove the partitions from the disk that 
> failed, which should allow the broker to start.
> We can avoid this with KAFKA-16212, but in the short term an immediate 
> solution is to have the {{Partition}} object accept an {{Option[TopicId]}} in 
> its constructor and fall back to {{log}} or {{logManager}} if it is 
> unset.



