[
https://issues.apache.org/jira/browse/KAFKA-19407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Haruki Okada updated KAFKA-19407:
---------------------------------
Description:
h2. Environment
* Kafka version: 3.3.2 (but we believe this issue still exists in the latest Kafka)
* Replication factor: 3
* Topic's cleanup.policy: delete
h2. Phenomenon
A partition in our cluster suddenly became UnderMinISR without any
hardware/network issue or operator action.
{code:java}
[2025-06-10 20:27:14,310] INFO [Partition topic-X-49 broker=17] Shrinking ISR
from 15,16,17 to 17. Leader: (highWatermark: 572579089, endOffset: 572579215).
Out of sync replicas: (brokerId: 15, endOffset: 572579089) (brokerId: 16,
endOffset: 572579089). (kafka.cluster.Partition)
{code}
On both followers, we saw the following log:
{code:java}
[2025-06-10 20:26:59,804] ERROR [ReplicaFetcher replicaId=16, leaderId=17,
fetcherId=1] Unexpected error occurred while processing data for partition
topic-X-49 at offset 572579089 (kafka.server.ReplicaFetcherThread)
kafka.common.OffsetsOutOfOrderException: Out of order offsets found in append
to topic-X-49: ArrayBuffer(572579089, 572579090, 572579091, 572579092,
572579093, 572579094, 572579095, 572579096, 572579097, 572579098, 572579099,
572579100, 572579101, 572579102, 572579103, 572579104, 572579105, 572579106,
572579107, 572579108, 572579109, 572579089,...
{code}
The log shows an offset regression after 572579109 (back to 572579089).
h2. Analysis
h3. The cause of the offset regression
We dumped the active log segment on the leader and confirmed that the leader's
log actually contains records with non-monotonically assigned offsets.
So the problem must exist on the leader rather than on the followers.
On the leader, we found the following log output right before the followers
hit OffsetsOutOfOrderException:
{code:java}
[2025-06-10 20:26:37,463] ERROR [ReplicaManager broker=17] Error processing
append operation on partition topic-X-49 (kafka.server.ReplicaManager)
java.lang.IllegalStateException: Attempt to append a timestamp (1749554796970)
to slot 1 no larger than the last timestamp appended (1749554797191) to
/path/to/kafka-log-dir/topic-X-49/00000000000572579060.timeindex.
at kafka.log.TimeIndex.$anonfun$maybeAppend$1(TimeIndex.scala:128)
at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:114)
at kafka.log.LogSegment.append(LogSegment.scala:167)
at kafka.log.LocalLog.append(LocalLog.scala:442)
at kafka.log.UnifiedLog.append(UnifiedLog.scala:950)
at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:760)
at
kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1170)
at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1158)
at
kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:956)
at
scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
at
scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
at scala.collection.mutable.HashMap.map(HashMap.scala:35)
at
kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:944)
at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:602)
at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:666)
at kafka.server.KafkaApis.handle(KafkaApis.scala:175)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
at java.base/java.lang.Thread.run(Thread.java:829)
{code}
This explains the offset regression:
* Because the IllegalStateException was thrown in the middle of handling a
produce request, the log end offset was not updated even though the records
had already been written to the log segment.
* When the leader handled subsequent produce requests, it assigned the same
offsets to new record batches.
** => This produced a non-monotonic offset sequence, which caused
OffsetsOutOfOrderException on the followers.
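The failure mode above can be sketched with a toy model (all class and method
names here are illustrative, not Kafka's actual code): records reach the
segment before the time-index append that can throw, so a failure leaves the
data written but the log end offset un-advanced, and the next append reuses
the same base offset.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Toy model: records are written to the segment *before* the step that may
// throw, so a failure leaves the data in place but the LEO un-advanced.
class ToyLog {
    final List<Long> segmentOffsets = new ArrayList<>();
    long logEndOffset = 0;

    void append(int numRecords, boolean timeIndexFails) {
        long base = logEndOffset;
        for (int i = 0; i < numRecords; i++) {
            segmentOffsets.add(base + i);      // records hit the segment first
        }
        if (timeIndexFails) {
            // stands in for the IllegalStateException from TimeIndex.maybeAppend
            throw new IllegalStateException("timestamp regression in time index");
        }
        logEndOffset = base + numRecords;      // never reached on failure
    }
}

public class OffsetRegressionDemo {
    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        log.append(3, false);                  // offsets 0,1,2; LEO = 3
        try {
            log.append(2, true);               // writes 3,4 but LEO stays 3
        } catch (IllegalStateException ignored) { }
        log.append(2, false);                  // reuses base offset 3
        System.out.println(log.segmentOffsets); // [0, 1, 2, 3, 4, 3, 4]
    }
}
{code}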
h3. The cause of IllegalStateException
The exception shows an attempt to append timestamp 1749554796970, which is
smaller than the last entry in the time index (1749554797191) and therefore
invalid.
The timestamp being appended here is
[LogSegment#maxTimestampSoFar|https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/log/LogSegment.scala#L355].
This means that maxTimestampSoFar regressed somehow.
After reading the code, we found that the race condition below is possible,
and it explains the phenomenon we experienced:
!image-2025-06-13-23-01-40-371.png|width=597,height=471!
We confirmed that the log segment was in fact rolled right before the
phenomenon (i.e. entering the [if block at
LogSegment#L106|https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/log/LogSegment.scala#L106]
is possible), and that there were clients calling the listOffsets API on the
partition, which matches the scenario.
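The interleaving can be replayed deterministically with a toy model
(illustrative names only; the real field is LogSegment's lazily loaded
maxTimestampAndOffsetSoFar): an unlocked reader checks the cache, loads the
last time-index entry, and stores it back after a concurrent append has
already advanced the value, so the cached max timestamp regresses.
{code:java}
// Toy replay of the check-then-act race on a lazily loaded max timestamp.
class ToySegment {
    static final long UNKNOWN = -1L;
    long cachedMaxTimestamp = UNKNOWN;  // lazily loaded from the time index
    long timeIndexLastEntry = 100L;     // last timestamp in the time index

    // Unlocked reader path (e.g. serving listOffsets), split into its
    // check / load / store steps so the racy interleaving can be replayed.
    boolean checkUnknown() { return cachedMaxTimestamp == UNKNOWN; }
    long loadFromTimeIndex() { return timeIndexLastEntry; }
    void storeLoaded(long loaded) { cachedMaxTimestamp = loaded; }

    // Appender path: tracks the max timestamp and appends to the time index.
    void append(long ts) {
        cachedMaxTimestamp = Math.max(cachedMaxTimestamp, ts);
        if (cachedMaxTimestamp <= timeIndexLastEntry)
            throw new IllegalStateException(
                "Attempt to append a timestamp no larger than the last appended");
        timeIndexLastEntry = cachedMaxTimestamp;
    }
}

public class MaxTimestampRaceDemo {
    public static void main(String[] args) {
        ToySegment seg = new ToySegment();
        // 1. Reader sees the cache is unloaded and reads the time index (100)...
        boolean unknown = seg.checkUnknown();   // true
        long stale = seg.loadFromTimeIndex();   // 100
        // 2. ...but before it stores the value, an append advances the max to 200.
        seg.append(200L);
        // 3. The reader now stores its stale load, regressing the cached max.
        if (unknown) seg.storeLoaded(stale);    // cache: 200 -> 100
        // 4. The next append (ts=150) computes max(100, 150) = 150, which is
        //    smaller than the index's last entry (200) -> IllegalStateException.
        try {
            seg.append(150L);
        } catch (IllegalStateException e) {
            System.out.println("reproduced: " + e.getMessage());
        }
    }
}
{code}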
h2. Suggested Fix
We should protect [maxTimestampAndOffsetSoFar
here|https://github.com/apache/kafka/blob/4.0.0/storage/src/main/java/org/apache/kafka/storage/internals/log/LogSegment.java#L195-L197]
with double-checked locking.
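A minimal sketch of the double-checked-locking idea, with illustrative names
(this is not the actual patch): the lazy load and its store happen under a
lock, with a second check so a reader can never clobber a value a concurrent
append has already advanced.
{code:java}
// Sketch only: the lazy load is guarded by double-checked locking.
class FixedSegment {
    private volatile Long cachedMaxTimestamp = null; // null = not yet loaded
    private long timeIndexLastEntry = 100L;          // stand-in for the time index
    private final Object lock = new Object();

    long maxTimestampSoFar() {
        Long local = cachedMaxTimestamp;
        if (local == null) {                 // first check, no lock
            synchronized (lock) {
                local = cachedMaxTimestamp;
                if (local == null) {         // second check, under the lock
                    local = timeIndexLastEntry;
                    cachedMaxTimestamp = local;
                }
            }
        }
        return local;
    }

    void append(long ts) {
        synchronized (lock) {                // appends update under the same lock
            cachedMaxTimestamp = Math.max(maxTimestampSoFar(), ts);
        }
    }
}
{code}
With this shape, a reader that races with an append either sees the cache
already populated (first check) or re-reads it under the lock (second check),
so the stale-store step from the buggy interleaving cannot occur.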
> OffsetsOutOfOrderException on followers due to the race condition in the
> leader
> -------------------------------------------------------------------------------
>
> Key: KAFKA-19407
> URL: https://issues.apache.org/jira/browse/KAFKA-19407
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.3.2
> Reporter: Haruki Okada
> Assignee: Haruki Okada
> Priority: Major
> Attachments: image-2025-06-13-23-01-40-371.png
>
>
> h2. Environment
> * Kafka version: 3.3.2 (But we suppose this issue still exists in latest
> Kafka)
> * Replication factor: 3
> * Topic's cleanup.policy: delete
> h2. Phenomenon
> We experienced a partition in our cluster got UnderMinISR suddenly without
> any hardware/network issue or any operation.
> {code:java}
> [2025-06-10 20:27:14,310] INFO [Partition topic-X-49 broker=17] Shrinking ISR
> from 15,16,17 to 17. Leader: (highWatermark: 572579089, endOffset:
> 572579215). Out of sync replicas: (brokerId: 15, endOffset: 572579089)
> (brokerId: 16, endOffset: 572579089). (kafka.cluster.Partition)
> {code}
> On both followers, we saw below log:
> {code:java}
> [2025-06-10 20:26:59,804] ERROR [ReplicaFetcher replicaId=16, leaderId=17,
> fetcherId=1] Unexpected error occurred while processing data for partition
> topic-X-49 at offset 572579089 (kafka.server.ReplicaFetcherThread)
> kafka.common.OffsetsOutOfOrderException: Out of order offsets found in append
> to topic-X-49: ArrayBuffer(572579089, 572579090, 572579091, 572579092,
> 572579093, 572579094, 572579095, 572579096, 572579097, 572579098, 572579099,
> 572579100, 572579101, 572579102, 572579103, 572579104, 572579105, 572579106,
> 572579107, 572579108, 572579109, 572579089,...
> {code}
> The log tells there's an offset regression after 572579109 (to 572579089).
> h2. Analysis
> h3. The cause of the offset regression
> We dumped active log segment on the leader and confirmed that the leader's
> log actually contains records where non-monotonic offsets assigned.
> So the problem must exists on the leader rather than the follower.
> On the leader, we found below log is output right before the follower
> experiences OffsetsOutOfOrderException.
> {code:java}
> [2025-06-10 20:26:37,463] ERROR [ReplicaManager broker=17] Error processing
> append operation on partition topic-X-49 (kafka.server.ReplicaManager)
> java.lang.IllegalStateException: Attempt to append a timestamp
> (1749554796970) to slot 1 no larger than the last timestamp appended
> (1749554797191) to
> /path/to/kafka-log-dir/topic-X-49/00000000000572579060.timeindex.
> at kafka.log.TimeIndex.$anonfun$maybeAppend$1(TimeIndex.scala:128)
> at kafka.log.TimeIndex.maybeAppend(TimeIndex.scala:114)
> at kafka.log.LogSegment.append(LogSegment.scala:167)
> at kafka.log.LocalLog.append(LocalLog.scala:442)
> at kafka.log.UnifiedLog.append(UnifiedLog.scala:950)
> at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:760)
> at
> kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1170)
> at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1158)
> at
> kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:956)
> at
> scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
> at
> scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
> at scala.collection.mutable.HashMap.map(HashMap.scala:35)
> at
> kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:944)
> at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:602)
> at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:666)
> at kafka.server.KafkaApis.handle(KafkaApis.scala:175)
> at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
> at java.base/java.lang.Thread.run(Thread.java:829)
> {code}
> This explains the offset regression as below:
> * Due to the IllegalStateException in the middle of handling produce
> request, LogEndOffset wasn't updated while records were already written to
> the log segment.
> * When the leader handles subsequent produce requests, it assigns same
> offset to the record batch
> ** => Non-monotonic offset sequence happens and it causes
> OffsetsOutOfOrderException on followers
> h3. The cause of IllegalStateException
> The exception tells that trying to append timestamp 1749554796970 which is
> smaller than the last entry in the time index (1749554797191), that is
> invalid.
> The timestamp about to be appended here is
> [LogSegment#maxTimestampSoFar|https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/log/LogSegment.scala#L355].
> It means that maxTimestampSoFar regressed somehow.
> After reading the code, we found that below race-condition is possible and it
> explains the phenomenon we experienced:
> !image-2025-06-13-23-01-40-371.png|width=597,height=471!
> Actually we confirmed the log-segment was rolled out right before the
> phenomenon (i.e. entering [if block at
> LogSegment#L106|https://github.com/apache/kafka/blob/3.3.2/core/src/main/scala/kafka/log/LogSegment.scala#L106]
> is possible) and there were clients calling listOffsets API on the
> partition, which matches to the scenario.
> h2. Suggested Fix
> We should protect [maxTimestampAndOffsetSoFar
> here|https://github.com/apache/kafka/blob/4.0.0/storage/src/main/java/org/apache/kafka/storage/internals/log/LogSegment.java#L195-L197]
> by double-checked locking
--
This message was sent by Atlassian Jira
(v8.20.10#820010)