[
https://issues.apache.org/jira/browse/KAFKA-19242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sanghyeok An updated KAFKA-19242:
---------------------------------
Description:
### Motivation
While investigating “events skipped in group rebalancing”
(spring‑projects/spring‑kafka#3703) I discovered a race condition between
- the main poll/commit thread, and
- the consumer‑coordinator heartbeat thread.
If the main thread enters `ConsumerCoordinator.sendOffsetCommitRequest()` while
the heartbeat thread is finishing a rebalance
(`SyncGroupResponseHandler.handle()`), the group state transitions in the
following order:
```
COMPLETING_REBALANCE → (race window) → STABLE
```
Because we read the state twice without a lock:
1. `generationIfStable()` returns `null` (state still `COMPLETING_REBALANCE`),
2. the heartbeat thread flips the state to `STABLE`,
3. the main thread re‑checks with `rebalanceInProgress()` and wrongly decides
that a rebalance is still active,
4. a spurious `CommitFailedException` is returned even though the commit could
succeed.
For more details, please refer to sequence diagram below. <img
width="1494" alt="image"
src="https://github.com/user-attachments/assets/90f19af5-5e2d-4566-aece-ef764df2d89c"
/>
### Impact
- The exception is semantically wrong: the consumer is in a stable group, but
reports failure.
- Frameworks and applications that rely on the semantics of
`CommitFailedException` and `RetryableCommitException` (for example `Spring
Kafka`) take the wrong code path, which can ultimately skip the events and
break “at‑most‑once” guarantees.
### Fix
We enlarge the synchronized block in
`ConsumerCoordinator.sendOffsetCommitRequest()` so that the consumer group
state is examined atomically with respect to the heartbeat thread:
was:
There has been the issue that events skipped in group rebalancing
([spring-projects/spring-kafka#3703|https://github.com/spring-projects/spring-kafka/issues/3703])
.
At the first, I thought it was caused from spring kafka.
However, After digging into the problem with debug, I concluded it was a
race condition issue in Kafka.
A race condition between the main thread and the consumer coordinator's
heartbeat thread exists when the main thread attempts to commit via
{{commitSync(...)}} while the consumer coordinator thread is handling
consumer group rebalancing.
Especially, this race condition will cause event skip frequently in case
of {{CooperativeSticky}} in used.
For more details, please refer to sequence diagram below.
!image-2025-05-05-19-19-06-147.png|width=592,height=350!
> Fix commit bugs caused by race condition during rebalancing.
> ------------------------------------------------------------
>
> Key: KAFKA-19242
> URL: https://issues.apache.org/jira/browse/KAFKA-19242
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Reporter: sanghyeok An
> Assignee: sanghyeok An
> Priority: Major
> Attachments: image-2025-05-05-19-19-06-147.png
>
>
> ### Motivation
> While investigating “events skipped in group rebalancing”
> (spring‑projects/spring‑kafka#3703) I discovered a race condition between
> - the main poll/commit thread, and
> - the consumer‑coordinator heartbeat thread.
> If the main thread enters `ConsumerCoordinator.sendOffsetCommitRequest()`
> while the heartbeat thread is finishing a rebalance
> (`SyncGroupResponseHandler.handle()`), the group state transitions in the
> following order:
> ```
> COMPLETING_REBALANCE → (race window) → STABLE
> ```
> Because we read the state twice without a lock:
> 1. `generationIfStable()` returns `null` (state still `COMPLETING_REBALANCE`),
> 2. the heartbeat thread flips the state to `STABLE`,
> 3. the main thread re‑checks with `rebalanceInProgress()` and wrongly decides
> that a rebalance is still active,
> 4. a spurious `CommitFailedException` is returned even though the commit
> could succeed.
> For more details, please refer to sequence diagram below. <img
> width="1494" alt="image"
> src="https://github.com/user-attachments/assets/90f19af5-5e2d-4566-aece-ef764df2d89c"
> />
> ### Impact
> - The exception is semantically wrong: the consumer is in a stable group, but
> reports failure.
> - Frameworks and applications that rely on the semantics of
> `CommitFailedException` and `RetryableCommitException` (for example `Spring
> Kafka`) take the wrong code path, which can ultimately skip the events and
> break “at‑most‑once” guarantees.
> ### Fix
> We enlarge the synchronized block in
> `ConsumerCoordinator.sendOffsetCommitRequest()` so that the consumer group
> state is examined atomically with respect to the heartbeat thread:
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)