Travis Bischel created KAFKA-19235:
--------------------------------------
Summary: STALE_MEMBER_EPOCH is mostly non-recoverable and forces
lost commits when leaving a group (KIP-848)
Key: KAFKA-19235
URL: https://issues.apache.org/jira/browse/KAFKA-19235
Project: Kafka
Issue Type: Bug
Components: clients, consumer
Affects Versions: 4.0.0
Reporter: Travis Bischel
Flow:
* I heartbeat and receive memberEpoch 7, heartbeat interval 5s
* 3s later I want to leave the group
* In my OnRevoke before leaving, I commit offsets
* In the meantime, the broker has bumped the memberEpoch
* My OffsetCommit request fails with STALE_MEMBER_EPOCH
Because I am leaving the group, there will be no future heartbeat (besides the
one that actually leaves the group, with memberEpoch -1 or -2) from which to
get a new epoch so that I can issue a final commit.
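
A minimal sketch of the flow above, using a toy in-memory coordinator rather than a real broker. The `ToyCoordinator` class, its methods, and the exact-match epoch check are all assumptions made for illustration. They are not Kafka APIs and may not reflect every detail of the broker's validation.

```python
from dataclasses import dataclass

STALE_MEMBER_EPOCH = "STALE_MEMBER_EPOCH"
NONE = "NONE"


@dataclass
class ToyCoordinator:
    """Hypothetical stand-in for the group coordinator; not a real Kafka API."""
    member_epoch: int = 7

    def heartbeat(self) -> int:
        # A heartbeat response carries the member's current epoch.
        return self.member_epoch

    def bump_epoch(self) -> None:
        # Models the broker advancing the epoch between two client heartbeats.
        self.member_epoch += 1

    def offset_commit(self, epoch: int) -> str:
        # Assumed rule for this toy: a commit with a non-matching epoch is rejected.
        return NONE if epoch == self.member_epoch else STALE_MEMBER_EPOCH


coordinator = ToyCoordinator()
client_epoch = coordinator.heartbeat()   # client learns memberEpoch 7
coordinator.bump_epoch()                 # broker moves on before the next heartbeat
result = coordinator.offset_commit(client_epoch)  # final commit in OnRevoke
print(result)
```

In this model the client has no way to learn the new epoch before it leaves, which is the situation described above.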
What I've tried to do locally is force an inline ConsumerGroupHeartbeat when I
receive STALE_MEMBER_EPOCH in an OffsetCommit response, and then reissue the
commit request with the refreshed epoch. However, Kafka 4 returns
FENCED_MEMBER_EPOCH _a lot_: this forced ConsumerGroupHeartbeat frequently
receives FENCED_MEMBER_EPOCH itself, so I cannot update the epoch.
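
The workaround described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the reporter's actual client code: `commit_with_epoch_refresh` and the `commit` and `heartbeat` callables are hypothetical names, and the error values are plain strings standing in for protocol error codes.

```python
STALE_MEMBER_EPOCH = "STALE_MEMBER_EPOCH"
FENCED_MEMBER_EPOCH = "FENCED_MEMBER_EPOCH"
NONE = "NONE"


def commit_with_epoch_refresh(commit, heartbeat, epoch):
    """Try a commit; on STALE_MEMBER_EPOCH, heartbeat once and retry once.

    `commit(epoch)` returns an error name and `heartbeat(epoch)` returns
    (error name, new epoch). Both are hypothetical callables.
    """
    err = commit(epoch)
    if err != STALE_MEMBER_EPOCH:
        return err
    hb_err, new_epoch = heartbeat(epoch)
    if hb_err != NONE:
        # For example FENCED_MEMBER_EPOCH: no usable epoch, so the commit is lost.
        return hb_err
    return commit(new_epoch)


def toy_commit(epoch):
    # Toy broker whose current member epoch is 8.
    return NONE if epoch == 8 else STALE_MEMBER_EPOCH


# Case 1: the forced heartbeat succeeds and returns the current epoch.
ok = commit_with_epoch_refresh(toy_commit, lambda e: (NONE, 8), epoch=7)

# Case 2: the forced heartbeat is itself fenced, so the epoch cannot be updated.
fenced = commit_with_epoch_refresh(
    toy_commit, lambda e: (FENCED_MEMBER_EPOCH, e), epoch=7
)
print(ok, fenced)
```

The second case is the one reported here: the retry path exists, but the heartbeat that would supply the new epoch fails first.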
Clients are meant to give up all partitions if they experience
FENCED_MEMBER_EPOCH and rejoin with a MemberEpoch of 0. Well, we're already in
the process of giving up partitions. The commit just can't go through.
The Java client appears to blindly retry the commit without doing anything
with the epoch. The epoch is likely handled elsewhere, but unless something
shows me otherwise, the Java client should then also be hitting the
FENCED_MEMBER_EPOCH problem:
[https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/clients/src/main/java/org/apache/kafka/clients/consumer/internals/CommitRequestManager.java#L346-L352]
There are some tests in the Java client codebase, but they do not actually test
whether the commit succeeds. They only check that the commit is scheduled to
be retried:
[https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/clients/src/test/java/org/apache/kafka/clients/consumer/internals/CommitRequestManagerTest.java#L481-L485]