Fathima Khazana Abdul Haiyum created KAFKA-13979:
----------------------------------------------------
Summary: Kafka resets committed offset after rebalance
Key: KAFKA-13979
URL: https://issues.apache.org/jira/browse/KAFKA-13979
Project: Kafka
Issue Type: Bug
Affects Versions: 2.6.2
Reporter: Fathima Khazana Abdul Haiyum
We have 3 nodes in our MSK cluster which run Apache Kafka 2.6.2. We have 15
partitions for a topic and 5 consumers in our consumer group, where each
consumer runs on it's own java application server. Whenever we deploy(rolling)
to our servers, we notice a huge consumer lag on *some* of the 15 partitions.
It appears that the consumer after rebalancing resets its committed offset and
reprocesses messages. For example: this is what I'm seeing:
{{}}
{code:java}
{code}
{{logger_name:org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
message:[Consumer clientId=myService-mytopic-0, groupId=myService-mytopic]
Committed offset 3044 for partition mytopic-0}}
So we know for a fact that the offset 3044 has been committed for partition 0.
Running {{`./kafka-consumer-groups.sh --describe` }} gives the following:
{{}}
{code:java}
{code}
{{GROUP PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID CLIENT-ID
myService-mytopic 0 3044 3044 0 myService-mytopic-0}}
{{ }}
After a deploy, which removes the consumer from the group and triggers a
rebalance + adds the consumer back, I see this:
{{}}
{code:java}
{code}
{{GROUP PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID CLIENT-ID
myService-mytopic 0 1890 3047 1157 myService-mytopic-0}}
{{ }}
In the application logs, I see this:
{{}}
{code:java}
{code}
{{logger_name:org.apache.kafka.clients.consumer.internals.Fetcher
message:[Consumer clientId=myService-mytopic-0, groupId=myService-mytopic]
Fetch position FetchPosition\{offset=1890, offsetEpoch=Optional.empty,
currentLeader=LeaderAndEpoch{leader=Optional[b-3.kafka-mytestserver.1gkwlu.c16.kafka.us-east-1.amazonaws.com:9098
(id: 3 rack: use1-az1)], epoch=0}} is out of range for partition mytopic-0,
resetting offset}}
{{ }} Why is kafka fetching the current-offset 1890 which is beyond the
committed offset for the partition after rebalance? This is on a test
environment where less than 1 message is produced per second. This issue occurs
for both auto commit (default interval) and manual commit mechanisms and on
kafka versions 2.6.2 and 2.8.1. On production, we have much more traffic and
causes reprocessing of around 2 million messages per partition.
{{`auto.offset.reset=latest`}} if that matters. We're using the java client
`kafka-clients` version 3.0.0.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)