[
https://issues.apache.org/jira/browse/KAFKA-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815385#comment-17815385
]
Steve Jacobs commented on KAFKA-15467:
--------------------------------------
The way to reproduce this is an unclean shutdown of the broker. Every time I
kill or power off a node, I can reproduce this problem.
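For anyone wanting to confirm the broker-side state after such a restart, here
is a rough sketch of the kind of check that works (assumptions: a single-node
broker reachable at localhost:9092 and one of the affected partitions from the
logs below; adjust names for your setup):
{code:java}
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class OffsetRangeCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumption: single-node broker; substitute your bootstrap address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            TopicPartition tp = new TopicPartition("reading.sensor.hfp01sc", 0);
            long earliest = admin.listOffsets(Map.of(tp, OffsetSpec.earliest()))
                    .partitionResult(tp).get().offset();
            long latest = admin.listOffsets(Map.of(tp, OffsetSpec.latest()))
                    .partitionResult(tp).get().offset();
            // If a consumer's last committed position falls outside
            // [earliest, latest) right after the restart, the broker is
            // advertising a bad offset range for the partition.
            System.out.printf("%s: log start=%d, log end=%d%n", tp, earliest, latest);
        }
    }
}
{code}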
Personally: it is extremely frustrating that no one has looked at or responded
to this issue. I've reached out on the mailing lists and asked on Slack (both
Confluent and Apache), and I have not received a single response, not even an
"oh, that looks interesting". I feel like a ghost, and it is disheartening to
say the least.
> Kafka broker returns offset out of range for topic/partitions on restart from
> unclean shutdown
> ----------------------------------------------------------------------------------------------
>
> Key: KAFKA-15467
> URL: https://issues.apache.org/jira/browse/KAFKA-15467
> Project: Kafka
> Issue Type: Bug
> Components: core, log
> Affects Versions: 3.5.1
> Environment: Apache Kafka 3.5.1 with Strimzi on kubernetes.
> Reporter: Steve Jacobs
> Priority: Major
>
> So this started with me thinking it was a mirrormaker2 issue, because these
> are the symptoms I am seeing:
> I'm encountering an odd issue with mirrormaker2 in our remote replication
> setup to high-latency remote sites (satellite links).
> Every few days several topics get completely re-replicated; this appears
> to happen after a network connectivity outage. It doesn't matter whether it's
> a long outage (hours) or a short one (minutes), and it only seems to affect a
> few topics.
> I was finally able to track down some logs showing the issue. This was after
> an outage of roughly an hour during which connectivity was down. There were
> lots of logs about connection timeouts, etc. Here is the relevant part from
> when the connection came back up:
> {code:java}
> 2023-09-08 16:52:45,380 INFO [scbi->gcp.MirrorSourceConnector|worker]
> [AdminClient
> clientId=mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
> Disconnecting from node 0 due to socket connection setup timeout. The
> timeout value is 63245 ms. (org.apache.kafka.clients.NetworkClient)
> [kafka-admin-client-thread |
> mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
> 2023-09-08 16:52:45,380 INFO [scbi->gcp.MirrorSourceConnector|worker]
> [AdminClient
> clientId=mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
> Metadata update failed
> (org.apache.kafka.clients.admin.internals.AdminMetadataManager)
> [kafka-admin-client-thread |
> mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
> 2023-09-08 16:52:47,029 INFO [scbi->gcp.MirrorSourceConnector|task-1]
> [Consumer
> clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer,
> groupId=null] Disconnecting from node 0 due to socket connection setup
> timeout. The timeout value is 52624 ms.
> (org.apache.kafka.clients.NetworkClient)
> [task-thread-scbi->gcp.MirrorSourceConnector-1]
> 2023-09-08 16:52:47,029 INFO [scbi->gcp.MirrorSourceConnector|task-1]
> [Consumer
> clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer,
> groupId=null] Error sending fetch request (sessionId=460667411,
> epoch=INITIAL) to node 0: (org.apache.kafka.clients.FetchSessionHandler)
> [task-thread-scbi->gcp.MirrorSourceConnector-1]
> 2023-09-08 16:52:47,336 INFO [scbi->gcp.MirrorSourceConnector|worker]
> refreshing topics took 67359 ms (org.apache.kafka.connect.mirror.Scheduler)
> [Scheduler for MirrorSourceConnector:
> scbi->gcp|scbi->gcp.MirrorSourceConnector-refreshing topics]
> 2023-09-08 16:52:48,413 INFO [scbi->gcp.MirrorSourceConnector|task-1]
> [Consumer
> clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer,
> groupId=null] Fetch position FetchPosition{offset=4918131,
> offsetEpoch=Optional[0],
> currentLeader=LeaderAndEpoch{leader=Optional[kafka.scbi.eng.neoninternal.org:9094
> (id: 0 rack: null)], epoch=0}} is out of range for partition
> reading.sensor.hfp01sc-0, resetting offset
> (org.apache.kafka.clients.consumer.internals.AbstractFetch)
> [task-thread-scbi->gcp.MirrorSourceConnector-1]
> (Repeats for 11 more topics)
> 2023-09-08 16:52:48,479 INFO [scbi->gcp.MirrorSourceConnector|task-1]
> [Consumer
> clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer,
> groupId=null] Resetting offset for partition reading.sensor.hfp01sc-0 to
> position FetchPosition{offset=3444977, offsetEpoch=Optional.empty,
> currentLeader=LeaderAndEpoch{leader=Optional[kafka.scbi.eng.neoninternal.org:9094
> (id: 0 rack: null)], epoch=0}}.
> (org.apache.kafka.clients.consumer.internals.SubscriptionState)
> [task-thread-scbi->gcp.MirrorSourceConnector-1]
> (Repeats for 11 more topics) {code}
> The consumer reports that offset 4918131 is out of range for this
> topic/partition, but that offset still exists on the remote cluster; I can go
> pull it up with a consumer right now. The earliest offset in that topic that
> still exists is 3444977 as of yesterday. We have 30-day retention configured,
> so pulling in 30 days of duplicate data is a serious problem. It almost seems
> like a race condition, as we replicate 38 topics but only 12 were affected
> (on this occurrence).
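> To show the offset really is still readable, this is roughly what "pull it
> up with a consumer" looks like (sketch only; the bootstrap address, topic,
> and offset are taken from the logs above):
> {code:java}
> import java.time.Duration;
> import java.util.List;
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.clients.consumer.ConsumerRecord;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.TopicPartition;
> import org.apache.kafka.common.serialization.ByteArrayDeserializer;
>
> public class PullOffset {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
>                 "kafka.scbi.eng.neoninternal.org:9094");
>         props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
>                 ByteArrayDeserializer.class.getName());
>         props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
>                 ByteArrayDeserializer.class.getName());
>         try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
>             TopicPartition tp = new TopicPartition("reading.sensor.hfp01sc", 0);
>             consumer.assign(List.of(tp));
>             // Seek straight to the offset the MM2 consumer was told is out of range.
>             consumer.seek(tp, 4918131L);
>             for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(10))) {
>                 System.out.printf("offset=%d timestamp=%d%n", rec.offset(), rec.timestamp());
>             }
>         }
>     }
> }
> {code}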
> The number of topics affected seems to vary each time. Today I see one site
> has 2 topics it is resending, and another has 13.
> But since opening this issue, what I have discovered is that this error
> occurs every time I have a power failure at a remote site I mirror to.
> I can reproduce the issue by doing a hard reset on my single-node broker
> setup. The broker comes back up cleanly after recovering unflushed messages
> to segments, but while it is coming up I consistently get these
> offset-out-of-range errors from mirrormaker2. I've looked around online and
> found a few other issues that sound similar to mine, involving Kafka cluster
> upgrades and CPU hangs, which point to a similar state (crashed broker,
> unclean shutdown). Something about unclean recovery is causing this to
> occur.
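> For what it's worth, a plain consumer can at least be made to fail loudly
> here instead of silently rewinding: with auto.offset.reset=none, the
> out-of-range fetch surfaces as an exception. This is just an illustration
> with a standalone consumer (reusing the setup from the sketch above), not a
> claim about how MM2 configures its internal consumer:
> {code:java}
> // Same Properties as the previous sketch, plus one config change:
> // "none" = throw on OFFSET_OUT_OF_RANGE instead of silently rewinding
> // to the log start and re-pulling the whole retention window.
> props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
> try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
>     TopicPartition tp = new TopicPartition("reading.sensor.hfp01sc", 0);
>     consumer.assign(List.of(tp));
>     consumer.seek(tp, 4918131L);
>     try {
>         consumer.poll(Duration.ofSeconds(10));
>     } catch (OffsetOutOfRangeException e) {
>         // The broker rejected the position; investigate before reseeking
>         // rather than re-replicating 30 days of data.
>         System.err.println("Broker rejected position: "
>                 + e.offsetOutOfRangePartitions());
>     }
> }
> {code}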
--
This message was sent by Atlassian Jira
(v8.20.10#820010)