[
https://issues.apache.org/jira/browse/KAFKA-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17815385#comment-17815385
]
Steve Jacobs commented on KAFKA-15467:
--------------------------------------
The way to reproduce this is an unclean shutdown of the broker. Every time I
kill or power off a node, I can reproduce this problem.
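For anyone wanting to confirm the broker-side state after such a restart, here
is a rough sketch of the kind of check that works (assumptions: a single-node
broker reachable at localhost:9092 and one of the affected partitions from the
logs below; adjust names for your setup):
{code:java}
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class OffsetRangeCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumption: single-node broker; substitute your bootstrap address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            TopicPartition tp = new TopicPartition("reading.sensor.hfp01sc", 0);
            long earliest = admin.listOffsets(Map.of(tp, OffsetSpec.earliest()))
                    .partitionResult(tp).get().offset();
            long latest = admin.listOffsets(Map.of(tp, OffsetSpec.latest()))
                    .partitionResult(tp).get().offset();
            // If a consumer's last committed position falls outside
            // [earliest, latest) right after the restart, the broker is
            // advertising a bad offset range for the partition.
            System.out.printf("%s: log start=%d, log end=%d%n", tp, earliest, latest);
        }
    }
}
{code}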
Personally: it is extremely frustrating that no one has looked at or responded
to this issue. I've reached out on the mailing lists and asked on Slack (both
Confluent and Apache), and I have not received a single response, not even an
"oh, that looks interesting". I feel like a ghost, and it is disheartening to
say the least.
> Kafka broker returns offset out of range for topic/partitions on restart from
> unclean shutdown
> ----------------------------------------------------------------------------------------------
>
> Key: KAFKA-15467
> URL: https://issues.apache.org/jira/browse/KAFKA-15467
> Project: Kafka
> Issue Type: Bug
> Components: core, log
> Affects Versions: 3.5.1
> Environment: Apache Kafka 3.5.1 with Strimzi on kubernetes.
> Reporter: Steve Jacobs
> Priority: Major
>
> So this started with me thinking it was a mirrormaker2 issue, because these
> are the symptoms I am seeing:
> I'm encountering an odd issue with mirrormaker2 in our remote replication
> setup to high-latency remote sites (satellite links).
> Every few days several topics get completely re-replicated; this appears
> to happen after a network connectivity outage. It doesn't matter whether it's
> a long outage (hours) or a short one (minutes), and it only seems to affect a
> few topics.
> I was finally able to track down some logs showing the issue. This was after
> an outage of roughly an hour during which connectivity was down. There were
> lots of logs about connection timeouts, etc. Here is the relevant part from
> when the connection came back up:
> {code:java}
> 2023-09-08 16:52:45,380 INFO [scbi->gcp.MirrorSourceConnector|worker]
> [AdminClient
> clientId=mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
> Disconnecting from node 0 due to socket connection setup timeout. The
> timeout value is 63245 ms. (org.apache.kafka.clients.NetworkClient)
> [kafka-admin-client-thread |
> mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
> 2023-09-08 16:52:45,380 INFO [scbi->gcp.MirrorSourceConnector|worker]
> [AdminClient
> clientId=mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
> Metadata update failed
> (org.apache.kafka.clients.admin.internals.AdminMetadataManager)
> [kafka-admin-client-thread |
> mm2-admin-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector|replication-source-admin]
> 2023-09-08 16:52:47,029 INFO [scbi->gcp.MirrorSourceConnector|task-1]
> [Consumer
> clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer,
> groupId=null] Disconnecting from node 0 due to socket connection setup
> timeout. The timeout value is 52624 ms.
> (org.apache.kafka.clients.NetworkClient)
> [task-thread-scbi->gcp.MirrorSourceConnector-1]
> 2023-09-08 16:52:47,029 INFO [scbi->gcp.MirrorSourceConnector|task-1]
> [Consumer
> clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer,
> groupId=null] Error sending fetch request (sessionId=460667411,
> epoch=INITIAL) to node 0: (org.apache.kafka.clients.FetchSessionHandler)
> [task-thread-scbi->gcp.MirrorSourceConnector-1]
> 2023-09-08 16:52:47,336 INFO [scbi->gcp.MirrorSourceConnector|worker]
> refreshing topics took 67359 ms (org.apache.kafka.connect.mirror.Scheduler)
> [Scheduler for MirrorSourceConnector:
> scbi->gcp|scbi->gcp.MirrorSourceConnector-refreshing topics]
> 2023-09-08 16:52:48,413 INFO [scbi->gcp.MirrorSourceConnector|task-1]
> [Consumer
> clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer,
> groupId=null] Fetch position FetchPosition{offset=4918131,
> offsetEpoch=Optional[0],
> currentLeader=LeaderAndEpoch{leader=Optional[kafka.scbi.eng.neoninternal.org:9094
> (id: 0 rack: null)], epoch=0}} is out of range for partition
> reading.sensor.hfp01sc-0, resetting offset
> (org.apache.kafka.clients.consumer.internals.AbstractFetch)
> [task-thread-scbi->gcp.MirrorSourceConnector-1]
> (Repeats for 11 more topics)
> 2023-09-08 16:52:48,479 INFO [scbi->gcp.MirrorSourceConnector|task-1]
> [Consumer
> clientId=mm2-consumer-scbi|scbi->gcp|scbi->gcp.MirrorSourceConnector-1|replication-consumer,
> groupId=null] Resetting offset for partition reading.sensor.hfp01sc-0 to
> position FetchPosition{offset=3444977, offsetEpoch=Optional.empty,
> currentLeader=LeaderAndEpoch{leader=Optional[kafka.scbi.eng.neoninternal.org:9094
> (id: 0 rack: null)], epoch=0}}.
> (org.apache.kafka.clients.consumer.internals.SubscriptionState)
> [task-thread-scbi->gcp.MirrorSourceConnector-1]
> (Repeats for 11 more topics) {code}
> The consumer reports that offset 4918131 is out of range for this
> topic/partition, but that offset still exists on the remote cluster; I can go
> pull it up with a consumer right now. The earliest offset in that topic that
> still exists is 3444977 as of yesterday. We have 30-day retention configured,
> so pulling in 30 days of duplicate data is a serious problem. It almost seems
> like a race condition, as we replicate 38 topics but only 12 were affected
> (on this occurrence).
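> To show the offset really is still readable, this is roughly what "pull it
> up with a consumer" looks like (sketch only; the bootstrap address, topic,
> and offset are taken from the logs above):
> {code:java}
> import java.time.Duration;
> import java.util.List;
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.clients.consumer.ConsumerRecord;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.TopicPartition;
> import org.apache.kafka.common.serialization.ByteArrayDeserializer;
>
> public class PullOffset {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
>                 "kafka.scbi.eng.neoninternal.org:9094");
>         props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
>                 ByteArrayDeserializer.class.getName());
>         props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
>                 ByteArrayDeserializer.class.getName());
>         try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
>             TopicPartition tp = new TopicPartition("reading.sensor.hfp01sc", 0);
>             consumer.assign(List.of(tp));
>             // Seek straight to the offset the MM2 consumer was told is out of range.
>             consumer.seek(tp, 4918131L);
>             for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(10))) {
>                 System.out.printf("offset=%d timestamp=%d%n", rec.offset(), rec.timestamp());
>             }
>         }
>     }
> }
> {code}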
> The number of topics affected seems to vary each time. Today I see one site
> has 2 topics it is resending, and another has 13.
> But since opening this issue, what I have discovered is that this error
> occurs every time I have a power failure at a remote site I mirror to.
> I can reproduce the issue by doing a hard reset on my single-node broker
> setup. The broker comes back up cleanly after recovering unflushed messages
> to segments, but while it is coming up I consistently get these
> offset-out-of-range errors from mirrormaker2. I've looked around online and
> found a few other issues that sound similar to mine, involving Kafka cluster
> upgrades and CPU hangs, which point to a similar state (crashed broker,
> unclean shutdown). Something about unclean recovery is causing this to
> occur.
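> For what it's worth, a plain consumer can at least be made to fail loudly
> here instead of silently rewinding: with auto.offset.reset=none, the
> out-of-range fetch surfaces as an exception. This is just an illustration
> with a standalone consumer (reusing the setup from the sketch above), not a
> claim about how MM2 configures its internal consumer:
> {code:java}
> // Same Properties as the previous sketch, plus one config change:
> // "none" = throw on OFFSET_OUT_OF_RANGE instead of silently rewinding
> // to the log start and re-pulling the whole retention window.
> props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
> try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
>     TopicPartition tp = new TopicPartition("reading.sensor.hfp01sc", 0);
>     consumer.assign(List.of(tp));
>     consumer.seek(tp, 4918131L);
>     try {
>         consumer.poll(Duration.ofSeconds(10));
>     } catch (OffsetOutOfRangeException e) {
>         // The broker rejected the position; investigate before reseeking
>         // rather than re-replicating 30 days of data.
>         System.err.println("Broker rejected position: "
>                 + e.offsetOutOfRangePartitions());
>     }
> }
> {code}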
--
This message was sent by Atlassian Jira
(v8.20.10#820010)