[
https://issues.apache.org/jira/browse/KAFKA-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Colin Leroy updated KAFKA-16296:
--------------------------------
Description:
We have a rolling-restart problem we don't understand on a 3-node cluster.
When stopping a broker, everything goes fine and the partitions are reassigned
to the other brokers.
A few minutes after that broker restarts (here, the restart was at 10:11), it
shrinks the ISR because of "Out of sync replicas":
{code:java}
[2024-02-22 10:18:02,069] INFO [Partition OSS.PREPROD.Monitoring.Metric-5
broker=3] Shrinking ISR from 2,1,3 to 3. Leader: (highWatermark: 704389542,
endOffset: 704395843). Out of sync replicas: (brokerId: 2, endOffset: -1,
lastCaughtUpTimeMs: 1708593437335) (brokerId: 1, endOffset: -1,
lastCaughtUpTimeMs: 1708593437335). (kafka.cluster.Partition)
[2024-02-22 10:18:02,124] INFO [Partition OSS.PREPROD.Monitoring.Metric-5
broker=3] ISR updated to 3 (under-min-isr) and version updated to 1075
(kafka.cluster.Partition) {code}
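The `lastCaughtUpTimeMs` values in the log above are epoch milliseconds, which makes them hard to compare with the log timestamps by eye. The following is a minimal sketch, not part of the original report, that extracts them from the "Shrinking ISR" line and converts them to UTC. The regular expression and the helper name `out_of_sync_replicas` are hypothetical and only assume the message layout shown above.

```python
import re
from datetime import datetime, timedelta, timezone

# The "Shrinking ISR" message from the report, joined back into a single line.
LOG_LINE = (
    "[2024-02-22 10:18:02,069] INFO [Partition OSS.PREPROD.Monitoring.Metric-5 "
    "broker=3] Shrinking ISR from 2,1,3 to 3. Leader: (highWatermark: 704389542, "
    "endOffset: 704395843). Out of sync replicas: (brokerId: 2, endOffset: -1, "
    "lastCaughtUpTimeMs: 1708593437335) (brokerId: 1, endOffset: -1, "
    "lastCaughtUpTimeMs: 1708593437335). (kafka.cluster.Partition)"
)

# Hypothetical pattern: assumes the "(brokerId: N, endOffset: N, lastCaughtUpTimeMs: N)"
# layout seen in the log excerpt above.
REPLICA_RE = re.compile(
    r"\(brokerId: (\d+), endOffset: (-?\d+), lastCaughtUpTimeMs: (\d+)\)"
)

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def out_of_sync_replicas(line):
    """Map each out-of-sync broker id to its last caught-up time (UTC)."""
    return {
        int(broker): EPOCH + timedelta(milliseconds=int(ms))
        for broker, _end_offset, ms in REPLICA_RE.findall(line)
    }


if __name__ == "__main__":
    for broker, caught_up in sorted(out_of_sync_replicas(LOG_LINE).items()):
        print(f"broker {broker}: last caught up at {caught_up.isoformat()}")
```

For both followers this gives 09:17:17.335 UTC. The report does not state the timezone of the broker logs, so how far this lies from the 10:18:02 shrink message, and whether that gap exceeds the cluster's `replica.lag.time.max.ms`, depends on settings that are not shown here.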
I do not understand why brokers 1 and 2 would be out of sync. Since they were
not restarted, I would expect them to be in sync.
This, of course, causes problems, as producers reconnect to broker 3 only to
find that the min ISR requirement is not fulfilled.
I have attached the logs for one of the affected partitions, both from broker 3
(the restarted one) and broker 2 (not restarted).
Thanks in advance,
Colin
was:
We have a rolling-restart problem we don't understand on a 3-node cluster.
When stopping a broker, everything goes fine and the partitions are reassigned
to the other brokers.
When that broker restarts, it shrinks ISR because of "Out of sync replicas":
{code:java}
[2024-02-22 10:18:02,069] INFO [Partition OSS.PREPROD.Monitoring.Metric-5
broker=3] Shrinking ISR from 2,1,3 to 3. Leader: (highWatermark: 704389542,
endOffset: 704395843). Out of sync replicas: (brokerId: 2, endOffset: -1,
lastCaughtUpTimeMs: 1708593437335) (brokerId: 1, endOffset: -1,
lastCaughtUpTimeMs: 1708593437335). (kafka.cluster.Partition)
[2024-02-22 10:18:02,124] INFO [Partition OSS.PREPROD.Monitoring.Metric-5
broker=3] ISR updated to 3 (under-min-isr) and version updated to 1075
(kafka.cluster.Partition) {code}
I do not understand why brokers 1 and 2 would be out of sync. Since they were
not restarted, I would expect them to be in sync.
This, of course, causes problems, as producers reconnect to broker 3 only to
find that the min ISR requirement is not fulfilled.
I have attached the logs for one of the affected partitions, both from broker 3
(the restarted one) and broker 2 (not restarted).
Thanks in advance,
Colin
> Broker shrinks ISR when restarting
> ----------------------------------
>
> Key: KAFKA-16296
> URL: https://issues.apache.org/jira/browse/KAFKA-16296
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 3.6.1
> Reporter: Colin Leroy
> Priority: Major
> Attachments: broker2.log, broker3.log
>
>
> We have a rolling-restart problem we don't understand on a 3-node cluster.
> When stopping a broker, everything goes fine and the partitions are
> reassigned to the other brokers.
> A few minutes after that broker restarts (here, the restart was at 10:11), it
> shrinks the ISR because of "Out of sync replicas":
> {code:java}
> [2024-02-22 10:18:02,069] INFO [Partition OSS.PREPROD.Monitoring.Metric-5
> broker=3] Shrinking ISR from 2,1,3 to 3. Leader: (highWatermark: 704389542,
> endOffset: 704395843). Out of sync replicas: (brokerId: 2, endOffset: -1,
> lastCaughtUpTimeMs: 1708593437335) (brokerId: 1, endOffset: -1,
> lastCaughtUpTimeMs: 1708593437335). (kafka.cluster.Partition)
> [2024-02-22 10:18:02,124] INFO [Partition OSS.PREPROD.Monitoring.Metric-5
> broker=3] ISR updated to 3 (under-min-isr) and version updated to 1075
> (kafka.cluster.Partition) {code}
> I do not understand why brokers 1 and 2 would be out of sync. Since they were
> not restarted, I would expect them to be in sync.
> This, of course, causes problems, as producers reconnect to broker 3 only to
> find that the min ISR requirement is not fulfilled.
> I have attached the logs for one of the affected partitions, both from broker
> 3 (the restarted one) and broker 2 (not restarted).
> Thanks in advance,
> Colin
--
This message was sent by Atlassian Jira
(v8.20.10#820010)