[
https://issues.apache.org/jira/browse/KAFKA-13191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
CS User updated KAFKA-13191:
----------------------------
Description:
We're using Confluent Platform 6.2, running in a Kubernetes environment. The
cluster has been running for a couple of years with zero issues, starting from
version 1.1, then 2.5 and now 2.8.
We very recently upgraded from Kafka 2.5 to Kafka 2.8.
Since upgrading, we have seen issues when Kafka and ZooKeeper pods restart
concurrently.
We can reproduce the issue by restarting either the ZooKeeper StatefulSet or
the Kafka StatefulSet first; either order appears to result in the same
failure scenario.
We've attempted to mitigate this by preventing the Kafka pods from stopping
while any ZooKeeper pods are being restarted or a rolling restart of the
ZooKeeper cluster is underway.
We've also added a check to stop the Kafka pods from starting until all
ZooKeeper pods are ready (see the sketch after the scenario below); however,
under the following scenario we still see the issue:
In a 3-node Kafka cluster with 5 ZooKeeper servers:
# kafka-2 starts to terminate - all ZooKeeper pods are running, so it proceeds
# zookeeper-4 terminates
# kafka-2 starts up and waits until the ZooKeeper rollout completes
# kafka-2 eventually starts fully; Kafka comes up and we see the errors below
on other pods in the cluster.
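For illustration, the start-up gate mentioned above is conceptually along these lines (a simplified sketch rather than our exact implementation; the host names, the port and the use of the "srvr" four-letter word are assumptions, and "srvr" has to be allowed via 4lw.commands.whitelist on ZooKeeper 3.5+):
{code:python}
#!/usr/bin/env python3
# Hypothetical start-up gate: block the Kafka process from starting until every
# ZooKeeper server answers the "srvr" four-letter command and reports a Mode.
# Host names and port are illustrative only.
import socket
import time

ZK_HOSTS = [f"zookeeper-{i}.zookeeper.svc.cluster.local" for i in range(5)]
ZK_PORT = 2181


def zk_is_serving(host: str) -> bool:
    """Return True if the server answers 'srvr' and reports a serving Mode."""
    try:
        with socket.create_connection((host, ZK_PORT), timeout=2) as sock:
            sock.sendall(b"srvr")
            reply = sock.recv(4096).decode("utf-8", errors="replace")
        # A serving node reports "Mode: leader", "Mode: follower" or "Mode: standalone".
        return "Mode:" in reply
    except OSError:
        return False


def wait_for_zookeeper(poll_seconds: int = 5) -> None:
    """Poll until all ZooKeeper servers are serving, then return."""
    while True:
        not_ready = [h for h in ZK_HOSTS if not zk_is_serving(h)]
        if not not_ready:
            print("all zookeeper servers are serving")
            return
        print("waiting for: " + ", ".join(not_ready))
        time.sleep(poll_seconds)


if __name__ == "__main__":
    wait_for_zookeeper()
{code}
A gate like this only confirms that each server is serving at start-up; as the scenario above shows, the overlap can still happen because zookeeper-4 terminates after kafka-2 has already begun its own restart.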
Both without the mitigation and in the scenario above, we see errors on pod kafka-0:
{noformat}
[2021-08-11 11:45:57,625] WARN Broker had a stale broker epoch (670014914375),
retrying. (kafka.server.DefaultAlterIsrManager){noformat}
kafka-1 seems OK.
When kafka-2 starts, it has this log entry with regard to its own broker epoch:
{noformat}
[2021-08-11 11:44:48,116] INFO Registered broker 2 at path /brokers/ids/2 with
addresses:
INTERNAL://kafka-2.kafka.svc.cluster.local:9092,INTERNAL_SECURE://kafka-2.kafka.svc.cluster.local:9094,
czxid (broker epoch): 674309865493 (kafka.zk.KafkaZkClient) {noformat}
This never appears to recover.
If you then restart kafka-2, you'll see these errors:
{noformat}
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication
factor: 3 larger than available brokers: 0. {noformat}
This seems to completely break the cluster; partitions do not fail over as
expected.
Checking ZooKeeper and getting the values of the broker znodes looks fine:
{noformat}
get /brokers/ids/0{noformat}
etc.; all looks fine there - each broker is present.
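The same check can be scripted so that the czxid of each registration znode is printed alongside it; the czxid is the value Kafka logs as the broker epoch (see the kafka-2 registration entry above), which makes it easy to compare against the epochs quoted in the warnings. A minimal sketch, assuming the kazoo Python client and an in-cluster ZooKeeper address (both are assumptions, adjust to your environment):
{code:python}
#!/usr/bin/env python3
# List every broker registration under /brokers/ids and print its czxid,
# i.e. the value Kafka reports as the "broker epoch".
import json

from kazoo.client import KazooClient

# Illustrative connection string; replace with your ZooKeeper address.
zk = KazooClient(hosts="zookeeper.zookeeper.svc.cluster.local:2181")
zk.start()

for broker_id in sorted(zk.get_children("/brokers/ids"), key=int):
    data, stat = zk.get(f"/brokers/ids/{broker_id}")
    registration = json.loads(data)
    print(f"broker {broker_id}: endpoints={registration.get('endpoints')} "
          f"czxid (broker epoch)={stat.czxid}")

zk.stop()
{code}
In our case the registrations themselves look healthy, which is what makes the persistent "stale broker epoch" warnings surprising.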
This error message appears to have been added to Kafka within the last 11 months:
{noformat}
Broker had a stale broker epoch {noformat}
Via this PR:
[https://github.com/apache/kafka/pull/9100]
I also see this comment about the leader getting stuck:
[https://github.com/apache/kafka/pull/9100/files#r494480847]
Recovery is possible by continuing to restart the remaining brokers in the
cluster. Once all have been restarted, everything looks fine.
Has anyone else come across this? It seems very simple to reproduce in our
environment: simply start simultaneous rolling restarts of both Kafka and
ZooKeeper.
I appreciate that ZooKeeper and Kafka would not normally be restarted
concurrently in this way. However, there are scenarios where this can happen,
such as simultaneous Kubernetes node failures resulting in the loss of both a
ZooKeeper and a Kafka pod at the same time. This could result in the issue
above.
This is not something we saw previously with versions 1.1 or 2.5.
Just to be clear, a rolling restart of only Kafka or only ZooKeeper is
absolutely fine.
> Kafka 2.8 - simultaneous restarts of Kafka and zookeeper result in broken
> cluster
> ---------------------------------------------------------------------------------
>
> Key: KAFKA-13191
> URL: https://issues.apache.org/jira/browse/KAFKA-13191
> Project: Kafka
> Issue Type: Bug
> Components: protocol
> Affects Versions: 2.8.0
> Reporter: CS User
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)