[ https://issues.apache.org/jira/browse/KAFKA-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057790#comment-18057790 ]
Gergely Harmadás commented on KAFKA-20109:
------------------------------------------
Hi [~svdewitmam], I have started looking into the issue. Feel free to assign it
to me.
> Complete Kafka cluster dies on incorrect SSL config of a single controller
> --------------------------------------------------------------------------
>
> Key: KAFKA-20109
> URL: https://issues.apache.org/jira/browse/KAFKA-20109
> Project: Kafka
> Issue Type: Bug
> Components: config, controller
> Affects Versions: 4.1.1
> Environment: Debian trixie x86_64, Apache Kafka 3.9.0 - 4.1.1
> Reporter: Sven Dewit
> Priority: Major
> Attachments: reproduce.tar.gz
>
>
> Hello,
> we've recently run into a bug in Apache Kafka in KRaft mode where a whole
> mTLS-enabled cluster (controllers + brokers) dies if a single controller is
> (re)started with bad SSL principal mapping rules.
> The bad config was of course applied unintentionally while making changes
> to the config management of the system; it left
> {{ssl.principal.mapping.rules}} missing for the controller listener on that
> one node. As soon as this single controller was restarted, the whole cluster
> died within seconds, both controllers and brokers, with this error message:
> {code:java}
> ERROR Encountered fatal fault: Unexpected error in raft IO thread
> (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
> org.apache.kafka.common.errors.ClusterAuthorizationException: Received
> cluster authorization error in response InboundResponse(correlationId=493,
> data=BeginQuorumEpochResponseData(errorCode=31, topics=[], nodeEndpoints=[]),
> source=controller-3:9093 (id: 103 rack: null isFenced: false)) {code}
> While the missing/bad SSL principal mapping is a major misconfiguration on a
> cluster where in-cluster communication is based on mTLS, it still should
> not cause the whole cluster to terminate.
> The issue occurred on version 4.1.1 of Apache Kafka, but could be reproduced
> as far back as 3.9.0.
> To reproduce, see the attached tarball, which contains:
> * {{gen-test-ca-and-certs.sh}} to create the CA and certificates for brokers and
> controllers to work in mTLS mode
> * {{compose.yml}} to spin up the cluster with {{podman compose}}
> Once the cluster is running, the following steps reproduce the error:
> * {{podman compose down controller-3}} to stop controller 3
> * uncomment line 53 in {{compose.yml}} to delete controller 3's
> {{ssl.principal.mapping.rules}}
> * {{podman compose up controller-3}} and watch the cluster go down the drain
>
> In case I can provide you with any more information or support, don't hesitate
> to reach out to me.
>
> Best regards,
> Sven
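For readers unfamiliar with {{ssl.principal.mapping.rules}}, here is a minimal Python sketch of why a missing rule changes the principal a node presents. It is a simplified model, not Kafka's implementation. The certificate subject and the rule are hypothetical and not taken from the attached reproducer, and the fallback to {{DEFAULT}} (the full distinguished name) reflects my understanding of Kafka's documented default.

```python
import re

def map_principal(dn, rules):
    """Return the result of the first rule that applies (simplified model)."""
    for rule in rules:
        if rule == "DEFAULT":
            return dn  # DEFAULT keeps the full distinguished name
        pattern, replacement = rule
        if re.fullmatch(pattern, dn):
            return re.sub(pattern, replacement, dn, count=1)
    raise ValueError(f"no rule applies to {dn}")

# Hypothetical certificate subject and mapping rule (assumptions for illustration).
dn = "CN=controller-3,OU=kafka,O=example"
with_rule = [(r"CN=([^,]+),.*", r"\1"), "DEFAULT"]
without_rule = ["DEFAULT"]  # assumed fallback when the property is missing

print(map_principal(dn, with_rule))     # short name extracted from the CN
print(map_principal(dn, without_rule))  # full distinguished name
```

If the cluster's authorization setup refers to the short names, a node presenting the full distinguished name would presumably not match, which would be consistent with the ClusterAuthorizationException (errorCode=31) in the log above. Whether the reproducer is configured this way is an assumption.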
--
This message was sent by Atlassian Jira
(v8.20.10#820010)