[
https://issues.apache.org/jira/browse/KAFKA-14693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
José Armando García Sancio resolved KAFKA-14693.
------------------------------------------------
Resolution: Fixed
> KRaft Controller and ProcessExitingFaultHandler can deadlock shutdown
> ---------------------------------------------------------------------
>
> Key: KAFKA-14693
> URL: https://issues.apache.org/jira/browse/KAFKA-14693
> Project: Kafka
> Issue Type: Bug
> Components: controller
> Affects Versions: 3.4.0
> Reporter: José Armando García Sancio
> Assignee: José Armando García Sancio
> Priority: Critical
> Fix For: 3.5.0, 3.4.1
>
>
> h1. Problem
> When the KRaft controller encounters an error that it cannot handle, it calls
> {{ProcessExitingFaultHandler}}, which calls {{Exit.exit}}, which in turn calls
> {{Runtime.exit}}.
> According to the {{Runtime.exit}} documentation:
> {quote}All registered [shutdown
> hooks|https://docs.oracle.com/javase/8/docs/api/java/lang/Runtime.html#addShutdownHook-java.lang.Thread-],
> if any, are started in some unspecified order and allowed to run
> concurrently until they finish. Once this is done the virtual machine
> [halts|https://docs.oracle.com/javase/8/docs/api/java/lang/Runtime.html#halt-int-].
> {quote}
> One of the shutdown hooks registered by Kafka is {{Server.shutdown()}}.
> This shutdown hook eventually calls {{KafkaEventQueue.close}}, which joins
> on the controller thread. Unfortunately, the controller thread is itself
> blocked inside {{Runtime.exit}}, joining on the shutdown hook thread, so
> neither thread can make progress.
> Here are sample thread stacks:
> {code:java}
> "QuorumControllerEventHandler" #45 prio=5 os_prio=0 cpu=429352.87ms elapsed=620807.49s allocated=38544M defined_classes=353 tid=0x00007f5aeb31f800 nid=0x80c in Object.wait() [0x00007f5a658fb000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait([email protected]/Native Method)
> - waiting on <no object reference available>
> at java.lang.Thread.join([email protected]/Thread.java:1304)
> - locked <0x00000000a29241f8> (a org.apache.kafka.common.utils.KafkaThread)
> at java.lang.Thread.join([email protected]/Thread.java:1372)
> at java.lang.ApplicationShutdownHooks.runHooks([email protected]/ApplicationShutdownHooks.java:107)
> at java.lang.ApplicationShutdownHooks$1.run([email protected]/ApplicationShutdownHooks.java:46)
> at java.lang.Shutdown.runHooks([email protected]/Shutdown.java:130)
> at java.lang.Shutdown.exit([email protected]/Shutdown.java:173)
> - locked <0x00000000ffe020b8> (a java.lang.Class for java.lang.Shutdown)
> at java.lang.Runtime.exit([email protected]/Runtime.java:115)
> at java.lang.System.exit([email protected]/System.java:1860)
> at org.apache.kafka.common.utils.Exit$2.execute(Exit.java:43)
> at org.apache.kafka.common.utils.Exit.exit(Exit.java:66)
> at org.apache.kafka.common.utils.Exit.exit(Exit.java:62)
> at org.apache.kafka.server.fault.ProcessExitingFaultHandler.handleFault(ProcessExitingFaultHandler.java:54)
> at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:891)
> at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:874)
> at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:969)
> {code}
> and
> {code:java}
> "kafka-shutdown-hook" #35 prio=5 os_prio=0 cpu=43.42ms elapsed=378593.04s allocated=4732K defined_classes=74 tid=0x00007f5a7c09d800 nid=0x4f37 in Object.wait() [0x00007f5a47afd000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait([email protected]/Native Method)
> - waiting on <no object reference available>
> at java.lang.Thread.join([email protected]/Thread.java:1304)
> - locked <0x00000000a272bcb0> (a org.apache.kafka.common.utils.KafkaThread)
> at java.lang.Thread.join([email protected]/Thread.java:1372)
> at org.apache.kafka.queue.KafkaEventQueue.close(KafkaEventQueue.java:509)
> at org.apache.kafka.controller.QuorumController.close(QuorumController.java:2553)
> at kafka.server.ControllerServer.shutdown(ControllerServer.scala:521)
> at kafka.server.KafkaRaftServer.shutdown(KafkaRaftServer.scala:184)
> at kafka.Kafka$.$anonfun$main$3(Kafka.scala:99)
> at kafka.Kafka$$$Lambda$406/0x0000000800fb9730.apply$mcV$sp(Unknown Source)
> at kafka.utils.Exit$.$anonfun$addShutdownHook$1(Exit.scala:38)
> at kafka.Kafka$$$Lambda$407/0x0000000800fb9a10.run(Unknown Source)
> at java.lang.Thread.run([email protected]/Thread.java:833)
> at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:64)
> {code}
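The two stacks above show a mutual join: each thread waits for the other to finish. The following is a minimal sketch of that pattern using plain threads rather than real shutdown hooks, so that it can run and terminate safely. The class name {{ShutdownDeadlockSketch}}, the method {{deadlocks}}, and both thread names are hypothetical stand-ins and are not part of Kafka.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ShutdownDeadlockSketch {

    /**
     * Starts two threads that join on each other and reports whether both
     * were still blocked after waitMillis. The cycle is then broken with
     * interrupts so that the demo itself can finish.
     */
    public static boolean deadlocks(long waitMillis) {
        AtomicReference<Thread> controllerRef = new AtomicReference<>();

        // Stand-in for "kafka-shutdown-hook": closing the event queue
        // joins on the controller thread.
        Thread hook = new Thread(() -> {
            try {
                controllerRef.get().join();
            } catch (InterruptedException e) {
                // Interrupted only so that this demo can terminate.
            }
        }, "shutdown-hook-stand-in");

        // Stand-in for the controller thread: Runtime.exit starts the
        // shutdown hooks and then joins on them from the calling thread.
        Thread controller = new Thread(() -> {
            hook.start();
            try {
                hook.join();
            } catch (InterruptedException e) {
                // Interrupted only so that this demo can terminate.
            }
        }, "controller-stand-in");

        controllerRef.set(controller);
        controller.start();

        try {
            // Give both threads time to reach their join() calls.
            controller.join(waitMillis);
            boolean stuck = controller.isAlive() && hook.isAlive();

            // Break the cycle; a real JVM shutdown has no such escape hatch.
            controller.interrupt();
            hook.interrupt();
            controller.join();
            hook.join();
            return stuck;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + deadlocks(200));
    }
}
```

In this sketch the interrupts exist only to let the demo exit. In the reported scenario nothing interrupts either thread, so the process stays stuck in shutdown.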
> h1. Possible Solution
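> A possible solution is to have the controller's unhandled fault handler call
> {{Runtime.halt}} instead of {{Runtime.exit}}. Unlike {{Runtime.exit}},
> {{Runtime.halt}} does not run shutdown hooks, so the controller thread never
> waits on the shutdown hook thread.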
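One way to picture the proposed change is a fault handler whose termination step is injectable and defaults to {{Runtime.halt}}. This is only a sketch under stated assumptions, not the actual Kafka fix: the class {{HaltingFaultHandlerSketch}}, its constructors, and the exit status of 1 are all hypothetical.

```java
import java.util.function.IntConsumer;

public class HaltingFaultHandlerSketch {
    // Hypothetical hook that ends the process; injectable so the handler
    // can be exercised without killing the JVM.
    private final IntConsumer terminator;

    /** Default: halt the JVM immediately, without running shutdown hooks. */
    public HaltingFaultHandlerSketch() {
        this(status -> Runtime.getRuntime().halt(status));
    }

    public HaltingFaultHandlerSketch(IntConsumer terminator) {
        this.terminator = terminator;
    }

    /**
     * Logs the fault and terminates the process. Because Runtime.halt does
     * not start shutdown hooks, the calling thread never joins on a hook
     * that is itself waiting for the calling thread.
     */
    public RuntimeException handleFault(String message, Throwable cause) {
        System.err.println("Unrecoverable fault: " + message);
        terminator.accept(1); // status 1 is an arbitrary choice for this sketch
        // Only reached when the terminator is a test double.
        return new RuntimeException(message, cause);
    }

    public static void main(String[] args) {
        int[] seen = {-1};
        HaltingFaultHandlerSketch handler =
            new HaltingFaultHandlerSketch(status -> seen[0] = status);
        handler.handleFault("simulated fault", null);
        System.out.println("terminator called with status " + seen[0]);
    }
}
```

The trade-off is that halting skips every shutdown hook, so any cleanup those hooks would have done is skipped as well.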