blinkeye commented on PR #1898: URL: https://github.com/apache/zookeeper/pull/1898#issuecomment-2226826413
Thank you @luke-sterkowicz for the effort and the proposal. I've been observing the same issue.

There's another reason for introducing this enhancement: it makes ZK easier to operate. Currently, if you have a cluster with quorum and remove one instance (while still maintaining quorum), you get a WARN along with a `java.lang.InterruptedException: null` exception on each instance, as well as an actual ERROR, which looks concerning.

Here's an example of four instances with quorum and a `kill instance-3` (which is what [bin/zkServer.sh stop](https://github.com/apache/zookeeper/blob/master/bin/zkServer.sh#L216-L227) does), together with the corresponding logs. The ERROR is an `Unexpected exception`:

```bash
zookeeper-2 | 2024-07-13 08:21:58,033 [myid:] - ERROR [LearnerHandler-/192.168.64.4:47434:o.a.z.s.q.LearnerHandler@720] - Unexpected exception in LearnerHandler:
zookeeper-2 | java.io.EOFException: null
```

This looks concerning, and I would start an RCA. Even the WARN entries carry a `null` exception that appears to point to an actual issue. As it turns out, this happens every time an instance is removed.

Below are all the logs produced when `zookeeper-3` is removed. This is reproducible on [v3.9.2](https://github.com/apache/zookeeper/releases/tag/release-3.9.2).

```bash
zookeeper-1 | 2024-07-13 08:21:58,033 [myid:] - WARN [RecvWorker:3:o.a.z.s.q.QuorumCnxManager$RecvWorker@1402] - Connection broken for id 3, my id = 1
zookeeper-2 | 2024-07-13 08:21:58,033 [myid:] - WARN [RecvWorker:3:o.a.z.s.q.QuorumCnxManager$RecvWorker@1402] - Connection broken for id 3, my id = 2
zookeeper-1 | java.io.EOFException: null
zookeeper-4 | 2024-07-13 08:21:58,033 [myid:] - WARN [RecvWorker:3:o.a.z.s.q.QuorumCnxManager$RecvWorker@1402] - Connection broken for id 3, my id = 4
zookeeper-2 | java.io.EOFException: null
zookeeper-1 | at java.base/java.io.DataInputStream.readInt(Unknown Source)
zookeeper-4 | java.io.EOFException: null
zookeeper-2 | at java.base/java.io.DataInputStream.readInt(Unknown Source)
zookeeper-1 | at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1390)
zookeeper-1 | 2024-07-13 08:21:58,033 [myid:] - WARN [RecvWorker:3:o.a.z.s.q.QuorumCnxManager$RecvWorker@1408] - Interrupting SendWorker thread from RecvWorker. sid: 3. myId: 1
zookeeper-4 | at java.base/java.io.DataInputStream.readInt(Unknown Source)
zookeeper-4 | at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1390)
zookeeper-4 | 2024-07-13 08:21:58,033 [myid:] - WARN [RecvWorker:3:o.a.z.s.q.QuorumCnxManager$RecvWorker@1408] - Interrupting SendWorker thread from RecvWorker. sid: 3. myId: 4
zookeeper-2 | at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1390)
zookeeper-2 | 2024-07-13 08:21:58,033 [myid:] - ERROR [LearnerHandler-/192.168.64.4:47434:o.a.z.s.q.LearnerHandler@720] - Unexpected exception in LearnerHandler:
zookeeper-2 | java.io.EOFException: null
zookeeper-2 | at java.base/java.io.DataInputStream.readInt(Unknown Source)
zookeeper-2 | at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:96)
zookeeper-2 | at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:86)
zookeeper-2 | at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:134)
zookeeper-2 | at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:657)
zookeeper-2 | 2024-07-13 08:21:58,033 [myid:] - WARN [RecvWorker:3:o.a.z.s.q.QuorumCnxManager$RecvWorker@1408] - Interrupting SendWorker thread from RecvWorker. sid: 3. myId: 2
zookeeper-1 | 2024-07-13 08:21:58,033 [myid:] - WARN [SendWorker:3:o.a.z.s.q.QuorumCnxManager$SendWorker@1288] - Interrupted while waiting for message on queue
zookeeper-2 | 2024-07-13 08:21:58,033 [myid:] - INFO [LearnerHandler-/192.168.64.4:47434:o.a.z.s.q.LearnerHandler@1160] - Synchronously closing socket to learner 3.
zookeeper-2 | 2024-07-13 08:21:58,033 [myid:] - WARN [LearnerHandler-/192.168.64.4:47434:o.a.z.s.q.LearnerHandler@736] - ******* GOODBYE /192.168.64.4:47434 ********
zookeeper-1 | java.lang.InterruptedException: null
zookeeper-1 | at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
zookeeper-1 | at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
zookeeper-1 | at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1453)
zookeeper-2 | 2024-07-13 08:21:58,033 [myid:] - WARN [SendWorker:3:o.a.z.s.q.QuorumCnxManager$SendWorker@1288] - Interrupted while waiting for message on queue
zookeeper-2 | java.lang.InterruptedException: null
zookeeper-2 | at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
zookeeper-2 | at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
zookeeper-2 | at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1453)
zookeeper-2 | at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:99)
zookeeper-2 | at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1277)
zookeeper-2 | 2024-07-13 08:21:58,033 [myid:] - WARN [SendWorker:3:o.a.z.s.q.QuorumCnxManager$SendWorker@1300] - Send worker leaving thread id 3 my id = 2
zookeeper-1 | at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:99)
zookeeper-1 | at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1277)
zookeeper-1 | 2024-07-13 08:21:58,033 [myid:] - WARN [SendWorker:3:o.a.z.s.q.QuorumCnxManager$SendWorker@1300] - Send worker leaving thread id 3 my id = 1
zookeeper-4 | 2024-07-13 08:21:58,033 [myid:] - WARN [SendWorker:3:o.a.z.s.q.QuorumCnxManager$SendWorker@1288] - Interrupted while waiting for message on queue
zookeeper-4 | java.lang.InterruptedException: null
zookeeper-4 | at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
zookeeper-4 | at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
zookeeper-4 | at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1453)
zookeeper-4 | at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:99)
zookeeper-4 | at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1277)
zookeeper-4 | 2024-07-13 08:21:58,033 [myid:] - WARN [SendWorker:3:o.a.z.s.q.QuorumCnxManager$SendWorker@1300] - Send worker leaving thread id 3 my id = 4
zookeeper-3 exited with code 143
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
