Bill Burcham created GEODE-8652: ----------------------------------- Summary: member hung in Connection.notifyHandshakeWaiter() during disconnect waiting for a lock held by another thread in Connection.readAck() Key: GEODE-8652 URL: https://issues.apache.org/jira/browse/GEODE-8652 Project: Geode Issue Type: Bug Components: membership, messaging Affects Versions: 1.14.0 Reporter: Bill Burcham
An application encountered the following hang in a TLS-enabled cluster. Let's call the cluster members ds3 -> ds1. ds3 sends a {{PutAllPRMessage}} to ds1 and is stuck in {{SocketChannel.read()}} waiting for the acknowledgement. {{ClusterDistributionManager.uncleanShutdown()}} is invoked on ds3 to shut the member down. That thread blocks trying to acquire a lock on the {{NioSslEngine}} held by the first thread (the one doing waiting for the ack to the put-all.) Somehow the shutdown thread must be allowed to proceed. Here's the hung thread in ds3 (a.k.a. vm_3_thr_34_dataStore3_host1_8592) trying to shut down the member but it's stuck waiting for the monitor on the {{NioSslEngine}}: {noformat} "vm_3_thr_34_dataStore3_host1_8592" #795 daemon prio=5 os_prio=0 tid=0x00007fdb9c011000 nid=0x2fcc waiting for monitor entry [0x00007fdb6f4b7000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.geode.internal.tcp.Connection.notifyHandshakeWaiter(Connection.java:804) - waiting to lock <0x00000000f2635b28> (a org.apache.geode.internal.net.NioSslEngine) at org.apache.geode.internal.tcp.Connection.close(Connection.java:1350) at org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1278) at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612) at org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604) at org.apache.geode.internal.tcp.ConnectionTable.close(ConnectionTable.java:661) - locked <0x00000000f2678cf8> (a java.util.ArrayList) - locked <0x00000000f1187348> (a java.util.concurrent.ConcurrentHashMap) at org.apache.geode.internal.tcp.TCPConduit.stop(TCPConduit.java:487) at org.apache.geode.distributed.internal.direct.DirectChannel.disconnect(DirectChannel.java:644) - locked <0x00000000f11867a8> (a org.apache.geode.distributed.internal.direct.DirectChannel) at org.apache.geode.distributed.internal.DistributionImpl.disconnectDirectChannel(DistributionImpl.java:631) at org.apache.geode.distributed.internal.DistributionImpl.access$200(DistributionImpl.java:82) at org.apache.geode.distributed.internal.DistributionImpl$LifecycleListenerImpl.disconnect(DistributionImpl.java:904) at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.stop(GMSMembership.java:1908) at org.apache.geode.distributed.internal.membership.gms.Services.stop(Services.java:302) at org.apache.geode.distributed.internal.membership.gms.GMSMembership.shutdown(GMSMembership.java:1262) at org.apache.geode.distributed.internal.membership.gms.GMSMembership.disconnect(GMSMembership.java:1847) at org.apache.geode.distributed.internal.DistributionImpl.disconnect(DistributionImpl.java:501) at org.apache.geode.distributed.internal.ClusterDistributionManager.uncleanShutdown(ClusterDistributionManager.java:1291) {noformat} That thread is waiting on a lock held by this thread (in ds3) which is waiting on an acknowledgement to a PutAllPRMessage sent to ds1. {noformat} "vm_3_thr_37_dataStore3_host1_8592" #857 daemon prio=5 os_prio=0 tid=0x00007fdb9c030800 nid=0x30d1 runnable [0x00007fdb732f0000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378) - locked <0x00000000f2643380> (a java.lang.Object) at org.apache.geode.internal.net.NioSslEngine.readAtLeast(NioSslEngine.java:330) at org.apache.geode.internal.tcp.MsgReader.readAtLeast(MsgReader.java:129) at org.apache.geode.internal.tcp.MsgReader.readHeader(MsgReader.java:58) ==> - locked <0x00000000f2635b28> (a org.apache.geode.internal.net.NioSslEngine) at org.apache.geode.internal.tcp.Connection.readAck(Connection.java:2652) at org.apache.geode.distributed.internal.direct.DirectChannel.readAcks(DirectChannel.java:392) at org.apache.geode.distributed.internal.direct.DirectChannel.sendToMany(DirectChannel.java:342) at org.apache.geode.distributed.internal.direct.DirectChannel.sendToOne(DirectChannel.java:182) at org.apache.geode.distributed.internal.direct.DirectChannel.send(DirectChannel.java:511) at org.apache.geode.distributed.internal.DistributionImpl.directChannelSend(DistributionImpl.java:346) at org.apache.geode.distributed.internal.DistributionImpl.send(DistributionImpl.java:291) at org.apache.geode.distributed.internal.ClusterDistributionManager.sendViaMembershipManager(ClusterDistributionManager.java:2053) at org.apache.geode.distributed.internal.ClusterDistributionManager.sendOutgoing(ClusterDistributionManager.java:1981) at org.apache.geode.distributed.internal.ClusterDistributionManager.sendMessage(ClusterDistributionManager.java:2018) at org.apache.geode.distributed.internal.ClusterDistributionManager.putOutgoing(ClusterDistributionManager.java:1083) at org.apache.geode.internal.cache.partitioned.PutAllPRMessage.send(PutAllPRMessage.java:201) at org.apache.geode.internal.cache.PartitionedRegion.tryToSendOnePutAllMessage(PartitionedRegion.java:2839) at org.apache.geode.internal.cache.PartitionedRegion.sendMsgByBucket(PartitionedRegion.java:2621) at org.apache.geode.internal.cache.PartitionedRegion.postPutAllSend(PartitionedRegion.java:2392) at org.apache.geode.internal.cache.LocalRegionDataView.postPutAll(LocalRegionDataView.java:361) at org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9154) at org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:8903) {noformat} What we see is that the {{MsgReader}} in the second thread is not letting the first thread close the socket. Until the socket is closed, the second thread will be stuck in {{SocketChannel.read()}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)