[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Owen Nichols updated GEODE-8652: -------------------------------- Fix Version/s: 1.13.1 1.14.0 1.12.1 > member hung in Connection.notifyHandshakeWaiter() during disconnect waiting > for a lock held by another thread in Connection.readAck() > -------------------------------------------------------------------------------------------------------------------------------------- > > Key: GEODE-8652 > URL: https://issues.apache.org/jira/browse/GEODE-8652 > Project: Geode > Issue Type: Bug > Components: membership, messaging > Affects Versions: 1.14.0 > Reporter: Bill Burcham > Assignee: Bill Burcham > Priority: Major > Labels: pull-request-available > Fix For: 1.12.1, 1.14.0, 1.13.1 > > > An application encountered the following hang in a TLS-enabled cluster. > Let's call the cluster members ds3 -> ds1. > ds3 sends a {{PutAllPRMessage}} to ds1 and is stuck in > {{SocketChannel.read()}} waiting for the acknowledgement. > {{ClusterDistributionManager.uncleanShutdown()}} is invoked on ds3 to shut > the member down. That thread blocks trying to acquire a lock on the > {{NioSslEngine}} held by the first thread (the one doing waiting for the ack > to the put-all.) > Somehow the shutdown thread must be allowed to proceed. > Here's the hung thread in ds3 (a.k.a. vm_3_thr_34_dataStore3_host1_8592) > trying to shut down the member but it's stuck waiting for the monitor on the > {{NioSslEngine}}: > {noformat} > "vm_3_thr_34_dataStore3_host1_8592" #795 daemon prio=5 os_prio=0 > tid=0x00007fdb9c011000 nid=0x2fcc waiting for monitor entry > [0x00007fdb6f4b7000] > java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.geode.internal.tcp.Connection.notifyHandshakeWaiter(Connection.java:804) > - waiting to lock <0x00000000f2635b28> (a > org.apache.geode.internal.net.NioSslEngine) > at org.apache.geode.internal.tcp.Connection.close(Connection.java:1350) > at > org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1278) > at > org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612) > at > org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604) > at > org.apache.geode.internal.tcp.ConnectionTable.close(ConnectionTable.java:661) > - locked <0x00000000f2678cf8> (a java.util.ArrayList) > - locked <0x00000000f1187348> (a java.util.concurrent.ConcurrentHashMap) > at org.apache.geode.internal.tcp.TCPConduit.stop(TCPConduit.java:487) > at > org.apache.geode.distributed.internal.direct.DirectChannel.disconnect(DirectChannel.java:644) > - locked <0x00000000f11867a8> (a > org.apache.geode.distributed.internal.direct.DirectChannel) > at > org.apache.geode.distributed.internal.DistributionImpl.disconnectDirectChannel(DistributionImpl.java:631) > at > org.apache.geode.distributed.internal.DistributionImpl.access$200(DistributionImpl.java:82) > at > org.apache.geode.distributed.internal.DistributionImpl$LifecycleListenerImpl.disconnect(DistributionImpl.java:904) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.stop(GMSMembership.java:1908) > at > org.apache.geode.distributed.internal.membership.gms.Services.stop(Services.java:302) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.shutdown(GMSMembership.java:1262) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.disconnect(GMSMembership.java:1847) > at > org.apache.geode.distributed.internal.DistributionImpl.disconnect(DistributionImpl.java:501) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.uncleanShutdown(ClusterDistributionManager.java:1291) > {noformat} > That thread is waiting on a lock held by this thread (in ds3) which is > waiting on an acknowledgement to a PutAllPRMessage sent to ds1. > {noformat} > "vm_3_thr_37_dataStore3_host1_8592" #857 daemon prio=5 os_prio=0 > tid=0x00007fdb9c030800 nid=0x30d1 runnable [0x00007fdb732f0000] > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378) > - locked <0x00000000f2643380> (a java.lang.Object) > at > org.apache.geode.internal.net.NioSslEngine.readAtLeast(NioSslEngine.java:330) > at > org.apache.geode.internal.tcp.MsgReader.readAtLeast(MsgReader.java:129) > at > org.apache.geode.internal.tcp.MsgReader.readHeader(MsgReader.java:58) > ==> - locked <0x00000000f2635b28> (a > org.apache.geode.internal.net.NioSslEngine) > at > org.apache.geode.internal.tcp.Connection.readAck(Connection.java:2652) > at > org.apache.geode.distributed.internal.direct.DirectChannel.readAcks(DirectChannel.java:392) > at > org.apache.geode.distributed.internal.direct.DirectChannel.sendToMany(DirectChannel.java:342) > at > org.apache.geode.distributed.internal.direct.DirectChannel.sendToOne(DirectChannel.java:182) > at > org.apache.geode.distributed.internal.direct.DirectChannel.send(DirectChannel.java:511) > at > org.apache.geode.distributed.internal.DistributionImpl.directChannelSend(DistributionImpl.java:346) > at > org.apache.geode.distributed.internal.DistributionImpl.send(DistributionImpl.java:291) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.sendViaMembershipManager(ClusterDistributionManager.java:2053) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.sendOutgoing(ClusterDistributionManager.java:1981) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.sendMessage(ClusterDistributionManager.java:2018) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.putOutgoing(ClusterDistributionManager.java:1083) > at > org.apache.geode.internal.cache.partitioned.PutAllPRMessage.send(PutAllPRMessage.java:201) > at > org.apache.geode.internal.cache.PartitionedRegion.tryToSendOnePutAllMessage(PartitionedRegion.java:2839) > at > org.apache.geode.internal.cache.PartitionedRegion.sendMsgByBucket(PartitionedRegion.java:2621) > at > org.apache.geode.internal.cache.PartitionedRegion.postPutAllSend(PartitionedRegion.java:2392) > at > org.apache.geode.internal.cache.LocalRegionDataView.postPutAll(LocalRegionDataView.java:361) > at > org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9154) > at > org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:8903) > {noformat} > What we see is that the {{MsgReader}} in the second thread is not letting the > first thread close the socket. Until the socket is closed, the second thread > will be stuck in {{SocketChannel.read()}}. > *But why is the second thread stuck in {{SocketChannelImpl.read}}? That may > be due to GEODE-8651!* -- This message was sent by Atlassian Jira (v8.3.4#803005)