[ https://issues.apache.org/jira/browse/GEODE-8652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234116#comment-17234116 ]
ASF subversion and git services commented on GEODE-8652: -------------------------------------------------------- Commit 32a8a55df4c0a5b8b1e5cfb746070804cbce5e0a in geode's branch refs/heads/master from Bill Burcham [ https://gitbox.apache.org/repos/asf?p=geode.git;h=32a8a55 ] * GEODE-8652: NioSslEngine.close() Bypasses Locks (#5712) - NioSslEngine.close() proceeds even if readers (or writers) are operating on its ByteBuffers, allowing Connection.close() to close its socket and proceed. - NioSslEngine.close() needed a lock only on the output buffer, so we split what was a single lock into two. Also instead of using synchronized we use a ReentrantLock so we can call tryLock() and time out if needed in NioSslEngine.close(). - Since readers/writers may hold locks on these input/output buffers when NioSslEngine.close() is called a reference count is maintained and the buffers are returned to the pool only when the last user is done. - To manage the locking and reference counting a new AutoCloseable ByteBufferSharing interface is introduced with a trivial implementation: ByteBufferSharingNoOp and a real implementation: ByteBufferSharingImpl. - Added a new unit test, and a new concurrency test for ByteBufferSharingImpl: both ensure that ByteBuffers are returned to the pool exactly once. Added a new DUnit test for the interaction between ByteBufferSharingImpl and NioSslEngine and Connection. Co-authored-by: Bill Burcham <bill.burc...@gmail.com> Co-authored-by: Darrel Schneider <dschnei...@pivotal.io> Co-authored-by: Ernie Burghardt <burghar...@vmware.com> Co-authored-by: Dan Smith <upthewatersp...@apache.org> (cherry picked from commit af267c005a63317cbb8528cdb38eccf6a8747818) > member hung in Connection.notifyHandshakeWaiter() during disconnect waiting > for a lock held by another thread in Connection.readAck() > -------------------------------------------------------------------------------------------------------------------------------------- > > Key: GEODE-8652 > URL: https://issues.apache.org/jira/browse/GEODE-8652 > Project: Geode > Issue Type: Bug > Components: membership, messaging > Affects Versions: 1.12.0, 1.13.0, 1.14.0 > Reporter: Bill Burcham > Assignee: Bill Burcham > Priority: Major > Labels: pull-request-available > Fix For: 1.12.1, 1.14.0, 1.13.1 > > > An application encountered the following hang in a TLS-enabled cluster. > Let's call the cluster members ds3 -> ds1. > ds3 sends a {{PutAllPRMessage}} to ds1 and is stuck in > {{SocketChannel.read()}} waiting for the acknowledgement. > {{ClusterDistributionManager.uncleanShutdown()}} is invoked on ds3 to shut > the member down. That thread blocks trying to acquire a lock on the > {{NioSslEngine}} held by the first thread (the one doing waiting for the ack > to the put-all.) > Somehow the shutdown thread must be allowed to proceed. > Here's the hung thread in ds3 (a.k.a. vm_3_thr_34_dataStore3_host1_8592) > trying to shut down the member but it's stuck waiting for the monitor on the > {{NioSslEngine}}: > {noformat} > "vm_3_thr_34_dataStore3_host1_8592" #795 daemon prio=5 os_prio=0 > tid=0x00007fdb9c011000 nid=0x2fcc waiting for monitor entry > [0x00007fdb6f4b7000] > java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.geode.internal.tcp.Connection.notifyHandshakeWaiter(Connection.java:804) > - waiting to lock <0x00000000f2635b28> (a > org.apache.geode.internal.net.NioSslEngine) > at org.apache.geode.internal.tcp.Connection.close(Connection.java:1350) > at > org.apache.geode.internal.tcp.Connection.closePartialConnect(Connection.java:1278) > at > org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:612) > at > org.apache.geode.internal.tcp.ConnectionTable.closeCon(ConnectionTable.java:604) > at > org.apache.geode.internal.tcp.ConnectionTable.close(ConnectionTable.java:661) > - locked <0x00000000f2678cf8> (a java.util.ArrayList) > - locked <0x00000000f1187348> (a java.util.concurrent.ConcurrentHashMap) > at org.apache.geode.internal.tcp.TCPConduit.stop(TCPConduit.java:487) > at > org.apache.geode.distributed.internal.direct.DirectChannel.disconnect(DirectChannel.java:644) > - locked <0x00000000f11867a8> (a > org.apache.geode.distributed.internal.direct.DirectChannel) > at > org.apache.geode.distributed.internal.DistributionImpl.disconnectDirectChannel(DistributionImpl.java:631) > at > org.apache.geode.distributed.internal.DistributionImpl.access$200(DistributionImpl.java:82) > at > org.apache.geode.distributed.internal.DistributionImpl$LifecycleListenerImpl.disconnect(DistributionImpl.java:904) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.stop(GMSMembership.java:1908) > at > org.apache.geode.distributed.internal.membership.gms.Services.stop(Services.java:302) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.shutdown(GMSMembership.java:1262) > at > org.apache.geode.distributed.internal.membership.gms.GMSMembership.disconnect(GMSMembership.java:1847) > at > org.apache.geode.distributed.internal.DistributionImpl.disconnect(DistributionImpl.java:501) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.uncleanShutdown(ClusterDistributionManager.java:1291) > {noformat} > That thread is waiting on a lock held by this thread (in ds3) which is > waiting on an acknowledgement to a PutAllPRMessage sent to ds1. > {noformat} > "vm_3_thr_37_dataStore3_host1_8592" #857 daemon prio=5 os_prio=0 > tid=0x00007fdb9c030800 nid=0x30d1 runnable [0x00007fdb732f0000] > java.lang.Thread.State: RUNNABLE > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378) > - locked <0x00000000f2643380> (a java.lang.Object) > at > org.apache.geode.internal.net.NioSslEngine.readAtLeast(NioSslEngine.java:330) > at > org.apache.geode.internal.tcp.MsgReader.readAtLeast(MsgReader.java:129) > at > org.apache.geode.internal.tcp.MsgReader.readHeader(MsgReader.java:58) > ==> - locked <0x00000000f2635b28> (a > org.apache.geode.internal.net.NioSslEngine) > at > org.apache.geode.internal.tcp.Connection.readAck(Connection.java:2652) > at > org.apache.geode.distributed.internal.direct.DirectChannel.readAcks(DirectChannel.java:392) > at > org.apache.geode.distributed.internal.direct.DirectChannel.sendToMany(DirectChannel.java:342) > at > org.apache.geode.distributed.internal.direct.DirectChannel.sendToOne(DirectChannel.java:182) > at > org.apache.geode.distributed.internal.direct.DirectChannel.send(DirectChannel.java:511) > at > org.apache.geode.distributed.internal.DistributionImpl.directChannelSend(DistributionImpl.java:346) > at > org.apache.geode.distributed.internal.DistributionImpl.send(DistributionImpl.java:291) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.sendViaMembershipManager(ClusterDistributionManager.java:2053) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.sendOutgoing(ClusterDistributionManager.java:1981) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.sendMessage(ClusterDistributionManager.java:2018) > at > org.apache.geode.distributed.internal.ClusterDistributionManager.putOutgoing(ClusterDistributionManager.java:1083) > at > org.apache.geode.internal.cache.partitioned.PutAllPRMessage.send(PutAllPRMessage.java:201) > at > org.apache.geode.internal.cache.PartitionedRegion.tryToSendOnePutAllMessage(PartitionedRegion.java:2839) > at > org.apache.geode.internal.cache.PartitionedRegion.sendMsgByBucket(PartitionedRegion.java:2621) > at > org.apache.geode.internal.cache.PartitionedRegion.postPutAllSend(PartitionedRegion.java:2392) > at > org.apache.geode.internal.cache.LocalRegionDataView.postPutAll(LocalRegionDataView.java:361) > at > org.apache.geode.internal.cache.LocalRegion.basicPutAll(LocalRegion.java:9154) > at > org.apache.geode.internal.cache.LocalRegion.putAll(LocalRegion.java:8903) > {noformat} > What we see is that the {{MsgReader}} in the second thread is not letting the > first thread close the socket. Until the socket is closed, the second thread > will be stuck in {{SocketChannel.read()}}. > *But why is the second thread stuck in {{SocketChannelImpl.read}}? That may > be due to GEODE-8651!* -- This message was sent by Atlassian Jira (v8.3.4#803005)