[ https://issues.apache.org/jira/browse/GEODE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274531#comment-17274531 ]
Bill Burcham commented on GEODE-8817: ------------------------------------- This hang happens on 1.14.0 (currently on the {{develop}} branch). It does not occur on the latest {{support/1.13}} branch. Bisection on the {{develop}} branch shows the hang happens on the third commit on that branch after the {{support/1.13}} branch was cut. That commit {{02839eae79ae078c30a422b1b0ff3c23866ccdbd}} was one of three on May 4, 2020—the day {{support/1.13}} was cut. I didn't bother to test either of the previous two commits. The hang is still present as of the January 8, 2021 commit {{762db7e86d3969205e2729e4900ff997ca659b74}} Oddly the hang is _not_ happening on recent commits on the {{support/1.13}} branch! How can this be? (it's surprising because, if the problem was introduced before {{support/1.13}} was cut, as it seems to have been, then {{support/1.13}} should exhibit the problem) Bisection on {{support/1.13}} shows that commit {{32a8a55df4c0a5b8b1e5cfb746070804cbce5e0a}} "fixes" this hang. Paradoxically, that commit made to {{support/1.13}} on November 6, 2020: {code} 32a8a55df4c0a5b8b1e5cfb746070804cbce5e0a GEODE-8652 NIOSslEngine.close() bypasses locks cherry-picked from af267c005a63317cbb8528cdb38eccf6a8747818 {code} …is addressing a hang in P2P messaging, whereas this ticket pertains to a hang in client-server messaging. Also paradoxical is that that commit was cherry-picked from the {{develop}} (at that time 1.14.0) branch—but that commit on {{develop}}/1.14.0 does _not_ fix the hang there! To summarize: *the fix for GEODE-8652 fixes the hang on {{support/1.13}} but not on {{develop}}/1.14.0*. How can this be? All I can think of is that there is some _interaction_ between the GEODE-8652 fix that changes the timing (of the server-side of the client-server code) in a way that, due to _other differences_ between the branches, causes the test to hang reliably on {{develop}} but not on {{support/1.13}}. This is not a comforting answer. > server hangs in cache close with ssl enabled due to active client connection; > client side (CacheClientUpdater.close()) is hung in > SSLSocketImpl$AppInputStream.deplete() > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > Key: GEODE-8817 > URL: https://issues.apache.org/jira/browse/GEODE-8817 > Project: Geode > Issue Type: Bug > Components: client/server, security > Affects Versions: 1.14.0 > Reporter: Bill Burcham > Assignee: Bill Burcham > Priority: Major > Labels: blocks-1.14.0 > > A proprietary TLS/SSL-enabled application encountered a network partition. A > server hangs in cache close due to active client connection; client side > ({{CacheClientUpdater.close()}}) is hung in > {{SSLSocketImpl$AppInputStream.deplete()}} > The configuration is: > {noformat} > ========================================== > losingSide | survivingSide > ========================================== > 11110 | 10627 > 11115 | 10632 > ------------------------------------------ > 11139 | 10655 > | 10662 > ------------------------------------------ > {noformat} > The stuck threads were stuck in sun's SSL code. Geode's client/Server > framework uses old I/O and that was also part of where they were stuck. If > the clients had closed their connections to the server then it would not have > been stuck here. But the server shutdown shouldn't hang because of client > that refuses to disconnect. > The Geode client-side of the connection is hung here: > {code:java} > \[warn 2020/11/06 14:56:56.577 PST <ThreadsMonitor> tid=0x18] Thread <50> > (0x32) that was executed at <06 Nov 2020 14:55:43 PST> has been stuck for > <72.81 seconds> and number of thread monitor iteration <1> > Thread Name <poolTimer-brclient-7> state <BLOCKED> > Waiting on <sun.security.ssl.SSLSocketImpl$AppInputStream@3a9b72d3> > Owned By <Cache Client Updater Thread on > 10.32.108.224(bridgep2_host2_10627:10627)<v3>:41003 port 27636> with ID <43> > Executor Group <ScheduledThreadPoolExecutorWithKeepAlive> > Monitored metric <ResourceManagerStats.numThreadsStuck> > Thread stack: > sun.security.ssl.SSLSocketImpl$AppInputStream.deplete(SSLSocketImpl.java:1016) > sun.security.ssl.SSLSocketImpl$AppInputStream.access$100(SSLSocketImpl.java:816) > sun.security.ssl.SSLSocketImpl.bruteForceCloseInput(SSLSocketImpl.java:702) > sun.security.ssl.SSLSocketImpl.duplexCloseOutput(SSLSocketImpl.java:553) > sun.security.ssl.SSLSocketImpl.close(SSLSocketImpl.java:485) > org.apache.geode.internal.cache.tier.sockets.CacheClientUpdater.close(CacheClientUpdater.java:546) > org.apache.geode.cache.client.internal.QueueConnectionImpl.internalDestroy(QueueConnectionImpl.java:112) > org.apache.geode.cache.client.internal.QueueManagerImpl.endpointCrashed(QueueManagerImpl.java:379) > org.apache.geode.cache.client.internal.QueueManagerImpl.connectionCrashed(QueueManagerImpl.java:357) > org.apache.geode.cache.client.internal.QueueConnectionImpl.destroy(QueueConnectionImpl.java:88) > org.apache.geode.cache.client.internal.OpExecutorImpl.handleException(OpExecutorImpl.java:645) > org.apache.geode.cache.client.internal.OpExecutorImpl.handleException(OpExecutorImpl.java:504) > org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:334) > org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:303) > org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:839) > org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:38) > org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90) > org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1329) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:279) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > java.lang.Thread.run(Thread.java:748) > Lock owner thread stack > java.net.SocketInputStream.socketRead0(Native Method) > java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > java.net.SocketInputStream.read(SocketInputStream.java:171) > java.net.SocketInputStream.read(SocketInputStream.java:141) > sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:475) > sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:469) > sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:69) > sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1228) > sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:75) > sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:915) > org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:830) > org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:677) > org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1140) > org.apache.geode.internal.cache.tier.sockets.CacheClientUpdater.processMessages(CacheClientUpdater.java:1591) > org.apache.geode.internal.cache.tier.sockets.CacheClientUpdater.run(CacheClientUpdater.java:483) > {code} > From the Geode log we can see the expected exceptions for messages to losing > side (Operation not permitted) after the network is dropped: > {code:java} > [warn 2020/11/06 14:50:20.865 PST <Geode Failure Detection thread 3> > tid=0x9b] Unable to send message to > 10.32.110.94(bridgep1_host1_11115:11115)<v4>:41001 > java.io.IOException: Operation not permitted (sendto failed) > at java.net.PlainDatagramSocketImpl.send(Native Method) > at java.net.DatagramSocket.send(DatagramSocket.java:693) > at org.jgroups.protocols.UDP._send(UDP.java:224) > at org.jgroups.protocols.UDP.sendUnicast(UDP.java:215) > at org.jgroups.protocols.TP.sendToSingleMember(TP.java:1985) > at org.jgroups.protocols.TP.doSend(TP.java:1962) > at > org.apache.geode.distributed.internal.membership.gms.messenger.Transport.doSend(Transport.java:85) > at org.jgroups.protocols.TP.send(TP.java:1948) > at > org.apache.geode.distributed.internal.membership.gms.messenger.Transport._send(Transport.java:57) > at org.jgroups.protocols.TP.down(TP.java:1515) > at org.jgroups.stack.Protocol.down(Protocol.java:439) > at > org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.down(StatRecorder.java:87) > at org.jgroups.protocols.UNICAST3.down(UNICAST3.java:646) > at org.jgroups.protocols.FlowControl.down(FlowControl.java:347) > at org.jgroups.protocols.FRAG2.down(FRAG2.java:136) > at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1039) > at org.jgroups.JChannel.down(JChannel.java:790) > at org.jgroups.JChannel.send(JChannel.java:426) > at > org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.send(JGroupsMessenger.java:819) > at > org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.send(JGroupsMessenger.java:670) > at > org.apache.geode.distributed.internal.membership.gms.fd.GMSHealthMonitor.doCheckMember(GMSHealthMonitor.java:504) > at > org.apache.geode.distributed.internal.membership.gms.fd.GMSHealthMonitor.lambda$checkMember$1(GMSHealthMonitor.java:455) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)