[ https://issues.apache.org/jira/browse/SOLR-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000976#comment-17000976 ]
Dawid Weiss commented on SOLR-13778:
------------------------------------

OK, so here it comes. I started by looking at the stack trace of those nested "recv failed" exceptions:

{code}
java.net.SocketException: Software caused connection abort: recv failed
   [junit4]   2> 	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
   [junit4]   2> 	at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
{code}

The native method in question is different on Windows and on Linux. On Windows, the code we're interested in is here:

https://github.com/openjdk/jdk14/blob/f58a8cbed2ba984ceeb9a1ea59f917e3f9530f1e/src/java.base/windows/native/libnet/SocketInputStream.c#L120-L154

The core part is a switch on WSAGetLastError:

{code}
int err = WSAGetLastError();
...
switch (err) {
    ...
    default:
        NET_ThrowCurrent(env, "recv failed");
}
{code}

This is where I needed to recompile the JDK to actually see which error code Winsock returns. And it's this one:

WSAGetLastError returns 10053 (WSAECONNABORTED)
https://docs.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2

If you take a closer look at the source code of SocketInputStream.c, this return value is not covered by any case in the switch, so it ends up in the default branch -- that's why we're getting the exception.

And here comes the worst part: the reasons WHY Winsock aborts a recv with this result are fairly vague. Microsoft says this:

"Connection reset by peer. An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close (see setsockopt for more information on the SO_LINGER option on the remote socket). This error may also result if a connection was broken due to keep-alive activity detecting a failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET.
Subsequent operations fail with WSAECONNRESET."

I tried to reproduce the same exception on a simple(r) example but totally failed. It SHOULD be possible to get the same exception using plain sockets (no SSL involved), but I always ended up getting a regular connection-closed exception... I wish I could provide an example of this because it'd be more concrete proof of the problem (and a signal that the JDK implementation should return a socket closed exception for this condition). I can't figure out a way to do it, though, argh.

Going back to the failures in Solr tests: I think the reason is that we shut down Jetty in the middle of the test but then reuse the same client that was previously connected to the instance that was shut down. If it's an SSL connection, then there may be SSL comms flying around in addition to user messages, and if they're issued on a closed socket connection they trigger this enigmatic "recv failed" error.

I think the client should be reinstantiated (or at least any existing connections dropped) for the tests to work reliably. If we want a client that is more resilient to dropped connections, we could try to inspect SSLException/SocketException and parse the 'recv failed' message, but I think it makes little practical sense and is really hacky. Better to drop the request in real life and properly reinitialize the client in tests.

> Windows JDK SSL Test Failure trend: SSLException: Software caused connection abort: recv failed
> -----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13778
>                 URL: https://issues.apache.org/jira/browse/SOLR-13778
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: dumps-LegacyCloud.zip, logs-2019-12-12-1.zip, recv-multiple-2019-12-18.zip
>
>
> Now that Uwe's jenkins build has been correctly reporting its build results for my [automated reports|http://fucit.org/solr-jenkins-reports/failure-report.html] to pick up, I've noticed a pattern of failures that indicate a definite problem with using SSL on Windows (even with java 11.0.4)
> The symptomatic stack traces all contain...
> {noformat}
> ...
>    [junit4]    > Caused by: javax.net.ssl.SSLException: Software caused connection abort: recv failed
>    [junit4]    > 	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:127)
> ...
>    [junit4]    > Caused by: java.net.SocketException: Software caused connection abort: recv failed
>    [junit4]    > 	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
> ...
> {noformat}
> I suspect this may be related to [https://bugs.openjdk.java.net/browse/JDK-8209333] but I have no concrete evidence to back this up.
> I'll post some details of my analysis in comments...
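
For reference, the cause-chain check that the comment calls hacky could look roughly like the sketch below. It is only an illustration, not code from Solr or the JDK: the class name RecvFailedDetector and the method isRecvFailed are hypothetical, and matching on the "recv failed" message text is an assumption based solely on the stack traces quoted in this issue.

```java
import java.net.SocketException;
import javax.net.ssl.SSLException;

// Hypothetical helper (not part of Solr or the JDK): walks an exception's
// cause chain looking for the Windows "recv failed" SocketException.
public class RecvFailedDetector {
    // Message fragment taken from the stack traces quoted in this issue.
    private static final String RECV_FAILED = "recv failed";

    /** True if any exception in the chain is a SocketException mentioning "recv failed". */
    public static boolean isRecvFailed(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t instanceof SocketException
                    && t.getMessage() != null
                    && t.getMessage().contains(RECV_FAILED)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mimics the nesting seen in the test failures: SSLException wrapping SocketException.
        Throwable nested = new SSLException(
                "Software caused connection abort: recv failed",
                new SocketException("Software caused connection abort: recv failed"));
        System.out.println(isRecvFailed(nested));
        System.out.println(isRecvFailed(new SocketException("Connection reset")));
    }
}
```

Matching on message text is brittle, since the string comes from native code and may differ between JDK versions or platforms, which is one more reason to prefer reinitializing the client in tests.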