[
https://issues.apache.org/jira/browse/SOLR-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000976#comment-17000976
]
Dawid Weiss commented on SOLR-13778:
------------------------------------
OK, so here it comes. I started by looking at the stack trace of those nested
"recv failed" exceptions:
{code}
java.net.SocketException: Software caused connection abort: recv failed
[junit4] 2> at java.base/java.net.SocketInputStream.socketRead0(Native Method)
[junit4] 2> at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
{code}
The native method in question is different on Windows and on Linux. On Windows
the code we're interested in is here:
https://github.com/openjdk/jdk14/blob/f58a8cbed2ba984ceeb9a1ea59f917e3f9530f1e/src/java.base/windows/native/libnet/SocketInputStream.c#L120-L154
The core part is a switch on WSAGetLastError:
{code}
int err = WSAGetLastError();
...
switch (err) {
...
default:
NET_ThrowCurrent(env, "recv failed");
}
{code}
This is where I needed to recompile the JDK to actually see which error code
Winsock returns. And it's this one:
WSAGetLastError returns 10053 (WSAECONNABORTED)
https://docs.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2
If you take a closer look at the source code of SocketInputStream.c, this error
code has no case of its own in the switch, so it falls through to the default
branch -- that's why we're getting the "recv failed" exception.
And here comes the worst part: the reasons as to WHY Winsock aborts a recv
with this result type are fairly vague. Microsoft says this:
"Software caused connection abort. An established connection was aborted by the
software in your host computer, possibly due to a data transmission time-out or
protocol error."
I tried to reproduce the same exception on a simple(r) example but totally
failed. It SHOULD be possible to get the same exception using plain sockets (no
SSL involved) but I always ended up getting regular connection-closed... I wish
I could provide an example of this because it'd be more concrete proof of the
problem (and a signal that the JDK implementation should throw a socket-closed
exception for this condition). Can't figure out a way to do it though, argh.
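For reference, a minimal sketch of the kind of plain-socket experiment described above might look like the following. It is not a reproduction of error 10053: it only sets up a loopback connection, has the server side do a hard close (SO_LINGER with a zero timeout, which is just one assumed way to provoke an abrupt teardown), and reports whatever the client's read observes. The class and method names (AbruptCloseProbe, probe) are made up for illustration, and the outcome is platform-dependent, so the code prints the result instead of assuming one.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class AbruptCloseProbe {

    /** Returns a short description of what the client's read() observed. */
    static String probe() throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {

            // Assumption: a hard close is one way to tear the connection down
            // abruptly. SO_LINGER with a zero timeout discards pending data
            // and resets the connection instead of closing it gracefully.
            accepted.setSoLinger(true, 0);
            accepted.close();

            // Do not hang forever if the platform never signals the teardown.
            client.setSoTimeout(5000);
            try {
                int b = client.getInputStream().read();
                return "read returned " + b;
            } catch (IOException e) {
                // On an affected Windows JDK this is where "recv failed" would
                // surface; other platforms report something different.
                return e.getClass().getName() + ": " + e.getMessage();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(probe());
    }
}
```

Running this on the Windows/JDK combination from the failing builds and comparing the printed message against other platforms would show whether the plain-socket path can hit the default branch at all.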
Going back to failures in Solr tests: I think the reason is that we shut down
Jetty in the middle of the test but then reuse the same client that was
previously connected to the instance that is now gone. If it's an SSL connection
then there may be SSL comms flying around in addition to user messages and if
they're issued on a closed socket connection they trigger this enigmatic recv
failed error.
I think the client should be reinstantiated (or at least any existing
connections dropped) for the tests to work reliably. If we want a more
connection-drop resilient client we could try to look into SSLException/
SocketException and try to parse the 'recv failed' message, but I think it makes
little practical sense and is really hacky. Better to drop the request in real
life and properly reinitialize the client in tests.
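As a rough illustration of the "reinitialize the client" option, the sketch below wraps any Closeable client behind a holder that a test can reset after restarting the server. It deliberately uses no Solr classes: ReconnectingClientHolder and its methods are hypothetical names, and the factory stands in for however the test actually builds its client.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.function.Supplier;

/** Hypothetical helper: hands out a client and recreates it on demand. */
public class ReconnectingClientHolder<C extends Closeable> implements Closeable {
    private final Supplier<C> factory;
    private C current;

    public ReconnectingClientHolder(Supplier<C> factory) {
        this.factory = factory;
    }

    /** Returns the current client, creating one lazily if none exists. */
    public synchronized C get() {
        if (current == null) {
            current = factory.get();
        }
        return current;
    }

    /**
     * Closes the current client (and with it any pooled connections) so the
     * next get() builds a fresh one. Intended to be called right after the
     * server the client was talking to has been shut down or restarted.
     */
    public synchronized void reset() throws IOException {
        C old = current;
        current = null;
        if (old != null) {
            old.close();
        }
    }

    @Override
    public synchronized void close() throws IOException {
        reset();
    }
}
```

A test would call reset() immediately after stopping Jetty, so that no request is ever issued over a socket that belonged to the old instance.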
> Windows JDK SSL Test Failure trend: SSLException: Software caused connection
> abort: recv failed
> -----------------------------------------------------------------------------------------------
>
> Key: SOLR-13778
> URL: https://issues.apache.org/jira/browse/SOLR-13778
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Chris M. Hostetter
> Priority: Major
> Attachments: dumps-LegacyCloud.zip, logs-2019-12-12-1.zip,
> recv-multiple-2019-12-18.zip
>
>
> Now that Uwe's jenkins build has been correctly reporting its build results
> for my [automated
> reports|http://fucit.org/solr-jenkins-reports/failure-report.html] to pick
> up, I've noticed a pattern of failures that indicate a definite problem with
> using SSL on Windows (even with Java 11.0.4).
> The symptomatic stack traces all contain...
> {noformat}
> ...
> [junit4] > Caused by: javax.net.ssl.SSLException: Software caused connection abort: recv failed
> [junit4] > at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:127)
> ...
> [junit4] > Caused by: java.net.SocketException: Software caused connection abort: recv failed
> [junit4] > at java.base/java.net.SocketInputStream.socketRead0(Native Method)
> ...
> {noformat}
> I suspect this may be related to
> [https://bugs.openjdk.java.net/browse/JDK-8209333] but I have no concrete
> evidence to back this up.
> I'll post some details of my analysis in comments...