[ 
https://issues.apache.org/jira/browse/SOLR-13778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000976#comment-17000976
 ] 

Dawid Weiss commented on SOLR-13778:
------------------------------------

Ok, so here it comes. I started by looking at the stack trace of those nested 
"recv failed" exceptions:
{code}
java.net.SocketException: Software caused connection abort: recv failed
   [junit4]   2>        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
   [junit4]   2>        at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
{code}
The native method in question is implemented differently on Windows and on 
Linux. On Windows, the code we're interested in is here:

https://github.com/openjdk/jdk14/blob/f58a8cbed2ba984ceeb9a1ea59f917e3f9530f1e/src/java.base/windows/native/libnet/SocketInputStream.c#L120-L154

The core part is a switch on WSAGetLastError:
{code}
int err = WSAGetLastError();
...
switch (err) {
...
default:
  NET_ThrowCurrent(env, "recv failed");
}
{code}

This is where I needed to recompile the JDK to actually see which error code 
Winsock returns. And it's this one: 

WSAGetLastError returns 10053 (WSAECONNABORTED)

https://docs.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2

If you take a closer look at the source code of SocketInputStream.c, this error 
code is not covered by any case in the switch, so it falls through to the 
default branch. That's why we're getting the exception. 
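
To make the fall-through concrete, here is a small Java model of that switch. It is only a sketch: the numeric values are the documented Winsock codes, but the case labels and the outcome strings are one reading of the linked C file and should be checked against it. The class and method names are hypothetical.

```java
/** Hypothetical model of the error-code switch in the Windows SocketInputStream.c. */
final class RecvErrorModel {
    // Winsock error code values as documented by Microsoft.
    static final int WSAEINTR = 10004;
    static final int WSAECONNABORTED = 10053;
    static final int WSAECONNRESET = 10054;
    static final int WSAESHUTDOWN = 10058;
    static final int WSAETIMEDOUT = 10060;

    /** Returns a description of what the native code is assumed to throw for err. */
    static String outcomeFor(int err) {
        switch (err) {
            case WSAEINTR:
                return "SocketException: socket closed";
            case WSAECONNRESET:
            case WSAESHUTDOWN:
                return "ConnectionResetException";
            case WSAETIMEDOUT:
                return "SocketTimeoutException: Read timed out";
            default:
                // No dedicated case, so the generic path appends "recv failed"
                // to the system error text.
                return "SocketException: <system error text>: recv failed";
        }
    }

    public static void main(String[] args) {
        // WSAECONNABORTED has no case of its own and lands in the default branch.
        System.out.println(outcomeFor(WSAECONNABORTED));
    }
}
```

With 10053 the system error text is "Software caused connection abort", which matches the message in the stack trace above.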

And here comes the worst part: the reasons WHY Winsock aborts a recv with this 
error code are fairly vague. Microsoft says this:

"Software caused connection abort. An established connection was aborted by the 
software in your host computer, possibly due to a data transmission time-out or 
protocol error."

I tried to reproduce the same exception with a simple(r) example but totally 
failed. It SHOULD be possible to get the same exception using plain sockets (no 
SSL involved), but I always ended up with a regular connection-closed... I wish 
I could provide an example, because it would be more concrete proof of the 
problem (and a signal that the JDK implementation should throw a socket-closed 
exception for this condition). I can't figure out a way to do it, though, argh.
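
For reference, one shape such a plain-socket attempt could take is sketched below. This is not the test that was actually run, and it is not known to reproduce the "recv failed" error; it only shows a peer doing a hard close (SO_LINGER with a zero timeout) while the client reads. The class name and the choice of a hard close are assumptions, and the outcome depends on platform and timing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical probe: what does a client read see after the peer hard-closes? */
final class AbortProbe {
    static String probe() throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread closer = new Thread(() -> {
                try (Socket peer = server.accept()) {
                    // Linger on with a zero timeout: close() resets the connection
                    // instead of performing an orderly shutdown.
                    peer.setSoLinger(true, 0);
                } catch (IOException ignored) {
                    // The server socket was closed before a client connected.
                }
            });
            closer.start();
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
                client.setSoTimeout(5000); // never block the probe forever
                closer.join();             // wait until the peer has closed
                InputStream in = client.getInputStream();
                int b = in.read();
                return b < 0 ? "EOF" : "data";
            } catch (IOException e) {
                // Report whatever the platform decided to throw.
                return e.getClass().getSimpleName() + ": " + e.getMessage();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(probe());
    }
}
```

Running this on the affected Windows machines and comparing the printed outcome with the SSL case would at least show whether a hard close alone is enough to trigger WSAECONNABORTED.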

Going back to failures in Solr tests: I think the reason is that we shut down 
Jetty in the middle of the test but then reuse the same client that was 
previously connected to the instance that existed before the shutdown. If it's 
an SSL connection then there may be SSL protocol messages flying around in 
addition to user messages, and if they're issued on a closed socket connection 
they trigger this enigmatic "recv failed" error.

I think the client should be reinstantiated (or at least any existing 
connections dropped) for the tests to work reliably. If we want a client that 
is more resilient to dropped connections, we could look into 
SSLException/SocketException and try to parse the 'recv failed' message, but I 
think it makes little practical sense and is really hacky. Better to drop the 
request in real life and properly reinitialize the client in tests.
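
To show what the message-parsing option would amount to, here is a minimal sketch. The class and method names are hypothetical; it only walks the cause chain and matches on the message suffix seen in the traces, which is exactly why it is fragile: the text comes from the operating system and the JDK and could change or be localized.

```java
import java.io.IOException;
import java.net.SocketException;
import javax.net.ssl.SSLException;

/** Hypothetical helper illustrating the message-matching approach. */
final class RecvFailedDetector {
    private static final int MAX_DEPTH = 32; // guard against cyclic cause chains

    /** True if t or any of its causes is a SocketException ending in "recv failed". */
    static boolean isRecvFailed(Throwable t) {
        int depth = 0;
        for (Throwable c = t; c != null && depth < MAX_DEPTH; c = c.getCause(), depth++) {
            String msg = c.getMessage();
            if (c instanceof SocketException && msg != null && msg.endsWith("recv failed")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mirror the nesting from the test logs: SSLException wrapping SocketException.
        SocketException inner =
            new SocketException("Software caused connection abort: recv failed");
        SSLException outer = new SSLException(inner.getMessage(), inner);
        System.out.println(isRecvFailed(outer));
        System.out.println(isRecvFailed(new IOException("something else")));
    }
}
```

A test that simply recreates its client after restarting Jetty needs none of this.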


> Windows JDK SSL Test Failure trend: SSLException: Software caused connection 
> abort: recv failed
> -----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-13778
>                 URL: https://issues.apache.org/jira/browse/SOLR-13778
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>         Attachments: dumps-LegacyCloud.zip, logs-2019-12-12-1.zip, 
> recv-multiple-2019-12-18.zip
>
>
> Now that Uwe's Jenkins build has been correctly reporting its build results 
> for my [automated 
> reports|http://fucit.org/solr-jenkins-reports/failure-report.html] to pick 
> up, I've noticed a pattern of failures that indicate a definite problem with 
> using SSL on Windows (even with Java 11.0.4).
> The symptomatic stack traces all contain...
> {noformat}
> ...
>    [junit4]    > Caused by: javax.net.ssl.SSLException: Software caused connection abort: recv failed
>    [junit4]    >        at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:127)
> ...
>    [junit4]    > Caused by: java.net.SocketException: Software caused connection abort: recv failed
>    [junit4]    >        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
> ...
> {noformat}
> I suspect this may be related to 
> [https://bugs.openjdk.java.net/browse/JDK-8209333] but I have no concrete 
> evidence to back this up.
> I'll post some details of my analysis in comments...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
