I've been looking at this in case we need a change in native before I roll the 1.2.19 release.

On 25/11/2018 09:42, Rainer Jung wrote:
> I observed that when building tcnative against OpenSSL 1.1.1 I ran into
> hangs when talking TLS 1.0 with Tomcat trunk using that tcnative plus
> Nio(2).
>
> A simple "GET /" request, e.g. sent with curl, hangs for 60 seconds after
> a successful TLS handshake, then the client ends with an "empty reply
> from server".
>
> You can also reproduce with openssl s_client. The request will hang
> until you send another additional empty line (in addition to the usual
> empty line ending the request). The additional one will then trigger
> another read which will find the old request data and handle it.

I also see this with openssl s_client.

> The problem does not occur with the APR connector. APR and Nio(2) seem
> to use very different code paths in tcnative for TLS handling
> (sslnetwork.c versus ssl.c).
>
> I have some understanding of the root cause but currently no good idea
> how to fix it. The root cause is incorrect handling of SSL_read when it
> returns "0". The OpenSSL man page has a relevant description at [1]. As
> observed also in mod_ssl (Apache web server), OpenSSL 1.1.1 behaves
> differently from older versions in that it can return "0" where old
> versions returned "-1". That was always documented as a possibility but
> now really happens in practice. The tcnative code used by APR handles
> this in the native part. The code used by Nio(2) simply returns the
> value it gets from SSL_read() and leaves it to the calling Java to
> handle that. netty, from which we borrowed the ideas for Java plus
> OpenSSL, does include such code in ReferenceCountedOpenSslEngine.java,
> especially the SSL_ERROR_WANT_READ and SSL_ERROR_WANT_WRITE handling.
>
> I could have experimented with their approach, but for some reason there
> seems to be another problem that makes it harder. The relevant call to
> SSL_read() returns "0", but does not return WANT_READ or WANT_WRITE from
> a following SSL_get_error(), but instead "5", which is
> SSL_ERROR_SYSCALL. I do not have a good idea where this comes from.
> When tracing system calls, it seems it comes from an EAGAIN in a socket
> read, but I am not sure about that.

I did not see this. All the error codes I saw were zero (which makes it
even harder to figure out a solution). Which OS were you testing? Where
exactly did you observe that EAGAIN error?

> In our Java code, what happens is a call to unwrap() in OpenSSLEngine.
> This call writes I think 146 bytes, then checks
> pendingReadableBytesInSSL(). That call in turn calls SSL.readFromSSL()
> and gets back "0" (from SSL_read()). Up in unwrap() we then skip the
> while loop and finally return with BUFFER_UNDERFLOW. Then we hang,
> probably because the data was read by OpenSSL and no more socket event
> happens. If I artificially add another call to
> pendingReadableBytesInSSL() which triggers another SSL_read(), the hang
> does not occur.

I have tried various ways to differentiate between "there is some data
there somewhere if you just keep trying" and "no, there really isn't any
data there" without success so far.

> IMHO TLS 1.0 is not such a big problem, but we should at least document
> it when we do a new release.
>
> I might drill down debugging into the native layer checking errno etc.
> but I am not sure I will find the time.
>
> [1]: https://www.openssl.org/docs/man1.1.1/man3/SSL_read.html

I'd like to spend a little more time looking at this before I tag the
release.

Mark
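
For reference, the cases discussed above can be laid out as a small
decision function. This is only a sketch under stated assumptions, not the
tcnative or netty code and not a fix: the names classify_ssl_read,
read_action and the SK_ constants are hypothetical, the constants mirror
the numeric values of OpenSSL's SSL_ERROR_* codes so the snippet compiles
without the OpenSSL headers, and treating SSL_ERROR_SYSCALL with EAGAIN as
"retry" is an assumption based on the strace observation quoted above.

```c
#include <errno.h>

/* Local mirrors of OpenSSL's SSL_ERROR_* values (assumption: real code
 * would include <openssl/ssl.h> and use the library's own macros). */
enum {
    SK_SSL_ERROR_NONE        = 0,
    SK_SSL_ERROR_SSL         = 1,
    SK_SSL_ERROR_WANT_READ   = 2,
    SK_SSL_ERROR_WANT_WRITE  = 3,
    SK_SSL_ERROR_SYSCALL     = 5,
    SK_SSL_ERROR_ZERO_RETURN = 6
};

/* Hypothetical outcome type for the caller of SSL_read(). */
typedef enum {
    READ_DATA,      /* ret > 0: application data was returned */
    READ_RETRY,     /* no data yet: wait for a socket event and retry */
    READ_CLOSED,    /* peer sent close_notify */
    READ_AMBIGUOUS, /* ret <= 0 but no error reported (case seen in this thread) */
    READ_FATAL      /* anything else */
} read_action;

/* ret:       return value of SSL_read()
 * ssl_err:   result of SSL_get_error() for that call
 * sys_errno: errno captured immediately after SSL_read() */
read_action classify_ssl_read(int ret, int ssl_err, int sys_errno)
{
    if (ret > 0) {
        return READ_DATA;
    }
    switch (ssl_err) {
    case SK_SSL_ERROR_WANT_READ:
    case SK_SSL_ERROR_WANT_WRITE:
        return READ_RETRY;
    case SK_SSL_ERROR_ZERO_RETURN:
        return READ_CLOSED;
    case SK_SSL_ERROR_SYSCALL:
        /* Assumption: a would-block errno means "try again later". */
        if (sys_errno == EAGAIN || sys_errno == EWOULDBLOCK) {
            return READ_RETRY;
        }
        return READ_FATAL;
    case SK_SSL_ERROR_NONE:
        /* The sketch cannot tell "keep trying" from "no data" here. */
        return READ_AMBIGUOUS;
    default:
        return READ_FATAL;
    }
}
```

The READ_AMBIGUOUS branch is the open question in this thread: a return
of "0" with every error code at zero gives the caller nothing to
distinguish pending data from no data.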
