I've been looking at this in case we need a change in native before I
roll the 1.2.19 release.

On 25/11/2018 09:42, Rainer Jung wrote:
> I observed that, when building tcnative against OpenSSL 1.1.1, I ran
> into hangs when talking TLS 1.0 to Tomcat trunk using that tcnative plus
> Nio(2).
> 
> A simple "GET /" request, e.g. sent with curl, hangs for 60 seconds
> after a successful TLS handshake; then the client exits with "empty
> reply from server".
> 
> You can also reproduce it with openssl s_client. The request will hang
> until you send an additional empty line (on top of the usual empty
> line ending the request). That extra line triggers another read, which
> then finds the old request data and handles it.

I also see this with openssl s_client.

> The problem does not occur with the APR connector. APR and Nio(2) seem
> to use very different code paths in tcnative for TLS handling
> (sslnetwork.c versus ssl.c).
> 
> I have some understanding of the root cause but currently no good idea
> how to fix it. The root cause is incorrect handling of SSL_read when it
> returns "0". The OpenSSL man page has a relevant description at [1]. As
> also observed in mod_ssl (Apache web server), OpenSSL 1.1.1 behaves
> differently from older versions in that it can return "0" where older
> versions returned "-1". That was always documented as a possibility,
> but now it actually happens. The tcnative code used by APR handles
> this in the native part. The code used by Nio(2) simply returns the
> value it gets from SSL_read() and leaves it to the calling Java code to
> handle. Netty, from which we borrowed the ideas for Java plus
> OpenSSL, does include such code in ReferenceCountedOpenSslEngine.java,
> especially the SSL_ERROR_WANT_READ and SSL_ERROR_WANT_WRITE handling.
> 
> I could have experimented with their approach, but for some reason there
> seems to be another problem that makes it harder. The relevant call to
> SSL_read() returns "0", but a following SSL_get_error() does not return
> WANT_READ or WANT_WRITE; instead it returns "5", which is
> SSL_ERROR_SYSCALL. I do not have a good idea where this comes from.
> When tracing system calls, it seems to come from an EAGAIN on a socket
> read, but I am not sure about that.

I did not see this. All the error codes I saw were zero (which makes it
even harder to figure out a solution).

Which OS were you testing on? Where exactly did you observe that EAGAIN
error?
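For reference, a rough sketch of the kind of mapping the Nio(2) path could do natively instead of passing the raw SSL_read() return value up to Java. The SE_* values mirror SSL_ERROR_* in OpenSSL's ssl.h and are redeclared so the sketch builds without OpenSSL; classify_read() and the ACT_* actions are hypothetical names, not tcnative API. Treating SSL_ERROR_SYSCALL plus EAGAIN as "retry the read" is only an assumption based on Rainer's strace observation, not documented OpenSSL behaviour.

```c
#include <assert.h>
#include <errno.h>

/* Same numeric values as SSL_ERROR_* in <openssl/ssl.h>, redeclared
 * under hypothetical names so this sketch compiles without OpenSSL. */
enum ssl_err {
    SE_NONE = 0,             /* SSL_ERROR_NONE */
    SE_SSL = 1,              /* SSL_ERROR_SSL */
    SE_WANT_READ = 2,        /* SSL_ERROR_WANT_READ */
    SE_WANT_WRITE = 3,       /* SSL_ERROR_WANT_WRITE */
    SE_WANT_X509_LOOKUP = 4, /* SSL_ERROR_WANT_X509_LOOKUP */
    SE_SYSCALL = 5,          /* SSL_ERROR_SYSCALL */
    SE_ZERO_RETURN = 6       /* SSL_ERROR_ZERO_RETURN */
};

/* Hypothetical actions the native layer could report back to Java. */
enum read_action {
    ACT_DATA,        /* ret > 0: plaintext was read */
    ACT_RETRY_READ,  /* wait for the socket to be readable, then retry */
    ACT_RETRY_WRITE, /* flush pending output, then retry */
    ACT_CLOSED,      /* peer sent close_notify */
    ACT_FATAL        /* unrecoverable; shut the connection down */
};

/* ret: SSL_read() result; ssl_error: SSL_get_error(ssl, ret);
 * saved_errno: errno captured immediately after SSL_read(). */
enum read_action classify_read(int ret, int ssl_error, int saved_errno)
{
    /* SSL_get_error() reports SSL_ERROR_NONE if and only if ret > 0. */
    assert((ret > 0) == (ssl_error == SE_NONE));

    if (ret > 0)
        return ACT_DATA;

    switch (ssl_error) {
    case SE_WANT_READ:
        return ACT_RETRY_READ;
    case SE_WANT_WRITE:
        return ACT_RETRY_WRITE;
    case SE_ZERO_RETURN:
        return ACT_CLOSED;
    case SE_SYSCALL:
        /* ASSUMPTION (unverified): a SYSCALL error caused by a
         * non-blocking EAGAIN is treated like WANT_READ. */
        if (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK)
            return ACT_RETRY_READ;
        return ACT_FATAL;
    default:
        return ACT_FATAL;
    }
}
```

errno has to be saved right after SSL_read() returns, before any other library call can overwrite it.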

> In our Java code, what happens is a call to unwrap() in OpenSSLEngine.
> This call writes, I think, 146 bytes, then checks
> pendingReadableBytesInSSL(). That call in turn calls SSL.readFromSSL()
> and gets back "0" (from SSL_read()). Up in unwrap() we then skip the
> while loop and finally return with BUFFER_UNDERFLOW. Then we hang,
> probably because the data was read by OpenSSL and no more socket event
> happens. If I artificially add another call to
> pendingReadableBytesInSSL() which triggers another SSL_read(), the hang
> does not occur.

I have tried various ways to differentiate between "there is some data
there somewhere if you just keep trying" and "no, there really isn't any
data there" without success so far.
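Rainer's workaround (one more read after SSL_read() returns 0, before reporting BUFFER_UNDERFLOW) can be sketched as a bounded retry. read_fn, read_with_retry() and the scripted reader are hypothetical illustrations; the reader just replays canned return values. This only caps the "keep trying" case with max_attempts; it does not actually distinguish the two cases.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for SSL_read(): > 0 is bytes of plaintext,
 * 0 is "nothing returned this time", < 0 is "no more buffered data". */
typedef int (*read_fn)(void *ctx, char *buf, int len);

/* Issue up to max_attempts reads, retrying only when a read returns 0.
 * Returns the first non-zero result, or 0 if every attempt returned 0
 * (the caller would then report BUFFER_UNDERFLOW as it does today). */
int read_with_retry(read_fn rd, void *ctx, char *buf, int len,
                    int max_attempts)
{
    assert(rd != NULL && max_attempts > 0);
    for (int i = 0; i < max_attempts; i++) {
        int n = rd(ctx, buf, len);
        if (n != 0)
            return n; /* data (> 0) or a definite "no data" (< 0) */
    }
    return 0;
}

/* Scripted reader for illustration: replays a fixed list of results. */
struct scripted_reader {
    const int *results;
    int count;
    int pos;
};

int scripted_read(void *ctx, char *buf, int len)
{
    struct scripted_reader *r = ctx;
    (void)buf;
    (void)len;
    if (r->pos >= r->count)
        return -1; /* script exhausted: behave like "want read" */
    return r->results[r->pos++];
}
```

With a script of {0, 42}, the first read returns 0 and the second returns data, which mimics the case where an extra pendingReadableBytesInSSL() call avoids the hang.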

> IMHO TLS 1.0 is not such a big problem, but we should at least document
> it when we do a new release.
> 
> I might drill down into the native layer, checking errno etc., but I
> am not sure I will find the time.
> 
> [1]: https://www.openssl.org/docs/man1.1.1/man3/SSL_read.html

I'd like to spend a little more time looking at this before I tag the
release.

Mark
