gaurav-narula opened a new pull request, #16354: URL: https://github.com/apache/kafka/pull/16354
We observed that some runs of the test suite caused CI pipelines to stall. A thread dump revealed that the test runner was blocked reading from a socket while attempting to close that socket [[0]].

It turns out this is due to a bug in the JDK that is very similar to [JDK-8274524](https://bugs.openjdk.org/browse/JDK-8274524), but it affects the else branch of `SSLSocketImpl::bruteForceCloseInput` [[1]], which wasn't fixed in JDK-8274524. Since the blocking happens in a native call, the test runner's timeouts have no effect, as the blocked test runner thread doesn't seem to respond to interrupts.

As a mitigation in Kafka's test suite, this change adds an `SO_TIMEOUT` of 30 seconds to all the TLS sockets handled by `EchoServer`. The timeout is reasonably high for tests, and a finite upper bound prevents the test suite from blocking indefinitely.

[0]: https://issues.apache.org/jira/secure/attachment/13066427/timeout.log
[1]: https://github.com/openjdk/jdk/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/java.base/share/classes/sun/security/ssl/SSLSocketImpl.java#L808

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
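For illustration, the effect of the `SO_TIMEOUT` mitigation described in this pull request can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code from the PR: the class `SoTimeoutSketch` and the helper `readTimesOut` are hypothetical names, and plain `Socket` objects stand in for the TLS sockets that `EchoServer` handles (`SSLSocket` extends `Socket`, so `setSoTimeout` is available there too). A short timeout is used so the example finishes quickly; the change itself uses 30 seconds.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; this is not Kafka's EchoServer.
public class SoTimeoutSketch {

    // Returns true if a read on the accepted socket gives up after timeoutMs
    // instead of blocking forever when the peer never sends any data.
    static boolean readTimesOut(int timeoutMs) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            // Bound every blocking read on this socket (SO_TIMEOUT).
            accepted.setSoTimeout(timeoutMs);
            InputStream in = accepted.getInputStream();
            try {
                in.read(); // the client stays silent, so this would otherwise block
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the finite upper bound took effect
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Without the `setSoTimeout` call, the `read()` above would block for as long as the peer keeps the connection open and silent, which is the kind of unbounded wait the mitigation aims to avoid.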
