Hello, I would like to share what I learned from a recent issue where Tomcat became unresponsive under load and couldn't recover without a restart. More surprisingly, the Tomcat acceptor thread (http-bio-8080-Acceptor-0) disappeared, which is what left Tomcat/the application in that unresponsive state.
*Env:* Tomcat 7.0.29 on Java 1.7.0_51 on CentOS 6.3, with the BIO (JIoEndpoint) connector.

*Issue:* Under heavy load (generated with Gatling), Tomcat/the application becomes unresponsive during the load test and does not recover even after the test ends. As a side effect, I see lots of TCP sockets in CLOSE_WAIT state with a non-empty Recv-Q.

*Observation:* I can telnet to the port, and curl even gets as far as sending the request, but it then hangs waiting for the response and eventually fails with "curl: (56) Recv failure: Connection reset by peer". Gatling requests consistently fail with request timeout errors. It takes no more than 10 seconds to reproduce the error. The application looks normal from a threads and memory perspective while it is unresponsive. Luckily I was able to reproduce the issue reliably, so I took heap dumps and thread dumps before and after multiple load tests. There were no errors in catalina.out or the application logs. Then I noticed that "after" the application goes unresponsive, the "http-bio-8080-Acceptor-0" thread is no longer in the thread dump; it has disappeared. If a thread exits its run method, it ends and we would no longer see it in a thread dump. So I went back to the Tomcat 7.0.29 source and remote-debugged the following method in *org/apache/tomcat/tomcat-coyote/7.0.29/tomcat-coyote-7.0.29.jar!/org/apache/tomcat/util/net/JIoEndpoint.class*:

    // --------------------------------------------------- Acceptor Inner Class
    /**
     * The background thread that listens for incoming TCP/IP connections and
     * hands them off to an appropriate processor.
     */
    protected class Acceptor extends AbstractEndpoint.Acceptor {

        @Override
        public void run() {

            int errorDelay = 0;

            // Loop until we receive a shutdown command
            while (running) {

                // Loop if endpoint is paused
                while (paused && running) {
                    state = AcceptorState.PAUSED;
                    try {
                        Thread.sleep(50);
                    } catch (InterruptedException e) {
                        // Ignore
                    }
                }

                if (!running) {
                    break;
                }
                state = AcceptorState.RUNNING;

                try {
                    //if we have reached max connections, wait
                    countUpOrAwaitConnection();

                    Socket socket = null;
                    try {
                        // Accept the next incoming connection from the server
                        // socket
                        socket = serverSocketFactory.acceptSocket(serverSocket);
                    } catch (IOException ioe) {
                        countDownConnection();
                        // Introduce delay if necessary
                        errorDelay = handleExceptionWithDelay(errorDelay);
                        // re-throw
                        throw ioe;
                    }
                    // Successful accept, reset the error delay
                    errorDelay = 0;

                    // Configure the socket
                    if (running && !paused && setSocketOptions(socket)) {
                        // Hand this socket off to an appropriate processor
                        if (!processSocket(socket)) {
                            countDownConnection();
                            // Close socket right away
                            closeSocket(socket);
                        }
                    } else {
                        countDownConnection();
                        // Close socket right away
                        closeSocket(socket);
                    }
                } catch (IOException x) {
                    if (running) {
                        log.error(sm.getString("endpoint.accept.fail"), x);
                    }
                } catch (NullPointerException npe) {
                    if (running) {
                        log.error(sm.getString("endpoint.accept.fail"), npe);
                    }
                } catch (Throwable t) {
                    ExceptionUtils.handleThrowable(t);
                    *log.error(sm.getString("endpoint.accept.fail"), t);*
                }
            }
            state = AcceptorState.ENDED;
        }
    }

Looking at the code above, where Tomcat accepts socket connections, I put breakpoints in each catch block during a load test (with remote debug), and execution stopped in the last catch (Throwable t) block. ExceptionUtils.handleThrowable checks the type of the throwable and, for certain fatal types, casts and re-throws it.
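For context on why that last catch block matters: handleThrowable only re-throws the fatal throwables and swallows everything else. A paraphrased sketch of org.apache.tomcat.util.ExceptionUtils.handleThrowable (from memory, wrapped in a hypothetical HandleThrowableSketch class so it runs standalone; not the verbatim 7.0.29 source) looks like this:

    public class HandleThrowableSketch {

        // Roughly what ExceptionUtils.handleThrowable does: re-throw fatal
        // throwables, silently swallow the rest.
        public static void handleThrowable(Throwable t) {
            if (t instanceof ThreadDeath) {
                throw (ThreadDeath) t;
            }
            if (t instanceof VirtualMachineError) {
                // OutOfMemoryError extends VirtualMachineError, so it is
                // re-thrown here instead of being swallowed.
                throw (VirtualMachineError) t;
            }
            // All other throwables are swallowed so the caller can carry on.
        }

        public static void main(String[] args) {
            try {
                handleThrowable(new OutOfMemoryError("unable to create new native thread"));
            } catch (VirtualMachineError e) {
                // The OOM escapes handleThrowable; inside Acceptor.run() this means
                // the log.error(...) line is skipped and the thread unwinds.
                System.out.println("re-thrown: " + e);
            }
        }
    }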
At this point there is no outer try/catch, and the log statement (the one after ExceptionUtils.handleThrowable) is skipped because handleThrowable itself re-throws, so the thread exits its run method and ends. I found (via Google) only a couple of reports of this kind of issue, so I thought I would share the details here for future reference.

*Root cause:* Tomcat was hitting an OutOfMemoryError and landing in that last catch block. The specific error was *"OutOfMemoryError: unable to create new native thread"*. Tomcat had maxConnections (and the executor's max thread count) set to 1000, while the underlying Linux user had an nproc limit of 1024 (the file descriptor limit was 65k, though). Since each JVM thread is backed by an LWP in the OS, the user was running out of its nproc limit, causing the OOM with the "unable to create new native thread" error. The *processSocket* call in the acceptor is where the socket is handed off to a thread in the executor pool; if no free worker thread is available, it attempts to create a new one, the OS refuses to create another LWP, and the error results.

*Solution:* Once I found the low per-user process ulimit, I fixed this by increasing the nproc limit in /etc/security/limits.d/90-nproc.conf (which is where user-specific nproc ulimits were applied in our case).

*Ask/Comment:* First of all, it was disappointing not to find any error logged for this scenario (unless my logging config was messed up, but I did set org.apache.tomcat.util to FINEST in tomcat/conf/logging.properties and even saw LimitLatch countUp/countDown entries in catalina.out). So, at the very least, would it make sense to swap ExceptionUtils.handleThrowable and log.error in the last catch block so that the error is at least reported (a rough sketch of that swap is in the PS below)? Additionally, it doesn't seem right for the main acceptor thread to simply end; we should report the error in the logs but keep accepting new connections. I also checked the JIoEndpoint.Acceptor code in Tomcat 8.0.24 <http://grepcode.com/file/repo1.maven.org/maven2/org.apache.tomcat/tomcat-coyote/8.0.24/org/apache/tomcat/util/net/JIoEndpoint.java> and it is the same code, so this should be reproducible on Tomcat 8 as well.

Appreciate feedback on the expectations/ask above.

Thanks,
Mihir Patel
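PS: To make the ask concrete, the swap I have in mind only touches the last catch block of Acceptor.run() quoted above; roughly (an untested sketch against 7.0.29, not an actual patch):

                } catch (Throwable t) {
                    // Log first, so that fatal errors such as OutOfMemoryError are
                    // at least recorded before handleThrowable re-throws them and
                    // the acceptor thread exits.
                    log.error(sm.getString("endpoint.accept.fail"), t);
                    ExceptionUtils.handleThrowable(t);
                }

Keeping the acceptor alive after such an error would be a bigger change (the accept loop would have to catch and survive the re-thrown error), so this sketch only covers the logging part of the ask.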