https://bz.apache.org/bugzilla/show_bug.cgi?id=66508
Bug ID: 66508 Summary: Tomcat after a GC pause causes the HTTP threads to be blocked to acquire a semaphore to process WebSockets connection closure. Product: Tomcat 8 Version: Tomcat 8.5.84 Hardware: All OS: All Status: NEW Severity: normal Priority: P2 Component: WebSocket Assignee: dev@tomcat.apache.org Reporter: mvssrmur...@gmail.com Target Milestone: ---- For this issue to happen there are three events that need to happen and they are listed below: Event 1. Tomcat is going through a GC Pause. Event 2. WebSocket connection at the client is closed (not processed by the tomcat as it is going through a GC pause) Event 3. Tomcat is trying to write a message on the same connection as above but has not started as the tomcat is going through a pause. Event 2 happens on an HTTP thread (In the below stack trace it is http-nio2-8082-exec-7) and Event 3 happens on a background thread (pool-12-thread-1). If Event 3 happens before Event 2, then the semaphore acquired by the event 3 thread is not released and the event 2 thread is blocked for a duration of org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT (default value is 20 seconds). If I am not wrong event 3 thread never releases the semaphore. If there are 5000 connections on a Tomcat and all going through the above issue, causes all the Tomcat HTTP thread pools to stuck for a long duration not processing any incoming HTTP requests. In a way, each tomcat HTTP thread tries to process the socket close and waits to acquire the semaphore for 20 seconds. With a thread pool of 50, and 5000 connections then it takes 100 * 20 Seconds (default time out) = 2000 seconds = 33.33 Minutes. In our environment, we are using CometD (https://cometd.org/) to support transport independent client-to-server and server-to-client communication via Bayeux protocol. This protocol sends heartbeat messages every 30 seconds to detect the availability of both parties. If the server does not respond within 30 seconds + 10 seconds (Configurable grace period) the client considers the server is not available and re-connects by opening a new connection and closing the old one. Due to a GC Pause at the server, the client won't receive a heartbeat message within a certain interval and is closing the connection. The server when it processes this socket close message writes a response back to the client about it (not sure why it is needed or even if it is needed why it is trying to acquire semaphore with max waiting for org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT). Please review below stack traces. The main issue is http-ni02-8082-exec-7 thread is trying to acquire a semaphore for 20 seconds and pool-12-thread-1 trying to write a message never releases the lock. My debugging points out that the pool-12-thread-1 in our environment is doing a nonblocking write and for some reason, this never completes and never calls to release the semaphore. Let me know if you need any further information. Event 2 Stack Trace: "http-nio2-8082-exec-7" #112 daemon prio=5 os_prio=31 tid=0x00007fc9c07e9000 nid=0x1f403 waiting on condition [0x0000700008eb7000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000748319390> (a java.util.concurrent.Semaphore$NonfairSync) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409) at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:299) at org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:262) at org.apache.tomcat.websocket.WsSession.sendCloseMessage(WsSession.java:816) at org.apache.tomcat.websocket.WsSession.onClose(WsSession.java:739) - locked <0x000000074831a820> (a java.lang.Object) at org.apache.tomcat.websocket.WsFrameBase.processDataControl(WsFrameBase.java:367) at org.apache.tomcat.websocket.WsFrameBase.processData(WsFrameBase.java:296) at org.apache.tomcat.websocket.WsFrameBase.processInputBuffer(WsFrameBase.java:133) at org.apache.tomcat.websocket.server.WsFrameServer.onDataAvailable(WsFrameServer.java:85) at org.apache.tomcat.websocket.server.WsFrameServer.doOnDataAvailable(WsFrameServer.java:183) at org.apache.tomcat.websocket.server.WsFrameServer.notifyDataAvailable(WsFrameServer.java:162) at org.apache.tomcat.websocket.server.WsHttpUpgradeHandler.upgradeDispatch(WsHttpUpgradeHandler.java:157) at org.apache.coyote.http11.upgrade.UpgradeProcessorInternal.dispatch(UpgradeProcessorInternal.java:60) at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:59) at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:831) at org.apache.tomcat.util.net.Nio2Endpoint$SocketProcessor.doRun(Nio2Endpoint.java:1615) at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) - locked <0x0000000748310ca8> (a org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper) at org.apache.tomcat.util.net.AbstractEndpoint.processSocket(AbstractEndpoint.java:1111) at org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper$2.completed(Nio2Endpoint.java:564) at org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper$2.completed(Nio2Endpoint.java:542) at sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:126) at sun.nio.ch.Invoker$2.run(Invoker.java:218) at sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112) at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) at java.lang.Thread.run(Thread.java:748) Event 3 Stack Trace (made changes to the tomcat code to print the stack trace) and is happening on pool-12-thread-1 java.lang.Thread.getStackTrace(Thread.java:1559) org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.onWritePossible(WsRemoteEndpointImplServer.java:175) org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.doWrite(WsRemoteEndpointImplServer.java:81) org.apache.tomcat.websocket.WsRemoteEndpointImplBase.writeMessagePart(WsRemoteEndpointImplBase.java:545) org.apache.tomcat.websocket.WsRemoteEndpointImplBase.startMessage(WsRemoteEndpointImplBase.java:411) org.apache.tomcat.websocket.WsRemoteEndpointImplBase$TextMessageSendHandler.write(WsRemoteEndpointImplBase.java:881) org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendStringByCompletion(WsRemoteEndpointImplBase.java:216) org.apache.tomcat.websocket.WsRemoteEndpointAsync.sendText(WsRemoteEndpointAsync.java:47) org.cometd.websocket.server.WebSocketTransport.send(WebSocketTransport.java:118) org.cometd.websocket.server.WebSocketTransport.send(WebSocketTransport.java:52) org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:196) org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler$Flusher.process(AbstractWebSocketTransport.java:627) org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241) org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223) org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:420) org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:522) org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.schedule(AbstractWebSocketTransport.java:500) org.cometd.websocket.server.WebSocketTransport$WebSocketScheduler$1.schedule(WebSocketTransport.java:154) org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler$MetaConnectReplyTask.run(AbstractWebSocketTransport.java:548) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266) java.util.concurrent.FutureTask.run(FutureTask.java) Steps to Reproduce: ------------------- Tomcat version:8.5.72 Java Version: 1.8 (If I am not wrong this issue will happen on higher java versions and also tomcat versions too) 1. Create a WebSocket Endpoint. 2. Let the client connect to the server and make the server send messages at a rate of 1 message per 5 seconds, if not able to reproduce it increase the rate. 3. Either create a GC Pause if that is not possible keep the IDE in debug mode and pause all threads at a breakpoint. 4. Close the client connection. 5. Unpause the debug mode. 6. Take a thread dump. To understand the problem at scale, increase the number of connections to 1000 or 5000 and perform the above steps. I made a change to the following line. If opCode is OPCODE_CLOSE instead of waiting to acquire semaphore with time out, just do a tryAcquire. This fixes the issue. But not sure what are the implications of this change. Ideally, we need to find out why pool-12-thread-1 (listed above) flow is not releasing the lock when the socket is closed at the client or not even failing to write and release the semaphore. https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/websocket/WsRemoteEndpointImplBase.java#L299 if (opCode == Constants.OPCODE_CLOSE) { if (!messagePartInProgress.tryAcquire()) { String msg = sm.getString("wsRemoteEndpoint.acquireTimeout"); wsSession.doClose(new CloseReason(CloseCodes.GOING_AWAY, msg), new CloseReason(CloseCodes.CLOSED_ABNORMALLY, msg), true); throw new SocketTimeoutException(msg); } } else if (!messagePartInProgress.tryAcquire(timeout, TimeUnit.MILLISECONDS)) { String msg = sm.getString("wsRemoteEndpoint.acquireTimeout"); wsSession.doClose(new CloseReason(CloseCodes.GOING_AWAY, msg), new CloseReason(CloseCodes.CLOSED_ABNORMALLY, msg), true); throw new SocketTimeoutException(msg); } Let me know if you have any questions. Thank you. -- You are receiving this mail because: You are the assignee for the bug. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org For additional commands, e-mail: dev-h...@tomcat.apache.org