https://bz.apache.org/bugzilla/show_bug.cgi?id=66508

            Bug ID: 66508
           Summary: Tomcat after a GC pause causes the HTTP threads to be
                    blocked to acquire a semaphore to process WebSockets
                    connection closure.
           Product: Tomcat 8
           Version: Tomcat 8.5.84
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: WebSocket
          Assignee: dev@tomcat.apache.org
          Reporter: mvssrmur...@gmail.com
  Target Milestone: ----

For this issue to happen there are three events that need to happen and they
are listed below:

Event 1. Tomcat is going through a GC Pause.
Event 2. WebSocket connection at the client is closed (not processed by the
tomcat as it is going through a GC pause)
Event 3. Tomcat is trying to write a message on the same connection as above
but has not started as the tomcat is going through a pause.

Event 2 happens on an HTTP thread (In the below stack trace it is
http-nio2-8082-exec-7) and Event 3 happens on a background thread
(pool-12-thread-1).

If Event 3 happens before Event 2, then the semaphore acquired by the event 3
thread is not released and the event 2 thread is blocked for a duration of
org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT (default value is 20
seconds). If I am not wrong event 3 thread never releases the semaphore.

If there are 5000 connections on a Tomcat and all going through the above
issue, causes all the Tomcat HTTP thread pools to stuck for a long duration not
processing any incoming HTTP requests. In a way, each tomcat HTTP thread tries
to process the socket close and waits to acquire the semaphore for 20 seconds.
With a thread pool of 50, and 5000 connections then it takes 100 * 20 Seconds
(default time out) = 2000 seconds = 33.33 Minutes.


In our environment, we are using CometD (https://cometd.org/) to support
transport independent client-to-server and server-to-client communication via
Bayeux protocol. This protocol sends heartbeat messages every 30 seconds to
detect the availability of both parties. If the server does not respond within
30 seconds + 10 seconds (Configurable grace period) the client considers the
server is not available and re-connects by opening a new connection and closing
the old one. Due to a GC Pause at the server, the client won't receive a
heartbeat message within a certain interval and is closing the connection. The
server when it processes this socket close message writes a response back to
the client about it (not sure why it is needed or even if it is needed why it
is trying to acquire semaphore with max waiting for
org.apache.tomcat.websocket.BLOCKING_SEND_TIMEOUT).

Please review below stack traces. The main issue is http-ni02-8082-exec-7
thread is trying to acquire a semaphore for 20 seconds and pool-12-thread-1
trying to write a message never releases the lock. My debugging points out that
the pool-12-thread-1 in our environment is doing a nonblocking write and for
some reason, this never completes and never calls to release the semaphore.

Let me know if you need any further information.

Event 2 Stack Trace:

"http-nio2-8082-exec-7" #112 daemon prio=5 os_prio=31 tid=0x00007fc9c07e9000
nid=0x1f403 waiting on condition [0x0000700008eb7000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000748319390> (a
java.util.concurrent.Semaphore$NonfairSync)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
        at
org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:299)
        at
org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendMessageBlock(WsRemoteEndpointImplBase.java:262)
        at
org.apache.tomcat.websocket.WsSession.sendCloseMessage(WsSession.java:816)
        at org.apache.tomcat.websocket.WsSession.onClose(WsSession.java:739)
        - locked <0x000000074831a820> (a java.lang.Object)
        at
org.apache.tomcat.websocket.WsFrameBase.processDataControl(WsFrameBase.java:367)
        at
org.apache.tomcat.websocket.WsFrameBase.processData(WsFrameBase.java:296)
        at
org.apache.tomcat.websocket.WsFrameBase.processInputBuffer(WsFrameBase.java:133)
        at
org.apache.tomcat.websocket.server.WsFrameServer.onDataAvailable(WsFrameServer.java:85)
        at
org.apache.tomcat.websocket.server.WsFrameServer.doOnDataAvailable(WsFrameServer.java:183)
        at
org.apache.tomcat.websocket.server.WsFrameServer.notifyDataAvailable(WsFrameServer.java:162)
        at
org.apache.tomcat.websocket.server.WsHttpUpgradeHandler.upgradeDispatch(WsHttpUpgradeHandler.java:157)
        at
org.apache.coyote.http11.upgrade.UpgradeProcessorInternal.dispatch(UpgradeProcessorInternal.java:60)
        at
org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:59)
        at
org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:831)
        at
org.apache.tomcat.util.net.Nio2Endpoint$SocketProcessor.doRun(Nio2Endpoint.java:1615)
        at
org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
        - locked <0x0000000748310ca8> (a
org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper)
        at
org.apache.tomcat.util.net.AbstractEndpoint.processSocket(AbstractEndpoint.java:1111)
        at
org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper$2.completed(Nio2Endpoint.java:564)
        at
org.apache.tomcat.util.net.Nio2Endpoint$Nio2SocketWrapper$2.completed(Nio2Endpoint.java:542)
        at sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:126)
        at sun.nio.ch.Invoker$2.run(Invoker.java:218)
        at
sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112)
        at
org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
        at
org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
        at
org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:748)






Event 3 Stack Trace (made changes to the tomcat code to print the stack trace)
and is happening on pool-12-thread-1

java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.onWritePossible(WsRemoteEndpointImplServer.java:175)
org.apache.tomcat.websocket.server.WsRemoteEndpointImplServer.doWrite(WsRemoteEndpointImplServer.java:81)
org.apache.tomcat.websocket.WsRemoteEndpointImplBase.writeMessagePart(WsRemoteEndpointImplBase.java:545)
org.apache.tomcat.websocket.WsRemoteEndpointImplBase.startMessage(WsRemoteEndpointImplBase.java:411)
org.apache.tomcat.websocket.WsRemoteEndpointImplBase$TextMessageSendHandler.write(WsRemoteEndpointImplBase.java:881)
org.apache.tomcat.websocket.WsRemoteEndpointImplBase.sendStringByCompletion(WsRemoteEndpointImplBase.java:216)
org.apache.tomcat.websocket.WsRemoteEndpointAsync.sendText(WsRemoteEndpointAsync.java:47)
org.cometd.websocket.server.WebSocketTransport.send(WebSocketTransport.java:118)
org.cometd.websocket.server.WebSocketTransport.send(WebSocketTransport.java:52)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:196)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler$Flusher.process(AbstractWebSocketTransport.java:627)
org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:241)
org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:223)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:420)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.send(AbstractWebSocketTransport.java:522)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler.schedule(AbstractWebSocketTransport.java:500)
org.cometd.websocket.server.WebSocketTransport$WebSocketScheduler$1.schedule(WebSocketTransport.java:154)
org.cometd.websocket.server.common.AbstractWebSocketTransport$AbstractWebSocketScheduler$MetaConnectReplyTask.run(AbstractWebSocketTransport.java:548)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
java.util.concurrent.FutureTask.run(FutureTask.java)





Steps to Reproduce:
-------------------
Tomcat version:8.5.72
Java Version: 1.8
(If I am not wrong this issue will happen on higher java versions and also
tomcat versions too)

1. Create a WebSocket Endpoint.
2. Let the client connect to the server and make the server send messages at a
rate of 1 message per 5 seconds, if not able to reproduce it increase the rate.
3. Either create a GC Pause if that is not possible keep the IDE in debug mode
and pause all threads at a breakpoint.
4. Close the client connection.
5. Unpause the debug mode.
6. Take a thread dump.

To understand the problem at scale, increase the number of connections to 1000
or 5000 and perform the above steps.




I made a change to the following line. If opCode is OPCODE_CLOSE instead of
waiting to acquire semaphore with time out, just do a tryAcquire.
This fixes the issue. But not sure what are the implications of this change.
Ideally, we need to find out why pool-12-thread-1 (listed above)  flow is not
releasing the lock when the socket is closed at the client or not even failing
to write and release the semaphore.

https://github.com/apache/tomcat/blob/8.5.x/java/org/apache/tomcat/websocket/WsRemoteEndpointImplBase.java#L299


            if (opCode == Constants.OPCODE_CLOSE)  {
                if (!messagePartInProgress.tryAcquire()) {
                    String msg =
sm.getString("wsRemoteEndpoint.acquireTimeout");
                    wsSession.doClose(new CloseReason(CloseCodes.GOING_AWAY,
msg),
                        new CloseReason(CloseCodes.CLOSED_ABNORMALLY, msg),
true);
                    throw new SocketTimeoutException(msg);
                }
            } else if (!messagePartInProgress.tryAcquire(timeout,
TimeUnit.MILLISECONDS)) {
                String msg = sm.getString("wsRemoteEndpoint.acquireTimeout");
                wsSession.doClose(new CloseReason(CloseCodes.GOING_AWAY, msg),
                        new CloseReason(CloseCodes.CLOSED_ABNORMALLY, msg),
true);
                throw new SocketTimeoutException(msg);
            }



Let me know if you have any questions.

Thank you.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org

Reply via email to