One of our hosting customers is a medical practice using a commercial EMR
running on tomcat+mysql. It has operated well for over a year, but users have
suddenly begun experiencing slowness for about an hour at the same time every
day. During the slow times, we've done all the usual troubleshooting to catch
the problem in the act. The servers have plenty of power and are not
overworked. There are no slow database queries. Network connectivity is solid.
Tomcat has plenty of memory. The numbers of database connections, threads,
questions, queries, etc., remain steady, without spikes. There is no unusual
disk latency. We have not found any maintenance tasks running during that
timeframe.
The customer has another load-balanced tomcat instance on a different physical
server, and the problem happens on that one, too. The servers were upgraded
with a new kernel and packages on 4/5/24, but the issue did not appear until
5/6/24. The vendor enabled a new feature in the customer's software, and the
problem appeared the next day. They subsequently disabled the feature, but
(reportedly) the problem did not go away. It is worth mentioning that the
servers are multi-tenanted, with other customers running the same medical
application, but the others do not experience the slowdowns, even though they
are on the same servers.
There are no unusual errors in the tomcat or database server logs, EXCEPT this
one: java.sql.DriverManager.getConnection
During the periods of slowness, we see lots of those errors along with a large
spike in the number of stuck tomcat threads (from 1 or 2 to as high as 100). It
seems obvious that the threads are stuck because tomcat is waiting on a
connection to the database. However, tcpdump shows that connectivity to the
database is perfect at the network and application layers. There are no
unanswered SYNs, no retransmissions, no half-open connections, no failures to
allocate TCP ports, no conntrack messages, and no other indications of system
resource exhaustion. Every time tomcat requests a connection to the DB, it
completes in less than 1 ms. Ten thousand connection attempts completed
successfully in about 15 seconds, with zero failures.
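To pin down what the stuck threads are actually waiting on, something like the following could help. It is only a minimal sketch: the class name StuckThreadFinder and its methods are made up for illustration, and it would have to run inside the Tomcat JVM (for example from a diagnostic JSP or servlet) to see Tomcat's threads. Running jstack against the Tomcat PID during a slow period gives the same information without any code.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic helper, not part of Tomcat or the vendor's application.
public class StuckThreadFinder {

    /** Returns true if any frame in the stack is className.methodName. */
    public static boolean hasFrame(StackTraceElement[] stack, String className, String methodName) {
        for (StackTraceElement frame : stack) {
            if (frame.getClassName().equals(className)
                    && frame.getMethodName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Maps thread name to thread state for every live thread in this JVM that is
     * currently inside the given method. Threads sharing a name overwrite each other.
     */
    public static Map<String, Thread.State> findThreadsInFrame(String className, String methodName) {
        Map<String, Thread.State> hits = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            if (hasFrame(e.getValue(), className, methodName)) {
                hits.put(e.getKey().getName(), e.getKey().getState());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Run standalone this reports zero threads; it is only useful inside the Tomcat JVM.
        Map<String, Thread.State> hits =
                findThreadsInFrame("java.sql.DriverManager", "getConnection");
        System.out.println(hits.size() + " thread(s) inside DriverManager.getConnection");
        hits.forEach((name, state) -> System.out.println("  " + name + " -> " + state));
    }
}
```

If most of the threads turn out to be BLOCKED, that would suggest lock contention inside the JVM. If they are RUNNABLE and sitting in a socket read, that would point back toward the network or the database server.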
We are forced to conclude that some database connection requests are being
initiated but are not being sent on the wire. The problem seems to be in the
interaction between tomcat and the database driver, or in the driver itself.
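One way to test that theory would be to time DriverManager.getConnection from inside a JVM and compare the result with the sub-millisecond handshakes seen in tcpdump. The sketch below is an assumption about how that could be done, not something we have run: ConnectTimer and Result are invented names, and the JDBC URL, user, and password are placeholders passed on the command line. It needs the same MySQL driver on the classpath that the application uses.

```java
import java.sql.DriverManager;
import java.util.concurrent.Callable;

// Hypothetical timing harness; all names here are for illustration only.
public final class ConnectTimer {

    public static final class Result {
        public int slow;      // attempts that took longer than the threshold
        public int failed;    // attempts that threw an exception
        public double maxMs;  // slowest attempt, in milliseconds
    }

    /** Calls the supplied action repeatedly and records slow and failed attempts. */
    public static Result run(Callable<?> call, int attempts, double thresholdMs) {
        Result r = new Result();
        for (int i = 0; i < attempts; i++) {
            Object resource = null;
            long start = System.nanoTime();
            try {
                resource = call.call();
            } catch (Exception e) {
                r.failed++;
            }
            double ms = (System.nanoTime() - start) / 1_000_000.0;
            r.maxMs = Math.max(r.maxMs, ms);
            if (ms > thresholdMs) {
                r.slow++;
            }
            // Close outside the timed region so only the connect itself is measured.
            if (resource instanceof AutoCloseable) {
                try {
                    ((AutoCloseable) resource).close();
                } catch (Exception ignored) {
                    // Not relevant to the measurement.
                }
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // args: <jdbc-url> <user> <password>, all placeholders supplied by the operator.
        Result r = run(() -> DriverManager.getConnection(args[0], args[1], args[2]), 1000, 50.0);
        System.out.println("slow=" + r.slow + " failed=" + r.failed + " maxMs=" + r.maxMs);
    }
}
```

If getConnection is slow here while the wire shows fast handshakes, the delay is happening before the driver ever opens a socket.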
Unfortunately, the application vendor is taking the "it's your infrastructure"
position without providing any evidence or offering suggestions for
configuration changes, other than to deploy more tomcat instances, which is
just shooting in the dark. They don't know why the software is throwing
java.sql.DriverManager.getConnection errors (even though it's their code), and
they've relegated the investigation to us.
Any advice from the community would be greatly appreciated.
RHEL 8.9, kernel 4.18.0-513.18.1.el8_9.x86_64
Apache Tomcat/9.0.80, JVM 1.8.0_372-b07
(The tomcat and JVM versions are the ones recommended by the vendor.)
We're standing by to provide whatever other information the community may need.
Thanks tons!
-Eric