Re: Database Connection Requests Initiated but Not Sent on the Wire (Some, Not All)
Hi Eric,

I have some follow-up questions in-line. I have also read the other messages in
this thread and added a couple of additional questions based on what I read in
those threads.

On 26/05/2024 02:58, Eric Robinson wrote:
> One of our hosting customers is a medical practice using a commercial EMR
> running on tomcat+mysql. It has operated well for over a year, but users have
> suddenly begun experiencing slowness for about an hour at the same time
> every day.

What time does this problem start?

Does it occur every day of the week, including weekends?

How does the slowness correlate to:
- request volume
- requests to any particular URL(s)?
- requests from any particular client IP?
- any other attribute of the request?

(I'm trying to see if there is something about the requests that triggers the
issue.)

> During the slow times, we've done all the usual troubleshooting to catch the
> problem in the act. The servers have plenty of power and are not overworked.
> There are no slow database queries. Network connectivity is solid. Tomcat has
> plenty of memory. The numbers of database connections, threads, questions,
> queries, etc., remain steady, without spikes. There is no unusual disk latency.
> We have not found any maintenance tasks running during that timeframe.

I would usually suggest taking three thread dumps approximately 5s apart and
then diffing them to try and spot "slow moving" threads.

I see you have scripted triggering a thread dump when the slowness hits. If you
haven't already, please configure it to capture (at least) 3 dumps ~5 seconds
apart.

(If we can spot the slow moving threads we might be able to identify what it is
that makes them slow moving.)

> The customer has another load-balanced tomcat instance on a different
> physical server, and the problem happens on that one, too. The servers were
> upgraded with a new kernel and packages on 4/5/24, but the issue did not
> appear until 5/6/24. The vendor enabled a new feature in the customer's
> software, and the problem appeared the next day, but they subsequently
> disabled the feature, and (reportedly) the problem did not go away.

Have you confirmed that the feature really is disabled? Or was it just hidden?

Has this feature been enabled for any other customers? If yes, have they
experienced similar issues?

(It is suspicious that the issue occurred after the feature was disabled. I
wonder if some elements of that change (e.g. a database change) are still in
place and causing issues.)

> It is worth mentioning that the servers are multi-tenanted, with other
> customers running the same medical application, but the others do not
> experience the slowdowns, even though they are on the same servers.

How does this customer compare, in terms of volume of requests, to other
customers that are not experiencing this issue? Is there anything unique or
special about the customer experiencing the issue? Do they have some custom
settings no-one else uses?

(I am trying to figure out if the issue is load related, customer specific or
something else.)

> There are no unusual errors in the tomcat or database server logs, EXCEPT
> this one:
>
> java.sql.DriverManager.getConnection

Can we see the full stack trace please?

> During the periods of slowness, we see lots of those errors along with a large
> spike in the number of stuck tomcat threads (from 1 or 2 to as high as 100). It
> seems obvious that the threads are stuck because tomcat is waiting on a
> connection to the database. However, tcpdump shows that connectivity to the
> database is perfect at the network and application layers.
> There are no unanswered SYNs, no retransmissions, no half-open connections,
> no failures to allocate TCP ports, no conntrack messages, and no other
> indications of system resource exhaustion. Every time tomcat requests a
> connection to the DB, it completes in less than 1 ms. Ten thousand connection
> attempts completed successfully in about 15 seconds, with zero failures.

It sounds like things might be getting stuck somewhere in or near the JDBC
driver. Can you provide the exact version of the JDBC driver you are using?

Can you provide the full database configuration from context.xml (or wherever
it is configured)? Please redact sensitive information such as passwords.

> We are forced to conclude that some database connection requests are being
> initiated but are not being sent on the wire. The problem seems to be in the
> interaction between tomcat and the database driver, or in the driver itself.

I agree.

> Unfortunately, the application vendor is taking the "it's your infrastructure"
> position without providing any evidence or offering suggestions for
> configuration changes,

I'm sorry to hear that. We'll do what we can to help.

> other than to deploy more tomcat instances, which is just shooting in the
> dark. They don't know why the software is throwing
> java.sql.DriverManager.getConnection errors (even though it's their code),
> and they'
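A minimal standalone sketch of the kind of connection-latency test described above: timing raw DriverManager.getConnection() calls from outside Tomcat. The URL and credentials are placeholders and the 100 ms reporting threshold is arbitrary. If a loop like this stays consistently fast while Tomcat threads stall, the delay is more likely inside the JVM (for example, lock contention) than on the wire.

import java.sql.Connection;
import java.sql.DriverManager;

// Sketch only: time standalone connection attempts against the database.
// URL and credentials are placeholders; adjust for the real environment.
public class ConnectionLatencyTest {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://ha52a:5791/mobiledoc_791?useSSL=false"; // placeholder
        String user = "redacted";
        String pass = "redacted";

        DriverManager.setLoginTimeout(10); // fail fast rather than hang indefinitely

        long worstMs = 0;
        for (int i = 0; i < 10_000; i++) {
            long start = System.nanoTime();
            try (Connection c = DriverManager.getConnection(url, user, pass)) {
                // connection opened and immediately closed
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            worstMs = Math.max(worstMs, elapsedMs);
            if (elapsedMs > 100) {
                System.out.println("Attempt " + i + " took " + elapsedMs + " ms");
            }
        }
        System.out.println("Worst case: " + worstMs + " ms over 10,000 attempts");
    }
}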
RE: Database Connection Requests Initiated but Not Sent on the Wire (Some, Not All)
Hi Mark,

> -----Original Message-----
> From: Mark Thomas
> Sent: Tuesday, May 28, 2024 3:42 AM
> To: users@tomcat.apache.org
> Subject: Re: Database Connection Requests Initiated but Not Sent on the Wire
> (Some, Not All)
>
> Hi Eric,
>
> I have some follow-up questions in-line. I have also read the other messages
> in this thread and added a couple of additional questions based on what I
> read in those threads.
>
> On 26/05/2024 02:58, Eric Robinson wrote:
> > One of our hosting customers is a medical practice using a commercial EMR
> > running on tomcat+mysql. It has operated well for over a year, but users
> > have suddenly begun experiencing slowness for about an hour at the same
> > time every day.
>
> What time does this problem start?

It typically starts around 9:15 am EDT and goes until around 10:30 am.

> Does it occur every day of the week, including weekends?

Most weekdays. There have been 1 or 2 weekdays when the symptom inexplicably
did not appear. I'm not sure about weekends, as the medical practice does not
work on those days.

> How does the slowness correlate to:
> - request volume
> - requests to any particular URL(s)?
> - requests from any particular client IP?
> - any other attribute of the request?
>
> (I'm trying to see if there is something about the requests that triggers the
> issue.)

We have not seen anything stand out. There are no apparent spikes in request
volume. The slowness appears to impact all parts of the system (meaning all
URLs). It manifests for the customer, but we have also seen it when we connect
to the app internally, behind the firewall and reverse proxy, directly to the
tomcat server from a workstation connected to the same switch.

> > During the slow times, we've done all the usual troubleshooting to catch
> > the problem in the act. The servers have plenty of power and are not
> > overworked. There are no slow database queries. Network connectivity is
> > solid. Tomcat has plenty of memory. The numbers of database connections,
> > threads, questions, queries, etc., remain steady, without spikes. There is
> > no unusual disk latency. We have not found any maintenance tasks running
> > during that timeframe.
>
> I would usually suggest taking three thread dumps approximately 5s apart and
> then diffing them to try and spot "slow moving" threads.
>
> I see you have scripted triggering a thread dump when the slowness hits. If
> you haven't already, please configure it to capture (at least) 3 dumps
> ~5 seconds apart.
>
> (If we can spot the slow moving threads we might be able to identify what it
> is that makes them slow moving.)

We finished and implemented the script yesterday, so today will be the first
day that it produces results. It watches the catalina.out file for stuck
thread detection warnings. When the number of stuck threads exceeds a
threshold, it starts doing thread dumps every 60 seconds until the count drops
back down below the threshold. The users typically do not complain of slowness
until the stuck thread count exceeds 20, and during that time the threads
often take up to a minute or more to complete. It's too late today to change
the timings, but if it does not produce any actionable intel, we can adjust
them tonight.

> > The customer has another load-balanced tomcat instance on a different
> > physical server, and the problem happens on that one, too. The servers
> > were upgraded with a new kernel and packages on 4/5/24, but the issue did
> > not appear until 5/6/24.
> > The vendor enabled a new feature in the customer's software, and the
> > problem appeared the next day, but they subsequently disabled the feature,
> > and (reportedly) the problem did not go away.
>
> Have you confirmed that the feature really is disabled? Or was it just
> hidden?

The vendor claims that the feature uses a different server and does not send
requests to the slow ones, so it has been re-enabled at the customer's
request. We may ask them to disable it again until we get this issue resolved.

> Has this feature been enabled for any other customers? If yes, have they
> experienced similar issues?
>
> (It is suspicious that the issue occurred after the feature was disabled. I
> wonder if some elements of that change (e.g. a database change) are still in
> place and causing issues.)

We agree that it is suspicious, but at this point we are forced to give it the
side-eye. We're not aware of other customers being impacted, but (a) it's a
new AI-based feature, so not many other customers have it, (b) it is enabled
by the vendor directly, so we are not in the notification loop, and (c) the
problem customer is large, with about 800 staff, whereas most other customers
are much smaller and might not trigger the symptom. Bottom line, we're not
*sure*, but we think the feature is unrelated; we'll ask them to disable it
anyway.

> > It is worth mentioning that the servers are multi-tenanted, with other
> > customers running the same medical app
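For reference, a rough sketch of the kind of multi-dump capture Mark suggests, done from inside the JVM with ThreadMXBean instead of driving jstack from the catalina.out-watching script described above. The three-dump count and ~5 second spacing follow the earlier suggestion; everything else (how it is triggered, where the output goes) is illustrative.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch only: capture three thread dumps ~5 seconds apart, e.g. from a
// diagnostic servlet or a small JMX-triggered helper. Equivalent in spirit
// to running "jstack <pid>" three times.
public class ThreadDumpCapture {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int dump = 1; dump <= 3; dump++) {
            System.out.println("=== Thread dump " + dump + " at "
                    + System.currentTimeMillis() + " ===");
            // lockedMonitors/lockedSynchronizers = true so the dump shows
            // which thread holds which lock
            for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                System.out.print(info);
            }
            if (dump < 3) {
                Thread.sleep(5_000);
            }
        }
    }
}

Note that ThreadInfo.toString() truncates deep stacks, so full jstack or "kill -3" output is still more useful for analysis; treat this purely as a sketch of the timing.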
Re: Database Connection Requests Initiated but Not Sent on the Wire (Some, Not All)
Hi Eric,

Follow-up observations and comments in-line.

>> What time does this problem start?
>
> It typically starts around 9:15 am EDT and goes until around 10:30 am.

Does that match the time of highest request load from the customer? Rather
than a spike, I'm wondering if the problem is triggered once load exceeds some
threshold.

> We finished and implemented the script yesterday, so today will be the first
> day that it produces results. It watches the catalina.out file for stuck
> thread detection warnings. When the number of stuck threads exceeds a
> threshold, it starts doing thread dumps every 60 seconds until the count
> drops back down below the threshold. The users typically do not complain of
> slowness until the stuck thread count exceeds 20, and during that time the
> threads often take up to a minute or more to complete. It's too late today
> to change the timings, but if it does not produce any actionable intel, we
> can adjust them tonight.

Let's see what that produces and go from there.

> The vendor claims that the feature uses a different server and does not send
> requests to the slow ones, so it has been re-enabled at the customer's
> request. We may ask them to disable it again until we get this issue
> resolved.

Noted.

> This customer sends about 1.5 million requests to each load-balanced server
> during a typical production day. Most other customers send much less, often
> only a fraction of that. However, at least one customer sends about 2
> million to the same server, and they don't see the problem. (I will check if
> they have the AI feature enabled.)

Hmm. Whether that other customer has the AI feature enabled would be an
interesting data point.

>> Can we see the full stack trace please?
>
> Here's one example.
>
> java.lang.Throwable
>     at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1252)
>     at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1220)

That is *very* interesting. That is the start of a synch block in the class
loader. It should complete quickly. The full thread dump should tell us what
is holding the lock. If we are lucky we'll be able to tell why the lock is
being held for so long.

We might need to reduce the time between thread dumps to figure out what the
thread that is blocking everything is doing. We'll see.

> The app has DB connection details in two places. First, it uses a database
> connection string in a .properties file, as follows. This string handles
> most connections to the DB.
>
> mobiledoc.DBUrl=jdbc:mysql://ha52a:5791
> mobiledoc.DBName=mobiledoc_791?useSSL=false&zeroDateTimeBehavior=round&jdbcCompliantTruncation=false&cacheServerConfiguration=true&dumpQueriesOnException=true&tinyInt1isBit=false&useOldAliasMetadataBehavior=true&dontTrackOpenResources=true
> mobiledoc.DBUser=
> mobiledoc.DBPassword=

OK. That seems unlikely to be using connection pooling although the
application might be pooling internally.

> It also has a second DB config specifically for a drug database.
>
> c:\out.log
> INSERT_CONTEXT_FACTORY
> INSERT_JNDI_URL
> INSERT_USER_NAME
> INSERT_PASSWORD
> INSERT_LOOKUP_NAME
> com.mysql.jdbc.Driver
> jdbc:mysql://dbclust54:5791/medispan?sessionVariables=wait_timeout=28800,interactive_timeout=28800
> redacted
> redacted
> 10
> 5000
> true
> 0
> 1800

Hmm. There is a pool size setting there but we can't tell if it is being used.

>> Is that Tomcat 9.0.80 as provided by the ASF?

An explicit answer to this question would be helpful.
In terms of the way forward, we need to see the thread dumps taken while the
problem is happening to figure out where the blockage is and (hopefully) why.

Mark
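Once those dumps exist, something along the following lines can help locate the blockage described above: group the threads that are "waiting to lock" the same monitor and report which thread holds it. This is only a rough sketch assuming the usual HotSpot/jstack dump layout; the file name is a placeholder.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: for each monitor address in a HotSpot-format thread dump,
// count the threads waiting on it and note which thread has it locked.
public class LockContentionSummary {
    private static final Pattern THREAD_HEADER = Pattern.compile("^\"(.+?)\"");
    private static final Pattern WAITING = Pattern.compile("waiting to lock <(0x[0-9a-f]+)>");
    private static final Pattern LOCKED = Pattern.compile("- locked <(0x[0-9a-f]+)>");

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("threaddump.txt")); // placeholder path
        Map<String, Integer> waiters = new HashMap<>();
        Map<String, String> holders = new HashMap<>();
        String currentThread = "unknown";
        for (String line : lines) {
            Matcher header = THREAD_HEADER.matcher(line);
            if (header.find()) {
                currentThread = header.group(1);
            }
            Matcher waiting = WAITING.matcher(line);
            if (waiting.find()) {
                waiters.merge(waiting.group(1), 1, Integer::sum);
            }
            Matcher locked = LOCKED.matcher(line);
            if (locked.find()) {
                holders.put(locked.group(1), currentThread);
            }
        }
        waiters.forEach((monitor, count) -> System.out.println(
                monitor + ": " + count + " thread(s) waiting, held by "
                        + holders.getOrDefault(monitor, "(holder not found)")));
    }
}

If many request threads turn out to be queued behind a single monitor held by one thread (for example inside WebappClassLoaderBase.loadClass), the stack of that holding thread is the interesting part.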
RE: Database Connection Requests Initiated but Not Sent on the Wire (Some, Not All)
Hi Mark,

See comments below.

> -----Original Message-----
> From: Mark Thomas
> Sent: Tuesday, May 28, 2024 9:32 AM
> To: Tomcat Users List
> Subject: Re: Database Connection Requests Initiated but Not Sent on the Wire
> (Some, Not All)
>
> Hi Eric,
>
> Follow-up observations and comments in-line.
>
> >> What time does this problem start?
> >
> > It typically starts around 9:15 am EDT and goes until around 10:30 am.
>
> Does that match the time of highest request load from the customer? Rather
> than a spike, I'm wondering if the problem is triggered once load exceeds
> some threshold.

My nginx proxy console only shows live activity and does not keep a history,
but I can probably script something to parse the localhost_access logs and
graph request counts on a per-minute basis. Will work on that.

> > We finished and implemented the script yesterday, so today will be the
> > first day that it produces results. It watches the catalina.out file for
> > stuck thread detection warnings. When the number of stuck threads exceeds
> > a threshold, it starts doing thread dumps every 60 seconds until the count
> > drops back down below the threshold. The users typically do not complain
> > of slowness until the stuck thread count exceeds 20, and during that time
> > the threads often take up to a minute or more to complete. It's too late
> > today to change the timings, but if it does not produce any actionable
> > intel, we can adjust them tonight.
>
> Let's see what that produces and go from there.

Took a bunch of thread and heap dumps during today's painful debacle. Will
send a link to those as soon as I can.

> > The vendor claims that the feature uses a different server and does not
> > send requests to the slow ones, so it has been re-enabled at the
> > customer's request. We may ask them to disable it again until we get this
> > issue resolved.
>
> Noted.
>
> > This customer sends about 1.5 million requests to each load-balanced
> > server during a typical production day. Most other customers send much
> > less, often only a fraction of that. However, at least one customer sends
> > about 2 million to the same server, and they don't see the problem. (I
> > will check if they have the AI feature enabled.)
>
> Hmm. Whether that other customer has the AI feature enabled would be an
> interesting data point.

I will ask them right after I send this message. They are usually a little
slow to respond.

> >> Can we see the full stack trace please?
> >
> > Here's one example.
> >
> > java.lang.Throwable
> >     at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1252)
> >     at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1220)
>
> That is *very* interesting. That is the start of a synch block in the class
> loader. It should complete quickly. The full thread dump should tell us what
> is holding the lock. If we are lucky we'll be able to tell why the lock is
> being held for so long.
>
> We might need to reduce the time between thread dumps to figure out what the
> thread that is blocking everything is doing. We'll see.
>
> > The app has DB connection details in two places. First, it uses a database
> > connection string in a .properties file, as follows. This string handles
> > most connections to the DB.
> >
> > mobiledoc.DBUrl=jdbc:mysql://ha52a:5791
> > mobiledoc.DBName=mobiledoc_791?useSSL=false&zeroDateTimeBehavior=round&jdbcCompliantTruncation=false&cacheServerConfiguration=true&dumpQueriesOnException=true&tinyInt1isBit=false&useOldAliasMetadataBehavior=true&dontTrackOpenResources=true
> > mobiledoc.DBUser=
> > mobiledoc.DBPassword=
>
> OK. That seems unlikely to be using connection pooling although the
> application might be pooling internally.

Based on lots of previous observation, I don't think they are. The comms
between the app and DB are choppy, with only about 1-5 queries per TCP
connection. If they are pooling, they are not doing it aggressively.

> > It also has a second DB config specifically for a drug database.
> >
> > c:\out.log
> > INSERT_CONTEXT_FACTORY
> > INSERT_JNDI_URL
> > INSERT_USER_NAME
> > INSERT_PASSWORD
> > INSERT_LOOKUP_NAME
> > com.mysql.jdbc.Driver
> > jdbc:mysql://dbclust54:5791/medispan?sessionVariables=wait_timeout=28800,interactive_timeout=28800
> > redacted
> > redacted
> > 10
> > 5000
> > true
> > 0
> > 1800
>
> Hmm. There is a pool size setting there but we can't tell if it is being
> used.
>
> >> Is that Tomcat 9.0.80 as provided by the ASF?
>
> An explicit answer to this question would be helpful.

Didn't mean to seem evasive. Yes, it's from the ASF.

> In terms of the way forwar
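For illustration only: if the application really is opening a fresh connection every few queries, the usual alternative would be a pooled DataSource. Below is a minimal sketch using Tomcat's own org.apache.tomcat.jdbc.pool; the URL, credentials, and pool sizes are placeholders, and this is not a claim about how the vendor's code actually obtains its connections.

import java.sql.Connection;
import org.apache.tomcat.jdbc.pool.DataSource;
import org.apache.tomcat.jdbc.pool.PoolProperties;

// Sketch only: a pooled DataSource in place of per-request
// DriverManager.getConnection() calls. All values are placeholders.
public class PooledConnectionExample {
    public static void main(String[] args) throws Exception {
        PoolProperties p = new PoolProperties();
        p.setUrl("jdbc:mysql://ha52a:5791/mobiledoc_791?useSSL=false"); // placeholder
        p.setDriverClassName("com.mysql.jdbc.Driver");
        p.setUsername("redacted");
        p.setPassword("redacted");
        p.setInitialSize(10);
        p.setMaxActive(50);   // upper bound on concurrently borrowed connections
        p.setMaxWait(5000);   // ms to wait for a free connection before giving up
        p.setTestOnBorrow(true);
        p.setValidationQuery("SELECT 1");

        DataSource ds = new DataSource(p);
        try (Connection c = ds.getConnection()) {
            // borrowed from the pool; close() returns it to the pool instead of
            // tearing down the TCP connection, avoiding the 1-5-queries-per-
            // connection churn described above
        }
        ds.close();
    }
}

A pool would also change the failure mode: if connections could not be handed out fast enough, threads would queue on the pool and time out after maxWait rather than stalling inside the driver.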
Re: PersistentManager and ClassNotFoundException
Jakub,

On 5/24/24 09:31, Jakub Królikowski wrote:
> On Fri, May 24, 2024 at 11:23 AM Mark Thomas wrote:
>> Can you provide the simplest web application (with source) that replicates
>> the problem?
>>
>> Mark
>>
>> On 23/05/2024 23:45, Jakub Królikowski wrote:
>>> Hi, I'm working with Tomcat 10.1. When a user starts using the store in my
>>> web application, I save the ShopCart object on the "cart" session
>>> attribute. I want the "cart" attribute to return to the session after
>>> restarting the app. To enable session persistence I added to the Context.
>>> It loads the StandardManager. And this works fine - after reload / restart
>>> the object "ShopCart" is back in the session.
>>>
>>> I want to experiment with PersistentManager. The Tomcat docs say:
>>> "The persistence across restarts provided by the *StandardManager* is a
>>> simpler implementation than that provided by the *PersistentManager*. If
>>> robust, production quality persistence across restarts is required then
>>> the *PersistentManager* should be used with an appropriate configuration."
>>>
>>> I am hoping for a Listener for deserialization of the session attributes.
>>> The new Manager configuration looks like this:
>>>
>>> But it doesn't work. After restart I get this exception:
>>>
>>> java.lang.ClassNotFoundException: ShopCart
>>>     at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1332)
>>>     at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1144)
>>>     at java.base/java.lang.Class.forName0(Native Method)
>>>     at java.base/java.lang.Class.forName(Class.java:534)
>>>     at java.base/java.lang.Class.forName(Class.java:513)
>>>     at org.apache.catalina.util.CustomObjectInputStream.resolveClass(CustomObjectInputStream.java:158)
>>>     at java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2061)
>>>     at java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1927)
>>>     at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2252)
>>>     at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1762)
>>>     at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:540)
>>>     at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:498)
>>>     at org.apache.catalina.session.StandardSession.doReadObject(StandardSession.java:1198)
>>>     at org.apache.catalina.session.StandardSession.readObjectData(StandardSession.java:831)
>>>     at org.apache.catalina.session.FileStore.load(FileStore.java:203)
>>>     at org.apache.catalina.session.StoreBase.processExpires(StoreBase.java:138)
>>>     at org.apache.catalina.session.PersistentManagerBase.processExpires(PersistentManagerBase.java:409)
>>>     at org.apache.catalina.session.ManagerBase.backgroundProcess(ManagerBase.java:587)
>>>     at org.apache.catalina.core.StandardContext.backgroundProcess(StandardContext.java:4787)
>>>     at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1172)
>>>     at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1176)
>>>     at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1176)
>>>     at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1154)
>>>     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
>>>     at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
>>>     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>>>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>>>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>>>     at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:63)
>>>     at java.base/java.lang.Thread.run(Thread.java:1583)
>>>
>>> I guess this means that the two managers use the ClassLoader differently.
>>> How do I get the PersistentManager to work in this case?
>>>
>>> Best regards,
>>> --
>>> Jakub Królikowski
>
> Hi Mark,
>
> It seems to me that this can be tested on any application. In Tomcat 10.1,
> if any session attribute is an instance of a new public class (unknown to
> Tomcat and to the Tomcat class loader), implementing java.io.Serializable,
> then on reloading the application the PersistentManager (configured as in
> the first message) crashes with ClassNotFoundException. The StandardManager
> works. I don't know if this problem occurred in earlier versions of Tomcat.
> If you fail to reproduce this bug, let me know, I will prepare a simple web
> app.

Where is your configuration located? It *should* be inside your <Context>
element, located in META-INF/context.xml in your web application. If it's in
there, then everything
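For reference, a minimal sketch of the kind of session attribute involved: a public class implementing java.io.Serializable stored under the "cart" key. The class shape and field names here are assumptions based on the description above, not the actual application code; the essential point is that the class must be deployed where the webapp class loader (the one CustomObjectInputStream uses in the stack trace) can find it, e.g. under WEB-INF/classes or in a jar in WEB-INF/lib.

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Sketch only: an illustrative Serializable session attribute. Field and
// method names are assumptions, not taken from the real application.
public class ShopCart implements Serializable {
    private static final long serialVersionUID = 1L;

    private final List<String> itemIds = new ArrayList<>();

    public void addItem(String itemId) {
        itemIds.add(itemId);
    }

    public List<String> getItems() {
        return itemIds;
    }
}

// Typically stored from a servlet or controller, e.g.:
//   request.getSession().setAttribute("cart", new ShopCart());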