Re: Database Connection Requests Initiated but Not Sent on the Wire (Some, Not All)

2024-05-28 Thread Mark Thomas

Hi Eric,

I have a some follow-up questions in-line. I have also read the other 
messages in this thread and added a couple of additional questions based 
on what I read in those threads.



On 26/05/2024 02:58, Eric Robinson wrote:

One of our hosting customers is a medical practice using a commercial EMR 
running on tomcat+mysql. It has operated well for over a year, but users have 
suddenly begun experiencing slowness for about an hour at the same time every 
day.


What time does this problem start?

Does it occur every day of the week including weekends?

How does the slowness correlate to:
- request volume
- requests to any particular URL(s)?
- requests from any particular client IP?
- any other attribute of the request?

(I'm trying to see if there is something about the requests that 
triggers the issue.)



During the slow times, we've done all the usual troubleshooting to catch the 
problem in the act. The servers have plenty of power and are not overworked. 
There are no slow database queries. Network connectivity is solid. Tomcat has 
plenty of memory. The numbers of database connections, threads, questions, 
queries, etc., remain steady, without spikes. There is no unusual disk latency. 
We have not found any maintenance tasks running during that timeframe.


I would usually suggest taking three thread dumps approximately 5s apart 
and then diffing them to try and spot "slow moving" threads.


I see you have scripted trigger a thread dump when the slowness hits. If 
you haven't already, please configure it to capture (at least) 3 dumps 
~5 seconds apart.


(If we can spot the slow moving threads we might be able to identify 
what it is that makes them slow moving.)



The customer has another load-balanced tomcat instance on a different physical 
server, and the problem happens on that one, too. The servers were upgraded 
with a new kernel and packages on 4/5/24, but the issue did not appear until 
5/6/24. The vendor enabled a new feature in the customer's software, and the 
problem appeared the next day, but they subsequently disabled the feature, and 
(reportedly) the problem did not go away.


Have you confirmed that the feature really is disabled? Or was it just 
hidden?


Has this feature been enabled for any other customers? If yes, have they 
experienced similar issues?


(It is suspicious that the issue occurred after the feature was 
disabled. I wonder if some elements of that change (e.g. a database 
change) are still in place and causing issues.)



It is worth mentioning that the servers are multi-tenanted, with other 
customers running the same medical application, but the others do not 
experience the slowdowns, even though they are on the same servers.


How does this customer compare, in terms of volume of requests, to other 
customers that are not experiencing this issue.


Is there anything unique or special about the customer experiencing the 
issue? Do they have some custom settings no-one else uses?


(I am trying to figure out if the issue is load related, customer 
specific or something else).



There are no unusual errors in the tomcat or database server logs, EXCEPT this 
one: Java.sql.DriverManager.getConnection


Can we see the full stack trace please.


During the periods of slowness, we see lots of those errors along with a large 
spike in the number of stuck tomcat threads (from 1 or 2 to as high as 100). It 
seems obvious that the threads are stuck because tomcat is waiting on a 
connection to the database. However, tcpdump shows that connectivity to the 
database is perfect at the network and application layers. There are no 
unanswered SYNs, no retransmissions, no half-open connections, no failures to 
allocate TCP ports, no conntrack messages, and no other indications of system 
resource exhaustion. Every time tomcat requests a connection to the DB, it 
completes in less than 1 ms. Ten thousand connection attempts completed 
successfully in about 15 seconds, with zero failures.


It sounds like things might be getting stuck somewhere in or near the 
JDBC driver.


Can you provide the exact version of the JDBC driver you are using?

Can you provide the full database configuration from context.xml (or 
wherever it is configured). Please redact sensitive information such as 
passwords.



We are forced to conclude that some database connection requests are being 
initiated but are not being sent on the wire. The problem seems to be in the 
interaction between tomcat and the database driver, or in the driver itself.


I agree.


Unfortunately, the application vendor is taking the "it's your infrastructure" 
position without providing any evidence or offering suggestions for configuration changes,


I'm sorry to hear that. We'll do what we can to help.


other than to deploy more tomcat instances, which is just shooting in the dark. 
They don't know why the software is throwing 
java.sql.DriverManager.getConnection errors (even though it's their code), and 
they'

RE: Database Connection Requests Initiated but Not Sent on the Wire (Some, Not All)

2024-05-28 Thread Eric Robinson
Hi Mark,

> -Original Message-
> From: Mark Thomas 
> Sent: Tuesday, May 28, 2024 3:42 AM
> To: users@tomcat.apache.org
> Subject: Re: Database Connection Requests Initiated but Not Sent on the Wire
> (Some, Not All)
>
> Hi Eric,
>
> I have a some follow-up questions in-line. I have also read the other 
> messages in
> this thread and added a couple of additional questions based on what I read in
> those threads.
>
>
> On 26/05/2024 02:58, Eric Robinson wrote:
> > One of our hosting customers is a medical practice using a commercial EMR
> running on tomcat+mysql. It has operated well for over a year, but users have
> suddenly begun experiencing slowness for about an hour at the same time
> every day.
>
> What time does this problem start?
>

It typically starts around 9:15 am EDT and goes until around 10:30 am.

> Does it occur every day of the week including weekends?
>

Most weekdays. There have been 1 or 2 weekdays when it seems that symptom 
inexplicably did not appear. I'm not sure about weekends, as the medical 
practice does not work on those days.

> How does the slowness correlate to:
> - request volume
> - requests to any particular URL(s)?
> - requests from any particular client IP?
> - any other attribute of the request?
>

> (I'm trying to see if there is something about the requests that triggers the
> issue.)
>

We have not seen anything stand out. There are no apparent spikes in request 
volume. The slowness appears to impact all parts of the system (meaning all 
URLs). It manifests for the customer, but we have also seen it when we connect 
to the app internally, behind the firewall and reverse proxy, directly to the 
tomcat server from a workstation connected to the same switch.

> > During the slow times, we've done all the usual troubleshooting to catch the
> problem in the act. The servers have plenty of power and are not overworked.
> There are no slow database queries. Network connectivity is solid. Tomcat has
> plenty of memory. The numbers of database connections, threads, questions,
> queries, etc., remain steady, without spikes. There is no unusual disk 
> latency.
> We have not found any maintenance tasks running during that timeframe.
>
> I would usually suggest taking three thread dumps approximately 5s apart and
> then diffing them to try and spot "slow moving" threads.
>

> I see you have scripted trigger a thread dump when the slowness hits. If you
> haven't already, please configure it to capture (at least) 3 dumps
> ~5 seconds apart.
>
> (If we can spot the slow moving threads we might be able to identify what it 
> is
> that makes them slow moving.)
>

We finished and implemented the script yesterday, so today will be the first 
day that it produces results. It watches the catalina.out file for stuck thread 
detection warnings. When the number of stuck threads exceeds a threshold, then 
it starts doing thread dumps every 60 seconds until the counts drops back down 
below the threshold. The users typically do not complain of slowness until the 
stuck thread count exceeds 20, and during that time the threads often take up 
to a minute or more to complete. It's too late today to change the timings, but 
if it does not produce any actionable intel, we can adjust them tonight.

> > The customer has another load-balanced tomcat instance on a different
> physical server, and the problem happens on that one, too. The servers were
> upgraded with a new kernel and packages on 4/5/24, but the issue did not
> appear until 5/6/24. The vendor enabled a new feature in the customer's
> software, and the problem appeared the next day, but they subsequently
> disabled the feature, and (reportedly) the problem did not go away.
>
> Have you confirmed that the feature really is disabled? Or was it just hidden?
>

The vendor claims that the feature uses a different server and does not send 
requests to the slow ones, so it has been re-enabled at the customer's request. 
We may ask them to disable it again until we get this issue resolved.

> Has this feature been enabled for any other customers? If yes, have they
> experienced similar issues?
>

> (It is suspicious that the issue occurred after the feature was disabled. I 
> wonder
> if some elements of that change (e.g. a database
> change) are still in place and causing issues.)
>

We agree that it is suspicious, but at this point we are forced to give it the 
side-eye. We're not aware of other customers being impacted, but (a) it's a new 
AI-based feature, so not many other customers have it, (b) it is enabled by the 
vendor directly, so we are not in the notification loop, and (c) the problem 
customer is large, with about 800 staff, whereas most other customers are much 
and might not trigger the symptom. Bottom line, we're not *sure*, but we think 
the feature is unrelated, but we'll ask them to disable it anyway.

> > It is worth mentioning that the servers are multi-tenanted, with other
> customers running the same medical app

Re: Database Connection Requests Initiated but Not Sent on the Wire (Some, Not All)

2024-05-28 Thread Mark Thomas

Hi Eric,

Follow-up observsations and comments in-line.


What time does this problem start?


It typically starts around 9:15 am EDT and goes until around 10:30 am.


Does that match the time of highest request load from the customer? 
Rather than a spike, I'm wondering if the problem is triggered once load 
exceeds some threshold.



We finished and implemented the script yesterday, so today will be the first 
day that it produces results. It watches the catalina.out file for stuck thread 
detection warnings. When the number of stuck threads exceeds a threshold, then 
it starts doing thread dumps every 60 seconds until the counts drops back down 
below the threshold. The users typically do not complain of slowness until the 
stuck thread count exceeds 20, and during that time the threads often take up 
to a minute or more to complete. It's too late today to change the timings, but 
if it does not produce any actionable intel, we can adjust them tonight.


Lets see what that produces and go from there.


The vendor claims that the feature uses a different server and does not send 
requests to the slow ones, so it has been re-enabled at the customer's request. 
We may ask them to disable it again until we get this issue resolved.


Noted.


This customer sends about 1.5 million requests to each load-balanced server 
during a typical production day. Most other customers send much less, often 
only a fraction of that. However, at least one customer sends about 2 million 
to the same server, and they don't see the problem. (I will check if they have 
the AI feature enabled.)


Hmm. Whether that other customer has the AI feature enabled would be an 
interesting data point.



Can we see the full stack trace please.


Here's one example.





 java.lang.Throwable
 at 
org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1252)
 at 
org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1220)




That is *very* interesting. That is the start of a synch block in the 
class loader. It should complete quickly. The full thread dump should 
tell us what is holding the lock. If we are lucky we'll be able to tell 
why the lock is being held for so long.


We might need to reduce the time between thread dumps to figure out what 
the thread that is blocking everything is doing. We'll see.



The app has DB connection details in two places. First, it uses a database 
connection string in a .properties file, as follows. This string handles most 
connections to the DB.

mobiledoc.DBUrl=jdbc:mysql://ha52a:5791
mobiledoc.DBName=mobiledoc_791?useSSL=false&zeroDateTimeBehavior=round&jdbcCompliantTruncation=false&cacheServerConfiguration=true&dumpQueriesOnException=true&tinyInt1isBit=false&useOldAliasMetadataBehavior=true&dontTrackOpenResources=true
mobiledoc.DBUser=
mobiledoc.DBPassword=


OK. That seems unlikely to be using connection pooling although the 
application might be pooling internally.



It also has second DB config specifically for a drug database.


   
   
 
 
 
 c:\out.log
   
   
   
   
   
 
 INSERT_CONTEXT_FACTORY
 INSERT_JNDI_URL
 INSERT_USER_NAME
 INSERT_PASSWORD
 INSERT_LOOKUP_NAME
 com.mysql.jdbc.Driver
 
jdbc:mysql://dbclust54:5791/medispan?sessionVariables=wait_timeout=28800,interactive_timeout=28800
 redacted
 redacted
 10
 5000
   
   
   
   
 true
 0
 1800
   



Hmm. There is a pool size setting there but we can't tell if it is being 
used.



Is that Tomcat 9.0.80 as provided by the ASF?


An explicit answer to this question would be helpful.

In terms of the way forward, we need to see to thread dumps when the 
problem is happening to figure out where the blockage is happening and 
(hopefully) why.


Mark

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org



RE: Database Connection Requests Initiated but Not Sent on the Wire (Some, Not All)

2024-05-28 Thread Eric Robinson
Hi Mark,

See comments below.


> -Original Message-
> From: Mark Thomas 
> Sent: Tuesday, May 28, 2024 9:32 AM
> To: Tomcat Users List 
> Subject: Re: Database Connection Requests Initiated but Not Sent on the Wire
> (Some, Not All)
>
> Hi Eric,
>
> Follow-up observsations and comments in-line.
>
> >> What time does this problem start?
> >
> > It typically starts around 9:15 am EDT and goes until around 10:30 am.
>
> Does that match the time of highest request load from the customer?
> Rather than a spike, I'm wondering if the problem is triggered once load
> exceeds some threshold.
>

My nginx proxy console only shows live activity and does not keep a history, 
but I can probably script something to parse the localhost_access logs and 
graph request counts on a per-minute basis. Will work on that.

> > We finished and implemented the script yesterday, so today will be the first
> day that it produces results. It watches the catalina.out file for stuck 
> thread
> detection warnings. When the number of stuck threads exceeds a threshold,
> then it starts doing thread dumps every 60 seconds until the counts drops back
> down below the threshold. The users typically do not complain of slowness 
> until
> the stuck thread count exceeds 20, and during that time the threads often take
> up to a minute or more to complete. It's too late today to change the timings,
> but if it does not produce any actionable intel, we can adjust them tonight.
>
> Lets see what that produces and go from there.
>

Took a bunch of thread and heap dumps during today's painful debacle. Will send 
a link to those as soon as I can.

> > The vendor claims that the feature uses a different server and does not send
> requests to the slow ones, so it has been re-enabled at the customer's 
> request.
> We may ask them to disable it again until we get this issue resolved.
>
> Noted.
>
> > This customer sends about 1.5 million requests to each load-balanced
> > server during a typical production day. Most other customers send much
> > less, often only a fraction of that. However, at least one customer
> > sends about 2 million to the same server, and they don't see the
> > problem. (I will check if they have the AI feature enabled.)
>
> Hmm. Whether that other customer has the AI feature enabled would be an
> interesting data point.

I will ask them right after I send this message. They are usually a little slow 
to respond.

>
> >> Can we see the full stack trace please.
> >
> > Here's one example.
>
> 
>
> >  java.lang.Throwable
> >  at
> org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoa
> derBase.java:1252)
> >  at
> > org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClass
> > LoaderBase.java:1220)
>
> 
>
> That is *very* interesting. That is the start of a synch block in the class 
> loader. It
> should complete quickly. The full thread dump should tell us what is holding 
> the
> lock. If we are lucky we'll be able to tell why the lock is being held for so 
> long.
>
> We might need to reduce the time between thread dumps to figure out what
> the thread that is blocking everything is doing. We'll see.
>
> > The app has DB connection details in two places. First, it uses a database
> connection string in a .properties file, as follows. This string handles most
> connections to the DB.
> >
> > mobiledoc.DBUrl=jdbc:mysql://ha52a:5791
> >
> mobiledoc.DBName=mobiledoc_791?useSSL=false&zeroDateTimeBehavior=rou
> nd
> >
> &jdbcCompliantTruncation=false&cacheServerConfiguration=true&dumpQueri
> > esOnException=true&tinyInt1isBit=false&useOldAliasMetadataBehavior=tru
> > e&dontTrackOpenResources=true
> > mobiledoc.DBUser=
> > mobiledoc.DBPassword=
>
> OK. That seems unlikely to be using connection pooling although the 
> application
> might be pooling internally.
>

Based on lots of previous observation, I don't think they are. The comms 
between the app and DB are choppy, with only about 1-5 queries per TCP 
connection. If they are pooling, they are not doing it aggressively.

> > It also has second DB config specifically for a drug database.
> >
> > 
> >
> >
> >  
> >  
> >  
> >  c:\out.log
> >
> >
> >
> >
> >
> >  
> >
> INSERT_CONTEXT_FACTORY FACTORY>
> >  INSERT_JNDI_URL
> >  INSERT_USER_NAME
> >  INSERT_PASSWORD
> >  INSERT_LOOKUP_NAME
> >  com.mysql.jdbc.Driver
> >
> jdbc:mysql://dbclust54:5791/medispan?sessionVariables=wait_timeout=2
> 8800,interactive_timeout=28800
> >  redacted
> >  redacted
> >  10
> >  5000
> >
> >
> >
> >
> >  true
> >  0
> >  1800
> >
> > 
>
> Hmm. There is a pool size setting there but we can't tell if it is being used.
>
> >> Is that Tomcat 9.0.80 as provided by the ASF?
>
> An explicit answer to this question would be helpful.
>

Didn't mean to seem evasive. Yes, it's from the ASF.


> In terms of the way forwar

Re: PersistentManager and ClassNotFoundException

2024-05-28 Thread Christopher Schultz

Jakub,

On 5/24/24 09:31, Jakub Królikowski wrote:

On Fri, May 24, 2024 at 11:23 AM Mark Thomas  wrote:


Can you provide the simplest web application (with source) that
replications the problem?

Mark


On 23/05/2024 23:45, Jakub Królikowski wrote:

Hi,

I'm working with Tomcat 10.1.

When a user starts using the store in my web application, I save the
ShopCart object on the "cart" session attribute.
I want the "cart" attributes to return to the session after restarting

the

app.


To enable session persistence I added



to the Context. It loads the StandardManager.

And this works fine - after reload / restart the object "ShopCart" is

back

in the session.



I want to experiment with PersistentManager. Tomcat docs says: "
The persistence across restarts provided by the *StandardManager* is a
simpler implementation than that provided by the *PersistentManager*. If
robust, production quality persistence across restarts is required then

the

*PersistentManager* should be used with an appropriate configuration.

"

I hope for a Listener of deserialization of the session attributes.

The new Manager configuration looks like this:







But it doesn't work. After restart I get this exception:


java.lang.ClassNotFoundException: ShopCart

at


org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1332)


at


org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1144)


at java.base/java.lang.Class.forName0(Native Method)

at java.base/java.lang.Class.forName(Class.java:534)

at java.base/java.lang.Class.forName(Class.java:513)

at


org.apache.catalina.util.CustomObjectInputStream.resolveClass(CustomObjectInputStream.java:158)


at
java.base/java.io

.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2061)


at
java.base/java.io

.ObjectInputStream.readClassDesc(ObjectInputStream.java:1927)


at
java.base/java.io

.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2252)


at
java.base/java.io

.ObjectInputStream.readObject0(ObjectInputStream.java:1762)


at
java.base/java.io

.ObjectInputStream.readObject(ObjectInputStream.java:540)


at
java.base/java.io

.ObjectInputStream.readObject(ObjectInputStream.java:498)


at


org.apache.catalina.session.StandardSession.doReadObject(StandardSession.java:1198)


at


org.apache.catalina.session.StandardSession.readObjectData(StandardSession.java:831)


at org.apache.catalina.session.FileStore.load(FileStore.java:203)

at

org.apache.catalina.session.StoreBase.processExpires(StoreBase.java:138)


at


org.apache.catalina.session.PersistentManagerBase.processExpires(PersistentManagerBase.java:409)


at


org.apache.catalina.session.ManagerBase.backgroundProcess(ManagerBase.java:587)


at


org.apache.catalina.core.StandardContext.backgroundProcess(StandardContext.java:4787)


at


org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1172)


at


org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1176)


at


org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.processChildren(ContainerBase.java:1176)


at


org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1154)


at


java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)






at


java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)


at


java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)


at


java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)


at


java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)


at


org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:63)


at java.base/java.lang.Thread.run(Thread.java:1583)


I guess this means that the two managers use ClassLoader differently.
How to get the PersistentManager to work in this case?

Best regards,
--
Jakub Królikowski



-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org




Hi Mark,
It seems to me that this can be tested on any application.
In Tomcat 10.1, if any session attribute is an instance of a new public
class (unknown to Tomcat and to Tomcat class loader), implementing
java.io.Serializable,
then on reloading the application PersistanceManager (configured as in the
first message) crashes with ClassNotFoundException. StandardManager works.
I don't know if this problem occurred in earlier versions of Tomcat.

If you fail to reproduce this bug, let me know, I will prepare a simple web
app.


Where is your  configuration located? It *should* be inside 
your  located in META-INF/context.xml in your web application. 
If it's in there, then everything