On Mon, Oct 19, 2020 at 3:11 PM Michael Osipov <micha...@apache.org> wrote:
> On 2020-10-15 at 16:48, Rémy Maucherat wrote:
> > On Thu, Oct 15, 2020 at 2:31 PM Michael Osipov <1983-01...@gmx.net> wrote:
> >
> >>> On Wed, Oct 14, 2020 at 6:32 PM Michael Osipov <micha...@apache.org> wrote:
> >>>
> >>>> On 2020-10-14 at 12:32, Rémy Maucherat wrote:
> >>>>> On Tue, Oct 13, 2020 at 8:27 PM Michael Osipov <micha...@apache.org> wrote:
> >>>>>
> >>>>>> On 2020-10-13 at 16:05, r...@apache.org wrote:
> >>>>>>> This is an automated email from the ASF dual-hosted git repository.
> >>>>>>>
> >>>>>>> remm pushed a commit to branch 8.5.x
> >>>>>>> in repository https://gitbox.apache.org/repos/asf/tomcat.git
> >>>>>>>
> >>>>>>>
> >>>>>>> The following commit(s) were added to refs/heads/8.5.x by this push:
> >>>>>>>      new 883600b  Always retry on a new connection, even when pooling
> >>>>>>> 883600b is described below
> >>>>>>>
> >>>>>>> commit 883600b8a77c0be93b3a8dc2404e8d1110970ee7
> >>>>>>> Author: remm <r...@apache.org>
> >>>>>>> AuthorDate: Tue Oct 13 14:19:54 2020 +0200
> >>>>>>>
> >>>>>>>     Always retry on a new connection, even when pooling
> >>>>>>>
> >>>>>>>     This keeps the same very simple design as for the single connection
> >>>>>>>     scenario, for now.
> >>>>>>> ---
> >>>>>>>  java/org/apache/catalina/realm/JNDIRealm.java | 22 +++++++++++++++++++---
> >>>>>>>  webapps/docs/changelog.xml                    |  5 +++++
> >>>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/java/org/apache/catalina/realm/JNDIRealm.java b/java/org/apache/catalina/realm/JNDIRealm.java
> >>>>>>> index 72087ab..98007f7 100644
> >>>>>>> --- a/java/org/apache/catalina/realm/JNDIRealm.java
> >>>>>>> +++ b/java/org/apache/catalina/realm/JNDIRealm.java
> >>>>>>> @@ -1311,7 +1311,7 @@ public class JNDIRealm extends RealmBase {
> >>>>>>>              close(connection);
> >>>>>>>
> >>>>>>>              // open a new directory context.
> >>>>>>> -            connection = get();
> >>>>>>> +            connection = get(true);
> >>>>>>>
> >>>>>>>              // Try the authentication again.
> >>>>>>>              principal = authenticate(connection, username, credentials);
> >>>>>>> @@ -2389,12 +2389,28 @@ public class JNDIRealm extends RealmBase {
> >>>>>>>       * @exception NamingException if a directory server error occurs
> >>>>>>>       */
> >>>>>>>      protected JNDIConnection get() throws NamingException {
> >>>>>>> +        return get(false);
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    /**
> >>>>>>> +     * Open (if necessary) and return a connection to the configured
> >>>>>>> +     * directory server for this Realm.
> >>>>>>> +     * @param create when pooling, this forces creation of a new connection,
> >>>>>>> +     *  for example after an error
> >>>>>>> +     * @return the connection
> >>>>>>> +     * @exception NamingException if a directory server error occurs
> >>>>>>> +     */
> >>>>>>> +    protected JNDIConnection get(boolean create) throws NamingException {
> >>>>>>>          JNDIConnection connection = null;
> >>>>>>>          // Use the pool if available, otherwise use the single connection
> >>>>>>>          if (connectionPool != null) {
> >>>>>>> -            connection = connectionPool.pop();
> >>>>>>> -            if (connection == null) {
> >>>>>>> +            if (create) {
> >>>>>>>                  connection = create();
> >>>>>>> +            } else {
> >>>>>>> +                connection = connectionPool.pop();
> >>>>>>> +                if (connection == null) {
> >>>>>>> +                    connection = create();
> >>>>>>> +                }
> >>>>>>>              }
> >>>>>>>          } else {
> >>>>>>>              singleConnectionLock.lock();
> >>>>>>
> >>>>>> That's a suitable and simple approach.
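For context, the retry path that the first hunk enables boils down to the fragment below. This is a simplified illustration of the pattern, not the exact JNDIRealm code; get(), get(true), authenticate() and close() are the methods shown in the diff, and error handling is trimmed.

    // Simplified fragment of the retry flow inside JNDIRealm.authenticate():
    // on a failure, the possibly broken connection is discarded and the
    // authentication is retried once on a brand new connection, which
    // get(true) now creates even when pooling is enabled.
    Principal principal = null;
    JNDIConnection connection = get();
    try {
        principal = authenticate(connection, username, credentials);
    } catch (NamingException e) {
        // Do not return the suspect connection to the pool.
        close(connection);
        // Open a new directory context, bypassing the pool.
        connection = get(true);
        // Try the authentication again.
        principal = authenticate(connection, username, credentials);
    }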
> >>>>>
> >>>>> If you have the code for adding a max idle check on hand and tested,
> >>>>> you should add it IMO, it will be more efficient.
> >>>>
> >>>> I will need to give this a couple more weeks of testing. This is what I
> >>>> have observed today:
> >>>>> 2020-10-14T16:01:47.039 147.54.155.198 w99sezx0... "GET /x2tc-proxy-bln/rest/info/targetSystem HTTP/1.1" 200 8 92132
> >>>>>
> >>>>> 20609 2020-10-14T16:00:14 FEIN [https-openssl-apr-8444-exec-166] net.sf.michaelo.tomcat.realm.ActiveDirectoryRealm.acquire Acquiring directory server connection from pool
> >>>>> 20610 2020-10-14T16:00:14 FEIN [https-openssl-apr-8444-exec-166] net.sf.michaelo.tomcat.realm.ActiveDirectoryRealm.acquire Directory server connection from pool exceeds max idle time, closing it
> >>>>> 20611 2020-10-14T16:00:14 FEIN [https-openssl-apr-8444-exec-166] net.sf.michaelo.tomcat.realm.ActiveDirectoryRealm.close Closing directory server connection
> >>>>> 20612 2020-10-14T16:00:14 FEIN [https-openssl-apr-8444-exec-166] net.sf.michaelo.tomcat.realm.ActiveDirectoryRealm.open Opening new directory server connection
> >>>>> 20613 2020-10-14T16:01:47 FEIN [https-openssl-apr-8444-exec-166] net.sf.michaelo.tomcat.realm.ActiveDirectoryRealm.getUser Searching for username 'w99sezx0' in base ...
> >>>>
> >>>> As you can see, it took 90 seconds to serve the request because the
> >>>> connection had expired and close took way too long. On average the
> >>>> request takes:
> >>>>> 2020-10-14T13:57:06.730 10.81.50.232 osipovmi@... "GET /x2tc-proxy-bln/rest/info/targetSystem HTTP/1.1" 200 8 70
> >>>>
> >>>> when the connection is healthy.
> >>>
> >>> Ok, so there's some real incentive to avoid reusing a connection that was
> >>> idle for too long.
> >>
> >> I made further analysis. I was partially wrong about my statement. The
> >> cause for the 90 s was a faulty KDC which did not respond in time for a
> >> service ticket. Java Kerberos does 3 retries with a 30 s timeout. I have
> >> set it to 1 retry and a 1000 ms wait until the next KDC is tried.
> >>
> >> Anyways, I am still convinced that some idle timeout is the right choice
> >> to avoid resource depletion on both sides. If hundreds of clients keep
> >> tens of connections open to a directory server for no reason, sooner or
> >> later everything will stall. I will let you know in a couple of weeks.
> >
> > Yes, no problem, it sounds like a good idea (with a good default).
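For illustration, an idle-time check like the one in the log above could be wired into the pooled path of JNDIRealm roughly as in the sketch below. connectionPool, pop(), create() and close() are the members shown in the diff; lastUsed (timestamp of the last release) and maxIdleTime (milliseconds), as well as the getFromPool() name itself, are assumptions made only for this sketch, not existing code.

    // Sketch only: skip pooled connections that have been idle for too long.
    protected JNDIConnection getFromPool() throws NamingException {
        JNDIConnection connection = connectionPool.pop();
        while (connection != null
                && System.currentTimeMillis() - connection.lastUsed > maxIdleTime) {
            // Exceeds the max idle time: close it instead of reusing it,
            // then look at the next connection on the stack.
            close(connection);
            connection = connectionPool.pop();
        }
        if (connection == null) {
            connection = create();
        }
        return connection;
    }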
> >> But related to this, I have made a very interesting observation when the
> >> webapp is shut down:
> >>> 2020-10-15T14:24:15.120 WARNUNG [deblndw024v...-startStop-2] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [x2tc-proxy-dsv] appears to have started a thread named [Thread-7] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
> >>>  java.lang.Object.wait(Native Method)
> >>>  java.lang.Object.wait(Object.java:502)
> >>>  com.sun.jndi.ldap.Connection.pauseReader(Connection.java:804)
> >>>  com.sun.jndi.ldap.Connection.run(Connection.java:944)
> >>>  java.lang.Thread.run(Thread.java:748)
> >>
> >> When you look at how LdapCtx#close() is implemented, you'll see that it
> >> holds a reference count to all naming enumerations. As long as they
> >> aren't really closed, the connection is not closed immediately. I need
> >> to analyze why this is happening. The result could be that stopping and
> >> starting an application could drown the entire VM sooner or later.
> >
> > That's not nice at all. I suppose it means the context classloader must
> > be switched when opening the ldap connection or it could leak. This is
> > of course going to be less of an issue with a single connection. I'll
> > need to fix it.
>
> I think you don't need to do anything but one thing. I will explain why.
> I did further testing and reviewed my code. All private LDAP classes in
> com.sun have a debug flag which can only be enabled when recompiled. I
> did that and put the JAR on the boot classpath. The crucial point is that
> immediately closing a connection requires *all* NamingEnumerations to be
> closed properly. A reference counter is held and if that counter is not
> 0, the connection will be scheduled to be closed at some point in time;
> here is your webapp classloader leak:
> > LdapCtx.removeUnsolicited: false
> > LdapCtx: calling clnt.close() com.sun.jndi.ldap.LdapCtx@34d8f0
> > LdapClient: com.sun.jndi.ldap.LdapClient@1b223fc
> > LdapClient: close() called: 1
> > java.lang.Throwable
> >         at com.sun.jndi.ldap.LdapClient.close(LdapClient.java:433)
> >         at com.sun.jndi.ldap.LdapCtx.closeConnection(LdapCtx.java:2856)
> >         at com.sun.jndi.ldap.LdapCtx.close(LdapCtx.java:2641)
> >
> > 2020-10-19T14:40:34.393 WARNUNG [https-openssl-apr-8444-exec-215] org.apache.catalina.loader.WebappClassLoaderBase.clearReferencesThreads The web application [x2tc-proxy-bln] appears to have started a thread named [Thread-1671] but has failed to stop it. This is very likely to create a memory leak. Stack trace of thread:
> >  java.net.SocketInputStream.socketRead0(Native Method)
> >  java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> >  java.net.SocketInputStream.read(SocketInputStream.java:172)
> >  java.net.SocketInputStream.read(SocketInputStream.java:141)
> >  java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> >  java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> >  java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> >  com.sun.jndi.ldap.sasl.SaslInputStream.readFully(SaslInputStream.java:166)
> >  com.sun.jndi.ldap.sasl.SaslInputStream.fill(SaslInputStream.java:123)
> >  com.sun.jndi.ldap.sasl.SaslInputStream.read(SaslInputStream.java:90)
> >  com.sun.jndi.ldap.Connection.run(Connection.java:848)
> >  java.lang.Thread.run(Thread.java:748)
>
> It should say "called: 0" when properly implemented. This means that all
> naming enumeration instances have been closed properly before closing the
> connection. Changing the class loader will just move the problem and not
> solve it.
>
> In my case the issue is within an application. It uses Spring's
> PooledContextSource and retrieves a few attributes for a user from
> Active Directory. I have either incorrectly implemented my anonymous
> AbstractContextMapper or it is a bug in Spring LDAP.
>
> So all you have to do is review NamingEnumeration#close() calls.
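For reference, the pattern to check for is the usual try/finally around every enumeration, so that LdapCtx's reference count drops to zero and close() can actually release the connection. The sketch below is a generic illustration of that rule, not a quote from the realm or application code; findUser() and its parameters are made up for the example.

    import javax.naming.NamingEnumeration;
    import javax.naming.NamingException;
    import javax.naming.directory.DirContext;
    import javax.naming.directory.SearchControls;
    import javax.naming.directory.SearchResult;

    public final class LdapSearchExample {

        // Every NamingEnumeration must be closed in a finally block, otherwise
        // LdapCtx keeps a non-zero reference count and defers closing the
        // underlying connection (the "close() called: 1" case shown above).
        public static SearchResult findUser(DirContext ctx, String base, String filter)
                throws NamingException {
            NamingEnumeration<SearchResult> results =
                    ctx.search(base, filter, new SearchControls());
            try {
                return results.hasMore() ? results.next() : null;
            } finally {
                results.close();
            }
        }
    }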
It looks to me that the NamingEnumeration is closed (in a finally block
everywhere). The context is being stopped, but indeed the connections that
have been created by the LDAP impl behind the back of the container were
created with its context CL. So the CL signals a possible leak. But there's
probably something more interesting there: since this is a stack, connections
might not be used if they are at the bottom. It could be a good plan to try
to close all other connections from the stack when there's an error on one.

Rémy
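For illustration, closing the rest of the pooled connections after an error could look roughly like the sketch below, called from the catch block of the retry path before get(true). connectionPool, pop() and close() are the existing members from the diff; closePooledConnections() itself is a hypothetical helper, not an existing JNDIRealm method.

    // Sketch only: when one connection fails, treat the remaining pooled
    // connections as suspect and close them instead of handing them out later.
    protected void closePooledConnections() {
        if (connectionPool != null) {
            JNDIConnection connection = connectionPool.pop();
            while (connection != null) {
                close(connection);
                connection = connectionPool.pop();
            }
        }
    }

The loop simply drains the stack; subsequent calls to get() would then repopulate the pool with fresh connections on demand.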