On Thu, Nov 12, 2020 at 10:32 AM Mark Thomas <ma...@apache.org> wrote:
> On 11/11/2020 22:37, Rémy Maucherat wrote:
> > On Wed, Nov 11, 2020 at 9:44 PM <ma...@apache.org> wrote:
> >
> >> This is an automated email from the ASF dual-hosted git repository.
> >>
> >> markt pushed a commit to branch master
> >> in repository https://gitbox.apache.org/repos/asf/tomcat.git
> >>
> >>
> >> The following commit(s) were added to refs/heads/master by this push:
> >>      new 45aeed6  Fix NIO concurrency issue that removes connections
> >>                   from the poller.
> >> 45aeed6 is described below
> >>
> >> commit 45aeed655771308d5185d9dbab8e29a73d87509b
> >> Author: Mark Thomas <ma...@apache.org>
> >> AuthorDate: Wed Nov 11 20:43:04 2020 +0000
> >>
> >>     Fix NIO concurrency issue that removes connections from the poller.
> >>
> >>     This is the source of the intermittent WebSocket test failure so
> >>     this commit also removes the associated debug code for that issue.
> >>
> >
> > Great fix. I never expected this one ...
>
> Thanks. It took me long enough to find it.
>
> It only occurred every 1 in ~15 test runs. When a full test run takes
> ~8.5 mins it is a slow process. I was trying to narrow down the set of
> tests that triggered it but it was hard to determine if the failure was
> still being triggered. After about a day of getting nowhere I decided to
> start from the other end and ran the single test in a loop until it
> failed. That meant I could reproduce the failure in less than a minute.
> Things moved a lot faster from that point.
>
> Once I could reproduce the issue, it was just a case of adding debug
> statements to track down the root cause. Some of those statements
> altered the timing enough to prevent the failure but even that helped as
> it meant the issue was occurring after that point.
>
> The root cause surprised me as well. I'd suspected some sort of issue
> along these lines and had been looking at the source code during the
> longer test runs. Knowing NioChannel instances were being re-used I'd
> explicitly looked for places where this sort of mix-up could happen and
> completely failed to find this one.
>
> It will be interesting to see if any other intermittent issues disappear
> suggesting they had the same root cause.
>

That's interesting, I'll review some more with that closed channel
behavior in mind.

Rémy
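
For anyone wanting to try the "run the single test in a loop until it failed" approach described above, here is a minimal sketch in Python. It is not the script used in this thread: the helper name `run_until_failure` and its `max_runs` parameter are hypothetical, and the command you pass in is whatever runs your one suspect test.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=100):
    """Run cmd repeatedly until it fails.

    Returns the 1-based number of the first run that exited with a
    non-zero status, or None if all max_runs runs passed.
    (Hypothetical helper, for illustration only.)
    """
    for attempt in range(1, max_runs + 1):
        # Capture output so passing runs stay quiet.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Show the output of the failing run only.
            sys.stderr.write(result.stdout + result.stderr)
            return attempt
    return None
```

Calling `run_until_failure([...])` with the command line for the single test stops at the first failing run, so the logs of that run are the last thing written. Bear in mind the point made above about timing: extra debug output in the test itself can hide an intermittent failure, so a clean pass through the loop does not prove the bug is gone.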