On Thu, Nov 12, 2020 at 10:32 AM Mark Thomas <ma...@apache.org> wrote:

> On 11/11/2020 22:37, Rémy Maucherat wrote:
> > On Wed, Nov 11, 2020 at 9:44 PM <ma...@apache.org> wrote:
> >
> >> This is an automated email from the ASF dual-hosted git repository.
> >>
> >> markt pushed a commit to branch master
> >> in repository https://gitbox.apache.org/repos/asf/tomcat.git
> >>
> >>
> >> The following commit(s) were added to refs/heads/master by this push:
> >>      new 45aeed6  Fix NIO concurrency issue that removes connections from
> >> the poller.
> >> 45aeed6 is described below
> >>
> >> commit 45aeed655771308d5185d9dbab8e29a73d87509b
> >> Author: Mark Thomas <ma...@apache.org>
> >> AuthorDate: Wed Nov 11 20:43:04 2020 +0000
> >>
> >>     Fix NIO concurrency issue that removes connections from the poller.
> >>
> >>     This is the source of the intermittent WebSocket test failure so this
> >>     commit also removes the associated debug code for that issue.
> >>
> >
> > Great fix. I never expected this one ...
>
> Thanks. It took me long enough to find it.
>
> It only occurred in roughly 1 of every 15 test runs. When a full test run takes
> ~8.5 mins it is a slow process. I was trying to narrow down the set of
> tests that triggered it but it was hard to determine if the failure was
> still being triggered. After about a day of getting nowhere I decided to
> start from the other end and ran the single test in a loop until it
> failed. That meant I could reproduce the failure in less than a minute.
> Things moved a lot faster from that point.
>
> Once I could reproduce the issue, it was just a case of adding debug
> statements to track down the root cause. Some of those statements
> altered the timing enough to prevent the failure but even that helped as
> it meant the issue was occurring after that point.
>
> The root cause surprised me as well. I'd suspected some sort of issue
> along these lines and had been looking at the source code during the
> longer test runs. Knowing NioChannel instances were being re-used I'd
> explicitly looked for places where this sort of mix-up could happen and
> completely failed to find this one.
>
> It will be interesting to see if any other intermittent issues disappear,
> suggesting they had the same root cause.
>

That's interesting, I'll review some more with that closed channel behavior
in mind.

Rémy
