The GH CI build also fails randomly on commits that were green when I pushed 
them. Occasionally this is a genuine platform-specific issue (which is exactly 
what I'd want to be discovering from CI), but the failing tests are generally 
fragile ones to begin with. While I have ideas on how to make those areas of 
the codebase more testable, some of them can't be tested more reliably without 
drastic measures; specifically, our tests around the rolling file appenders 
and anything else that relies on scheduled actions.
Besides using an actual file system in the tests (these could run faster 
against a virtual file system like [jimfs][0] instead), the largest limitation 
here is that we rely on a ScheduledThreadPoolExecutor for scheduling actions 
like rollovers. That class calls System::nanoTime directly for its scheduling 
decisions, so the only ways to mock the clock are to run the test in a 
modified runtime where the system calls behind nanoTime are stubbed out, or to 
find a similar OSS implementation (or adapt the JDK classes here, if they're 
public domain as the Doug Lea ones usually are) that allows the clock to be 
controlled. I've considered the latter option before, but it's not something 
to be taken lightly, since we already expect the platform's 
ScheduledThreadPoolExecutor to be stable and correct (or hopefully correct; 
it's far more likely to have been thoroughly reviewed than something custom 
would be). In fact, that alternative path is basically porting or 
reimplementing the core of Quartz, so come to think of it, there is some 
reference code to adapt there.
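
To illustrate what a controllable clock buys us, here's a minimal sketch 
(hypothetical code, not anything in Log4j or the JDK) of a scheduler that 
consults a manually advanced clock instead of System::nanoTime, so a test can 
make rollover timing fully deterministic:

```java
import java.time.Duration;
import java.util.PriorityQueue;

/**
 * Hypothetical sketch: a single-threaded scheduler whose clock is advanced
 * by the test rather than read from System.nanoTime().
 */
public class ManualClockScheduler {

    private static final class Task implements Comparable<Task> {
        final long triggerNanos;
        final Runnable action;

        Task(long triggerNanos, Runnable action) {
            this.triggerNanos = triggerNanos;
            this.action = action;
        }

        @Override
        public int compareTo(Task other) {
            return Long.compare(triggerNanos, other.triggerNanos);
        }
    }

    private final PriorityQueue<Task> tasks = new PriorityQueue<>();
    private long nowNanos; // the fake clock; never touches System.nanoTime()

    /** Schedule an action to run after the given delay on the fake clock. */
    public void schedule(Runnable action, Duration delay) {
        tasks.add(new Task(nowNanos + delay.toNanos(), action));
    }

    /** Advance the fake clock, running any tasks that have become due. */
    public void advanceBy(Duration elapsed) {
        nowNanos += elapsed.toNanos();
        while (!tasks.isEmpty() && tasks.peek().triggerNanos <= nowNanos) {
            tasks.poll().action.run();
        }
    }
}
```

A test would schedule a rollover, call advanceBy past the trigger time, and 
assert the action fired, with no real wall-clock time elapsing and no 
flakiness from thread scheduling.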

On the file system side, this would require either adopting the NIO2 API more 
directly in place of the original IO API, or introducing a similar abstraction 
that makes mocking a file system easier. Separate integration tests could then 
still exercise a real file system where that extra realism is needed.
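
As a sketch of what that looks like in practice (the class and method names 
here are hypothetical, not the real appender code), a component written 
against java.nio.file.Path never names the host file system, so a test can 
hand it a Path from an in-memory file system like jimfs while production 
hands it a Path from the default file system:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical sketch: a log writer coded against NIO2 Path so the file
 * system is injected by the caller (real disk in production, jimfs in tests).
 */
public class PathLogWriter {

    private final Path logFile;

    /** The caller supplies any Path — a disk path or an in-memory one. */
    public PathLogWriter(Path logFile) {
        this.logFile = logFile;
    }

    /** Append one line, creating parent directories on first use. */
    public void append(String line) throws IOException {
        Path parent = logFile.getParent();
        if (parent != null) {
            Files.createDirectories(parent); // same behavior on any FileSystem
        }
        Files.writeString(logFile, line + System.lineSeparator(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

In a test, `Jimfs.newFileSystem(Configuration.unix())` would supply the Path, 
avoiding both disk I/O and Windows file-locking behavior entirely.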

Suffice it to say, there’s usually a reason why I get into refactoring in this 
project, and most of the time it involves making things easier to test and 
extend, both within the project and for downstream users. The rolling appender 
area is something I’d like to get to sometime, but we’ve got about ten years’ 
worth of development to untangle (which we’ve largely accomplished in the main 
branch already with the modules there, though we can still break out more 
modules that involve more complicated refactoring).

[0]: https://github.com/google/jimfs
—
Matt Sicker

> On Nov 20, 2023, at 08:40, Gary Gregory <garydgreg...@gmail.com> wrote:
> 
> The GH CI builds on every push (as opposed to commit) IIRC.
> 
> Gary
> 
> On Mon, Nov 20, 2023, 9:34 AM Ralph Goers <ralph.go...@dslextreme.com>
> wrote:
> 
>> Gary uses Windows as his development OS. He is probably the only one of us
>> who does. So he inevitably finds these issues before the rest of us.
>> 
>> I don’t know if the CI has a build that runs on Windows for every commit.
>> 
>> Ralph
>> 
>>> On Nov 20, 2023, at 6:12 AM, Robert Middleton <rmiddle...@apache.org>
>> wrote:
>>> 
>>> Are the tests run on Windows through Github workflows?  It doesn't
>>> look like it to me.
>>> 
>>> If you need access to a Windows machine, you can download a
>>> development VM straight from Microsoft:
>>> 
>> https://developer.microsoft.com/en-us/windows/downloads/virtual-machines/
>>> 
>>> -Robert Middleton
>>> 
>>> On Mon, Nov 20, 2023 at 7:40 AM Apache <ralph.go...@dslextreme.com>
>> wrote:
>>>> 
>>>> In my experience they never get fixed. To be honest, when I was doing
>> the releases I would have these failures investigated to determine if it
>> was a trait problem vs a problem in the code being released. If it was the
>> latter I would cancel the vote. The only time tests should be disabled is
>> if we know it is a problem in the test but can’t figure out how to fix it.
>>>> 
>>>> I also don’t ever recall Gary ever having more than one or two tests
>> fail in a run.
>>>> 
>>>> Ralph
>>>> 
>>>>> On Nov 20, 2023, at 5:00 AM, Volkan Yazıcı <vol...@yazi.ci> wrote:
>>>>> 
>>>>> I am not asking to disable Windows tests. I am asking to disable tests
>>>>> and only those tests that have a failure rate on Windows higher than,
>>>>> say, 30%. To be precise, I think there are 2-3 of them dealing with
>>>>> network sockets and rolling file appenders. I am not talking about
>>>>> dozens or such.
>>>>> 
>>>>> After disabling them, we can create a ticket referencing them. So that
>>>>> interested parties can fix them.
>>>>> 
>>>>>> On Mon, Nov 20, 2023 at 12:25 PM Piotr P. Karwasz
>>>>>> <piotr.karw...@gmail.com> wrote:
>>>>>> 
>>>>>> Hi Volkan,
>>>>>> 
>>>>>>> On Mon, 20 Nov 2023 at 09:36, Volkan Yazıcı <vol...@yazi.ci> wrote:
>>>>>>> 
>>>>>>> As Gary (the only Windows user among the active Log4j maintainers,
>>>>>>> AFAIK) has noticed several times, Log4j tests on Windows are pretty
>>>>>>> unstable. It not only fails on Gary's laptop, but Piotr and I need to
>>>>>>> give Windows tests in CI a kick on a regular basis. Approximately one
>>>>>>> out of three CI runs fails on Windows. Piotr already improved the
>>>>>>> situation extensively, though there are still several leftovers that
>>>>>>> need attention.
>>>>>>> 
>>>>>>> Unless somebody steps up to improve the unstable Windows tests, I
>>>>> would like to disable those only for the Windows platform.
>>>>>> 
>>>>>> Please don't. Windows has an annoying file locking policy that
>>>>>> prevents users from deleting files with open file descriptors, but
>>>>>> that is one of the few ways to detect resource leakage we have.
>>>>>> 
>>>>>> Tests running on *NIXes will ignore problems with open file
>>>>>> descriptors and delete the log files, but on a production system those
>>>>>> leaks will accumulate and cause application crashes. We had such a
>>>>>> leak, when we used `URLConnection#getLastModified` on a `jar:...` URL.
>>>>>> This call caused file descriptor exhaustion on both Windows and
>>>>>> *NIXes, but only the Windows test was able to detect it.
>>>>>> 
>>>>>> Piotr,
>>>>>> who never thought would ever defend Microsoft Windows.
>>>>>> 
>>>>>> PS: Gary reports the failures, but always runs the build again until
>>>>>> it succeeds, even on Friday 13th, when he had to wait until Saturday
>>>>>> 14th for the test run to succeed.
>>>> 
>> 
>> 
