Daniel P. Berrangé <berra...@redhat.com> writes:

> On Fri, Oct 18, 2024 at 10:51:08AM -0300, Fabiano Rosas wrote:
>> Daniel P. Berrangé <berra...@redhat.com> writes:
>> 
>> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
>> >> Daniel P. Berrangé <berra...@redhat.com> writes:
>> >> 
>> >> > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
>> >> >> Recent changes to how we invoke the migration tests have
>> >> >> (intentionally) caused them to not be part of the check-qtest target
>> >> >> anymore. Add the check-migration-quick target so we don't lose
>> >> >> migration code testing in this job.
>> >> >
>> >> > But 'check-migration-quick' is only the subset of migration tests,
>> >> > 'check-migration' is all of the migration tests. So surely this is
>> >> > a massive regressions in covage in CI pipelines.
>> >> 
>> >> I'm not sure it is. There are tests there already for all the major
>> >> parts of the code: precopy, postcopy, multifd, socket. Besides, we can
>> >> tweak migration-quick to cover spots where we think we're losing
>> >> coverage.
>> >
>> > Each of the tests in migration-test  were added for a good reason,
>> > generally to address testing gaps where we had functional regressions
>> > in the past. I don't think its a good idea to stop running such tests
>> > in CI as gating on new contributions. Any time we've had optional
>> > tests in QEMU, we've seen repeated regressions in the area in question.
>> >
>> >> Since our CI offers nothing in terms of reproducibility or
>> >> debuggability, I don't think it's productive to have an increasing
>> >> amount of tests running in CI if that means we'll be dealing with
>> >> timeouts and intermittent crashes constantly.
>> >
>> > Test reliability is a different thing. If a particular test is
>> > flaky, it needs to either be fixed or disabled.
>> 
>> The problem is that in this community the idea of "fix" is: wait until
>> someone with the appropriate skill level and interest stumbles upon the
>> problem on their own and fix it in anger.
>> 
>> For it to be a proper strategy, we'd need to create an issue in gitlab
>> referencing the bug, have a proper reproducer and encourage contributors
>> to work on the issue.
>
> It is policy that we should be creating issues in gitlab for any
> flaky tests. I wouldn't say we've been perfect at that, but we
> should be doing it, and that link ought to be linked in the code
> if we disable the test there.
>
>> - It was disabled in March 2023 and stood there *not testing anything*
>>   while a major refactoring of the test code was happening.
>> 
>> - The test was fixed in June 2023, but not reenabled in fear of getting
>>   flak from the community for breaking CI again (or at least that's the
>>   feeling I got from talking to Juan).
>> 
>> - mapped-ram (which relies entirely on multifd) started being worked on
>>   and I had to enable the test in my own branch to be able to test the
>>   code properly. While disabled, it caught several issues in mapped-ram.
>> 
>> - In October 2023 the test is re-enabled an immediately exposes issues
>>   in the code.
>> 
>> This is how I started working on the migration code. Maybe you can
>> appreciate why I don't feel confident about this fix or disable
>> strategy. It has eaten many hours of my work.
>
> The migration subsystem was definitely suffering from insufficient
> maintainer resources available historically, which is reflected
> in some of the testing problems we've had & largely how I ended
> up spending so much time on migration code too.
>
>> > Looking at its execution time right now, I'd say migration test
>> > is pretty good, considering the permutations we have to target.
>> >
>> > It gets a bad reputation because historically it has been as
>> > much as x20 slower than it is today, and has also struggled
>> > with reliability. The latter is a reflection of the complexity
>> > of migration and and IMHO actually justifies greater testing,
>> > as long as we put in time to address bugs.
>> >
>> > Also we've got one single test program, covering an entire
>> > subsystem in one go, rather than lots of short individual
>> > test programs, so migration unfairly gets blamed for being
>> > slow, when it simply covers alot of functionality in one
>> > program.
>> 
>> And still you think it's not worth it having a separate target for
>> testing migration. FWIW, I also proposed splittling it into multiple
>> meson tests, which you also rejected. It would be so much easier to move
>> all of this into a separate target and let those who want nothing do to
>> with to just ignore it.
>
> In the qtests/meson.build, I see we register separate
> suites - a generic "qtest" suite, and a "qtest-$TARGET"
> suite. What's missing here is a suite for subsystem
> classification, which I guess is more or less what you
> proposed here for migration.
>
> How about (in addition to the idea of splitting migration-test
> into one part run for all targets, and one part run for just
> one target), we generalize this concept to work for any
> subsystem tagging
>
> qtest_subsystems = {
>   'migration-test': ['migration'],
>   'cdrom-test': ['block'],
>   'ahci-test': ['block'],
>   ....
> }

This is interesting because it would allow a test to be considered in
more than one subsystem. There are many tests that invoke migration
themselves.

>
>
> then when registering tests we could do
>
>    suites = ['qtest', 'qtest-' + target_base]
>    foreach subsys: qtest_subsystems.get(test, [])
>      suites += ['qtest-' + subsys, 'qtest-' + target_base + '-' + subsys]

Hopefully meson won't take this as a hint to construct a 1000 characters
long line when invoking the test.

>    endforeach
>
>    test(....
>          suite: suites)
>
> that would give us a way to run just the migration tests, or
> just the migration tests on x86_64, etc, likewise for other
> subsystems we want to tag, while still keeping 'make check-qtest'
> as the full coverage.
>
>> >> > Any tests in tree need to be exercised by CI as the minimum bar
>> >> > to prevent bit rot from merges.
>> >> >
>> >> 
>> >> No disagreement here. But then I'm going to need advice on what to do
>> >> when other maintainers ask us to stop writing migration tests because
>> >> they take too long. I cannot send contributors away nor merge code
>> >> without tests.
>> >
>> > In general, I think it is unreasonable for other maintainers to
>> > tell us to stop adding test coverage for migration, and would
>> > push back against such a request. 
>> 
>> This might be a larger issue in QEMU. I also heard the same back in 2021
>> when doing ppc work: "don't add too many tests because the CI buckles
>> and people get mad". The same with adding too much logging really. We're
>> hostages to the gitlab CI unfortunately.
>
> Yep, we need more investment in our CI resources. There were some
> ideas discussed at KVM Forum that could help with this, which
> should be publicised soon.

Great. Looking forward to those.

>
>
>> > This feels like something that should be amenable to unit testing.
>> > Might need a little re-factoring of migration code to make it
>> > easier to run a subset of its logic in isolation, but that'd
>> > probably be a win anyway, as such work usually makes code cleaner.
>> 
>> I'll invest some time in that. I've no idea how we do unit testing in
>> QEMU.
>
> Mostly starts with the standard glib test program setup, where by
> you create a tests/unit/test-<blah>.c file, with functions for
> each test case that you register with g_test_add, same as qtest.
>
> The key difference from qtest is that you then just directly
> link the test binary to the code to be tested, and call into
> it internal QEMU APIs directly. In this case, it would involve
> linking to the 'libmigration' meson static library object.

Thanks for the introduction. I'll pick some self-contained part of
migration and see how far we are from being able to write unit tests.

>
>
> With regards,
> Daniel

Reply via email to