Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job

Daniel P . Berrangé Fri, 18 Oct 2024 07:22:28 -0700

On Fri, Oct 18, 2024 at 10:51:08AM -0300, Fabiano Rosas wrote:
> Daniel P. Berrangé <berra...@redhat.com> writes:
> 
> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote:
> >> Daniel P. Berrangé <berra...@redhat.com> writes:
> >> 
> >> > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote:
> >> >> Recent changes to how we invoke the migration tests have
> >> >> (intentionally) caused them to not be part of the check-qtest target
> >> >> anymore. Add the check-migration-quick target so we don't lose
> >> >> migration code testing in this job.
> >> >
> >> > But 'check-migration-quick' is only the subset of migration tests,
> >> > 'check-migration' is all of the migration tests. So surely this is
> >> > a massive regressions in covage in CI pipelines.
> >> 
> >> I'm not sure it is. There are tests there already for all the major
> >> parts of the code: precopy, postcopy, multifd, socket. Besides, we can
> >> tweak migration-quick to cover spots where we think we're losing
> >> coverage.
> >
> > Each of the tests in migration-test  were added for a good reason,
> > generally to address testing gaps where we had functional regressions
> > in the past. I don't think its a good idea to stop running such tests
> > in CI as gating on new contributions. Any time we've had optional
> > tests in QEMU, we've seen repeated regressions in the area in question.
> >
> >> Since our CI offers nothing in terms of reproducibility or
> >> debuggability, I don't think it's productive to have an increasing
> >> amount of tests running in CI if that means we'll be dealing with
> >> timeouts and intermittent crashes constantly.
> >
> > Test reliability is a different thing. If a particular test is
> > flaky, it needs to either be fixed or disabled.
> 
> The problem is that in this community the idea of "fix" is: wait until
> someone with the appropriate skill level and interest stumbles upon the
> problem on their own and fix it in anger.
> 
> For it to be a proper strategy, we'd need to create an issue in gitlab
> referencing the bug, have a proper reproducer and encourage contributors
> to work on the issue.


It is policy that we should be creating issues in gitlab for any
flaky tests. I wouldn't say we've been perfect at that, but we
should be doing it, and that link ought to be linked in the code
if we disable the test there.

> - It was disabled in March 2023 and stood there *not testing anything*
>   while a major refactoring of the test code was happening.
> 
> - The test was fixed in June 2023, but not reenabled in fear of getting
>   flak from the community for breaking CI again (or at least that's the
>   feeling I got from talking to Juan).
> 
> - mapped-ram (which relies entirely on multifd) started being worked on
>   and I had to enable the test in my own branch to be able to test the
>   code properly. While disabled, it caught several issues in mapped-ram.
> 
> - In October 2023 the test is re-enabled an immediately exposes issues
>   in the code.
> 
> This is how I started working on the migration code. Maybe you can
> appreciate why I don't feel confident about this fix or disable
> strategy. It has eaten many hours of my work.

The migration subsystem was definitely suffering from insufficient
maintainer resources available historically, which is reflected
in some of the testing problems we've had & largely how I ended
up spending so much time on migration code too.

> > Looking at its execution time right now, I'd say migration test
> > is pretty good, considering the permutations we have to target.
> >
> > It gets a bad reputation because historically it has been as
> > much as x20 slower than it is today, and has also struggled
> > with reliability. The latter is a reflection of the complexity
> > of migration and and IMHO actually justifies greater testing,
> > as long as we put in time to address bugs.
> >
> > Also we've got one single test program, covering an entire
> > subsystem in one go, rather than lots of short individual
> > test programs, so migration unfairly gets blamed for being
> > slow, when it simply covers alot of functionality in one
> > program.
> 
> And still you think it's not worth it having a separate target for
> testing migration. FWIW, I also proposed splittling it into multiple
> meson tests, which you also rejected. It would be so much easier to move
> all of this into a separate target and let those who want nothing do to
> with to just ignore it.

In the qtests/meson.build, I see we register separate
suites - a generic "qtest" suite, and a "qtest-$TARGET"
suite. What's missing here is a suite for subsystem
classification, which I guess is more or less what you
proposed here for migration.

How about (in addition to the idea of splitting migration-test
into one part run for all targets, and one part run for just
one target), we generalize this concept to work for any
subsystem tagging

qtest_subsystems = {
  'migration-test': ['migration'],
  'cdrom-test': ['block'],
  'ahci-test': ['block'],
  ....
}


then when registering tests we could do

   suites = ['qtest', 'qtest-' + target_base]
   foreach subsys: qtest_subsystems.get(test, [])
     suites += ['qtest-' + subsys, 'qtest-' + target_base + '-' + subsys]
   endforeach

   test(....
         suite: suites)

that would give us a way to run just the migration tests, or
just the migration tests on x86_64, etc, likewise for other
subsystems we want to tag, while still keeping 'make check-qtest'
as the full coverage.

> >> > Any tests in tree need to be exercised by CI as the minimum bar
> >> > to prevent bit rot from merges.
> >> >
> >> 
> >> No disagreement here. But then I'm going to need advice on what to do
> >> when other maintainers ask us to stop writing migration tests because
> >> they take too long. I cannot send contributors away nor merge code
> >> without tests.
> >
> > In general, I think it is unreasonable for other maintainers to
> > tell us to stop adding test coverage for migration, and would
> > push back against such a request. 
> 
> This might be a larger issue in QEMU. I also heard the same back in 2021
> when doing ppc work: "don't add too many tests because the CI buckles
> and people get mad". The same with adding too much logging really. We're
> hostages to the gitlab CI unfortunately.

Yep, we need more investment in our CI resources. There were some
ideas discussed at KVM Forum that could help with this, which
should be publicised soon.


> > This feels like something that should be amenable to unit testing.
> > Might need a little re-factoring of migration code to make it
> > easier to run a subset of its logic in isolation, but that'd
> > probably be a win anyway, as such work usually makes code cleaner.
> 
> I'll invest some time in that. I've no idea how we do unit testing in
> QEMU.

Mostly starts with the standard glib test program setup, where by
you create a tests/unit/test-<blah>.c file, with functions for
each test case that you register with g_test_add, same as qtest.

The key difference from qtest is that you then just directly
link the test binary to the code to be tested, and call into
it internal QEMU APIs directly. In this case, it would involve
linking to the 'libmigration' meson static library object.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH 4/4] ci: Add check-migration-quick to the clang job

Reply via email to