Daniel P. Berrangé <berra...@redhat.com> writes: > On Fri, Oct 18, 2024 at 10:51:08AM -0300, Fabiano Rosas wrote: >> Daniel P. Berrangé <berra...@redhat.com> writes: >> >> > On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote: >> >> Daniel P. Berrangé <berra...@redhat.com> writes: >> >> >> >> > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote: >> >> >> Recent changes to how we invoke the migration tests have >> >> >> (intentionally) caused them to not be part of the check-qtest target >> >> >> anymore. Add the check-migration-quick target so we don't lose >> >> >> migration code testing in this job. >> >> > >> >> > But 'check-migration-quick' is only the subset of migration tests, >> >> > 'check-migration' is all of the migration tests. So surely this is >> >> > a massive regressions in covage in CI pipelines. >> >> >> >> I'm not sure it is. There are tests there already for all the major >> >> parts of the code: precopy, postcopy, multifd, socket. Besides, we can >> >> tweak migration-quick to cover spots where we think we're losing >> >> coverage. >> > >> > Each of the tests in migration-test were added for a good reason, >> > generally to address testing gaps where we had functional regressions >> > in the past. I don't think its a good idea to stop running such tests >> > in CI as gating on new contributions. Any time we've had optional >> > tests in QEMU, we've seen repeated regressions in the area in question. >> > >> >> Since our CI offers nothing in terms of reproducibility or >> >> debuggability, I don't think it's productive to have an increasing >> >> amount of tests running in CI if that means we'll be dealing with >> >> timeouts and intermittent crashes constantly. >> > >> > Test reliability is a different thing. If a particular test is >> > flaky, it needs to either be fixed or disabled. >> >> The problem is that in this community the idea of "fix" is: wait until >> someone with the appropriate skill level and interest stumbles upon the >> problem on their own and fix it in anger. >> >> For it to be a proper strategy, we'd need to create an issue in gitlab >> referencing the bug, have a proper reproducer and encourage contributors >> to work on the issue. > > It is policy that we should be creating issues in gitlab for any > flaky tests. I wouldn't say we've been perfect at that, but we > should be doing it, and that link ought to be linked in the code > if we disable the test there. > >> - It was disabled in March 2023 and stood there *not testing anything* >> while a major refactoring of the test code was happening. >> >> - The test was fixed in June 2023, but not reenabled in fear of getting >> flak from the community for breaking CI again (or at least that's the >> feeling I got from talking to Juan). >> >> - mapped-ram (which relies entirely on multifd) started being worked on >> and I had to enable the test in my own branch to be able to test the >> code properly. While disabled, it caught several issues in mapped-ram. >> >> - In October 2023 the test is re-enabled an immediately exposes issues >> in the code. >> >> This is how I started working on the migration code. Maybe you can >> appreciate why I don't feel confident about this fix or disable >> strategy. It has eaten many hours of my work. > > The migration subsystem was definitely suffering from insufficient > maintainer resources available historically, which is reflected > in some of the testing problems we've had & largely how I ended > up spending so much time on migration code too. > >> > Looking at its execution time right now, I'd say migration test >> > is pretty good, considering the permutations we have to target. >> > >> > It gets a bad reputation because historically it has been as >> > much as x20 slower than it is today, and has also struggled >> > with reliability. The latter is a reflection of the complexity >> > of migration and and IMHO actually justifies greater testing, >> > as long as we put in time to address bugs. >> > >> > Also we've got one single test program, covering an entire >> > subsystem in one go, rather than lots of short individual >> > test programs, so migration unfairly gets blamed for being >> > slow, when it simply covers alot of functionality in one >> > program. >> >> And still you think it's not worth it having a separate target for >> testing migration. FWIW, I also proposed splittling it into multiple >> meson tests, which you also rejected. It would be so much easier to move >> all of this into a separate target and let those who want nothing do to >> with to just ignore it. > > In the qtests/meson.build, I see we register separate > suites - a generic "qtest" suite, and a "qtest-$TARGET" > suite. What's missing here is a suite for subsystem > classification, which I guess is more or less what you > proposed here for migration. > > How about (in addition to the idea of splitting migration-test > into one part run for all targets, and one part run for just > one target), we generalize this concept to work for any > subsystem tagging > > qtest_subsystems = { > 'migration-test': ['migration'], > 'cdrom-test': ['block'], > 'ahci-test': ['block'], > .... > }
This is interesting because it would allow a test to be considered in more than one subsystem. There are many tests that invoke migration themselves. > > > then when registering tests we could do > > suites = ['qtest', 'qtest-' + target_base] > foreach subsys: qtest_subsystems.get(test, []) > suites += ['qtest-' + subsys, 'qtest-' + target_base + '-' + subsys] Hopefully meson won't take this as a hint to construct a 1000 characters long line when invoking the test. > endforeach > > test(.... > suite: suites) > > that would give us a way to run just the migration tests, or > just the migration tests on x86_64, etc, likewise for other > subsystems we want to tag, while still keeping 'make check-qtest' > as the full coverage. > >> >> > Any tests in tree need to be exercised by CI as the minimum bar >> >> > to prevent bit rot from merges. >> >> > >> >> >> >> No disagreement here. But then I'm going to need advice on what to do >> >> when other maintainers ask us to stop writing migration tests because >> >> they take too long. I cannot send contributors away nor merge code >> >> without tests. >> > >> > In general, I think it is unreasonable for other maintainers to >> > tell us to stop adding test coverage for migration, and would >> > push back against such a request. >> >> This might be a larger issue in QEMU. I also heard the same back in 2021 >> when doing ppc work: "don't add too many tests because the CI buckles >> and people get mad". The same with adding too much logging really. We're >> hostages to the gitlab CI unfortunately. > > Yep, we need more investment in our CI resources. There were some > ideas discussed at KVM Forum that could help with this, which > should be publicised soon. Great. Looking forward to those. > > >> > This feels like something that should be amenable to unit testing. >> > Might need a little re-factoring of migration code to make it >> > easier to run a subset of its logic in isolation, but that'd >> > probably be a win anyway, as such work usually makes code cleaner. >> >> I'll invest some time in that. I've no idea how we do unit testing in >> QEMU. > > Mostly starts with the standard glib test program setup, where by > you create a tests/unit/test-<blah>.c file, with functions for > each test case that you register with g_test_add, same as qtest. > > The key difference from qtest is that you then just directly > link the test binary to the code to be tested, and call into > it internal QEMU APIs directly. In this case, it would involve > linking to the 'libmigration' meson static library object. Thanks for the introduction. I'll pick some self-contained part of migration and see how far we are from being able to write unit tests. > > > With regards, > Daniel