On Thu, Oct 17, 2024 at 01:29:35PM -0300, Fabiano Rosas wrote: > Daniel P. Berrangé <berra...@redhat.com> writes: > > > On Thu, Oct 17, 2024 at 11:32:11AM -0300, Fabiano Rosas wrote: > >> Recent changes to how we invoke the migration tests have > >> (intentionally) caused them to not be part of the check-qtest target > >> anymore. Add the check-migration-quick target so we don't lose > >> migration code testing in this job. > > > > But 'check-migration-quick' is only the subset of migration tests, > > 'check-migration' is all of the migration tests. So surely this is > > a massive regressions in covage in CI pipelines. > > I'm not sure it is. There are tests there already for all the major > parts of the code: precopy, postcopy, multifd, socket. Besides, we can > tweak migration-quick to cover spots where we think we're losing > coverage.
Each of the tests in migration-test were added for a good reason, generally to address testing gaps where we had functional regressions in the past. I don't think its a good idea to stop running such tests in CI as gating on new contributions. Any time we've had optional tests in QEMU, we've seen repeated regressions in the area in question. > Since our CI offers nothing in terms of reproducibility or > debuggability, I don't think it's productive to have an increasing > amount of tests running in CI if that means we'll be dealing with > timeouts and intermittent crashes constantly. Test reliability is a different thing. If a particular test is flaky, it needs to either be fixed or disabled. Splitting into a fast & slow grouping doesn't address reliability, just hides the problem from view. > > Experience shows us that relying on humans to run tests periodically > > doesn't work well, and they'll slowly bit rot. Migration maintainers > > don't have a way to run this as gating test for every pull request > > that merges, and pull requests coming from non-migration maintainers > > can still break migration code. > > Right, but migration code would still be tested with migration-quick > which is executed at every make check. Do we really need the full set in > every pull request? We must draw a line somewhere, otherwise make check > will just balloon in duration. Again, the tests all exist because migration code is incredibly complicated, with a lot of permutations, with a history of being very bug / regression prone. With that in mind, it is unavoidable that we're going to have a significant testing overhead for migration code. Looking at its execution time right now, I'd say migration test is pretty good, considering the permutations we have to target. It gets a bad reputation because historically it has been as much as x20 slower than it is today, and has also struggled with reliability. The latter is a reflection of the complexity of migration and and IMHO actually justifies greater testing, as long as we put in time to address bugs. Also we've got one single test program, covering an entire subsystem in one go, rather than lots of short individual test programs, so migration unfairly gets blamed for being slow, when it simply covers alot of functionality in one program. > > Any tests in tree need to be exercised by CI as the minimum bar > > to prevent bit rot from merges. > > > > No disagreement here. But then I'm going to need advice on what to do > when other maintainers ask us to stop writing migration tests because > they take too long. I cannot send contributors away nor merge code > without tests. In general, I think it is unreasonable for other maintainers to tell us to stop adding test coverage for migration, and would push back against such a request. We should, however, continue to optimize how we add further test coverage, where practical, overload testing of multiple features onto a single test case helps. We've already massively optimized the migration-test compared to its historical behaviour. A potentially bigger win could be seen if we change how we exercise the migration functionality. Since we had the migration qtest that runs a full migration operation, we've tended to expand testing by adding new qtest functions. ie we've added a functional test for everything we want covered. This is nice & simple, but also expensive. We've ignored unit testing, which I think is a mistake. If i look at the test list: # /x86_64/migration/bad_dest # /x86_64/migration/analyze-script # /x86_64/migration/validate_uuid # /x86_64/migration/validate_uuid_error # /x86_64/migration/validate_uuid_src_not_set # /x86_64/migration/validate_uuid_dst_not_set # /x86_64/migration/dirty_ring # /x86_64/migration/precopy/file # /x86_64/migration/precopy/unix/plain # /x86_64/migration/precopy/unix/suspend/live # /x86_64/migration/precopy/unix/suspend/notlive # /x86_64/migration/precopy/unix/tls/psk # /x86_64/migration/precopy/unix/tls/x509/default-host # /x86_64/migration/precopy/unix/tls/x509/override-host # /x86_64/migration/precopy/file/offset # /x86_64/migration/precopy/file/mapped-ram # /x86_64/migration/precopy/file/offset/fdset # /x86_64/migration/precopy/file/offset/bad # /x86_64/migration/precopy/file/mapped-ram/live # /x86_64/migration/precopy/tcp/plain # /x86_64/migration/precopy/tcp/plain/switchover-ack # /x86_64/migration/precopy/tcp/tls/psk/match # /x86_64/migration/precopy/tcp/tls/psk/mismatch # /x86_64/migration/precopy/tcp/tls/x509/default-host # /x86_64/migration/precopy/tcp/tls/x509/override-host # /x86_64/migration/precopy/tcp/tls/x509/mismatch-host # /x86_64/migration/precopy/tcp/tls/x509/friendly-client # /x86_64/migration/precopy/tcp/tls/x509/hostile-client # /x86_64/migration/precopy/tcp/tls/x509/allow-anon-client # /x86_64/migration/precopy/tcp/tls/x509/reject-anon-client # /x86_64/migration/precopy/fd/tcp # /x86_64/migration/precopy/fd/file # /x86_64/migration/multifd/file/mapped-ram # /x86_64/migration/multifd/file/mapped-ram/live # /x86_64/migration/multifd/file/mapped-ram/dio # /x86_64/migration/multifd/file/mapped-ram/fdset # /x86_64/migration/multifd/file/mapped-ram/fdset/dio # /x86_64/migration/multifd/tcp/uri/plain/none # /x86_64/migration/multifd/tcp/channels/plain/none # /x86_64/migration/multifd/tcp/plain/cancel # /x86_64/migration/multifd/tcp/plain/zlib # /x86_64/migration/multifd/tcp/plain/zstd # /x86_64/migration/multifd/tcp/plain/zero-page/legacy # /x86_64/migration/multifd/tcp/plain/zero-page/none # /x86_64/migration/multifd/tcp/tls/psk/match # /x86_64/migration/multifd/tcp/tls/psk/mismatch # /x86_64/migration/multifd/tcp/tls/x509/default-host # /x86_64/migration/multifd/tcp/tls/x509/override-host # /x86_64/migration/multifd/tcp/tls/x509/mismatch-host # /x86_64/migration/multifd/tcp/tls/x509/allow-anon-client # /x86_64/migration/multifd/tcp/tls/x509/reject-anon-client # /x86_64/migration/validate_uri/channels/both_set # /x86_64/migration/validate_uri/channels/none_set Individually none of those is very slow on its own - 10 are in the 2-3 second range, 35 are 1-2 secs, and 6 are less than 1 second. A very large portion of those are validating different ways to establish migration. Hardly any of them actually need to run a migration to completion. Even without running to completion though, we have the overheads of spawning 2 QEMUs. This feels like something that should be amenable to unit testing. Might need a little re-factoring of migration code to make it easier to run a subset of its logic in isolation, but that'd probably be a win anyway, as such work usually makes code cleaner. > Do we need nightly CI runs? Unit tests? Bear in mind there's a resource > allocation issue there. Addressing problems with timeouts/races in our > CI is not something any random person can do. In terms of running time, I think migration-test is acceptable as it is to run in 'make check' by default and doesn't justify dropping test coverage. We should still look to optimize & move to unit testing more code, and any reliability issues are something that needs to be addressed too. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|