On 29/04/2023 2:34 pm, Marek Marczykowski-Górecki wrote:
> On Sat, Apr 29, 2023 at 12:41:26PM +0100, [email protected] wrote:
>> On 29/04/2023 4:05 am, Stefano Stabellini wrote:
>>> On Fri, 28 Apr 2023, GitLab wrote:
>>>> Pipeline #852233694 triggered by
>>>> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon]
>>>> Ganis
>>>> had 3 failed jobs
>>>> Failed jobs
>>>> ✖
>>>> test
>>>> qemu-smoke-dom0less-arm64-gcc
>>> This is a real failure on staging. Unfortunately it is intermittent. It
>>> usually happens once every 3-8 tests for me.
>>>
>>> The test script is:
>>> automation/scripts/qemu-smoke-dom0less-arm64.sh
>>>
>>> and for this test it is invoked without arguments. It is starting 2
>>> dom0less VMs in parallel, then dom0 does a xl network-attach and the
>>> domU is supposed to set up eth0 and ping.
>>>
>>> The failure is that nothing happens after "xl network-attach". The domU
>>> never hotplugs any interfaces. I have logs that show that eth0 never
>>> shows up and the only interface is lo no matter how long we wait.
>>>
>>>
>>> On a hunch, I removed Alejandro's patches. Without them, I ran 20 tests
>>> without any failures. I have not investigated further but it looks like
>>> one of these 4 commits is the problem:
>>>
>>> 2023-04-28 11:41 Alejandro Vallejo tools: Make init-xenstore-domain use
>>> xc_domain_getinfolist()
>>> 2023-04-28 11:41 Alejandro Vallejo tools: Refactor console/io.c to avoid
>>> using xc_domain_getinfo()
>>> 2023-04-28 11:41 Alejandro Vallejo tools: Create
>>> xc_domain_getinfo_single()
>>> 2023-04-28 11:41 Alejandro Vallejo tools: Make some callers of
>>> xc_domain_getinfo() use xc_domain_getinfol
>> In commit order (reverse of above), these patches are:
>>
>> 1) Modify the python bindings and xenbaked
>> 2) Introduce a new library function with a better API/ABI
>> 3) Modify xenconsoled
>> 4) Modify init-xenstore-domain
>>
>> The test isn't using anything from 4 or 1, and 2 definitely isn't
>> breaking anything on its own.
>>
>> That just leaves 3. This test does activate xenconsoled by virtue
>> of invoking xencommons, but that doesn't help explain why a change in
>> xenconsoled interferes (and only intermittently on this one single test)
>> with `xl network-attach`.
>>
>> The xenconsoled change does have a correctness fix in it, requiring
>> xenconsoled to ask for all domains' info in one go. This does mean it's
>> hypercall-buffering (i.e. bouncing) a 4M array now where previously it
>> was racy figuring out which VMs had come and gone.
> Can it be that xl network-attach fails and that failure is silently
> ignored by the test?
Well, it's ultimately doing a ping test between the two VMs, so the
network-attach is rather important. I don't see an obvious way for us
to get false negatives like this.

~Andrew
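
For illustration only, here is a minimal sketch of how a harness could make a failing step loud instead of silently ignored. It is Python rather than the shell the real script uses, `run_step` is a hypothetical helper, and the commands in the usage part are stand-ins; none of this is taken from automation/scripts/qemu-smoke-dom0less-arm64.sh.

```python
import subprocess
import sys


def run_step(argv):
    """Run one test step (hypothetical helper) and raise if it exits non-zero."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure instead of carrying on to the next step.
        raise RuntimeError(
            f"step {argv!r} failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in commands; a real harness would pass the actual step here.
    print(run_step([sys.executable, "-c", "print('attached')"]), end="")
    try:
        run_step([sys.executable, "-c", "import sys; sys.exit(3)"])
    except RuntimeError as err:
        print(f"caught: {err}")
```

With a check like this, a failing attach would abort the run at that step, so it could be told apart from a domU that simply never sees the interface.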
