On 29/04/2023 2:34 pm, Marek Marczykowski-Górecki wrote:
> On Sat, Apr 29, 2023 at 12:41:26PM +0100, [email protected] wrote:
>> On 29/04/2023 4:05 am, Stefano Stabellini wrote:
>>> On Fri, 28 Apr 2023, GitLab wrote:
>>>> Pipeline #852233694 triggered by
>>>> Ganis
>>>> had 3 failed jobs
>>>> Failed jobs
>>>> ✖
>>>> test
>>>> qemu-smoke-dom0less-arm64-gcc
>>> This is a real failure on staging. Unfortunately it is intermittent. It
>>> usually happens once every 3-8 tests for me.
>>>
>>> The test script is:
>>> automation/scripts/qemu-smoke-dom0less-arm64.sh
>>>
>>> and for this test it is invoked without arguments. It is starting 2
>>> dom0less VMs in parallel, then dom0 does an xl network-attach and the
>>> domU is supposed to set up eth0 and ping.
>>>
>>> The failure is that nothing happens after "xl network-attach". The domU
>>> never hotplugs any interfaces. I have logs that show that eth0 never
>>> shows up and the only interface is lo no matter how long we wait.
>>>
>>>
>>> On a hunch, I removed Alejandro's patches. Without them, I ran 20 tests
>>> without any failures. I have not investigated further but it looks like
>>> one of these 4 commits is the problem:
>>>
>>> 2023-04-28 11:41 Alejandro Vallejo    tools: Make init-xenstore-domain use xc_domain_getinfolist()
>>> 2023-04-28 11:41 Alejandro Vallejo    tools: Refactor console/io.c to avoid using xc_domain_getinfo()
>>> 2023-04-28 11:41 Alejandro Vallejo    tools: Create xc_domain_getinfo_single()
>>> 2023-04-28 11:41 Alejandro Vallejo    tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol
>> In commit order (reverse of above), these patches are:
>>
>> 1) Modify the python bindings and xenbaked
>> 2) Introduce a new library function with a better API/ABI
>> 3) Modify xenconsoled
>> 4) Modify init-xenstore-domain
>>
>> The test isn't using anything from 4 or 1, and 2 definitely isn't
>> breaking anything on its own.
>>
>> That just leaves 3.  This test does activate xenconsoled by virtue
>> of invoking xencommons, but that doesn't help explain why a change in
>> xenconsoled interferes (and only intermittently on this one single test)
>> with `xl network-attach`.
>>
>> The xenconsoled change does have a correctness fix in it, requiring
>> xenconsoled to ask for all domains' info in one go.  This does mean it's
>> hypercall-buffering (i.e. bouncing) a 4M array now where previously it
>> was racily figuring out which VMs had come and gone.
> Can it be that xl network-attach fails and that failure is silently
> ignored by the test?

Well, it's ultimately doing a ping test between the two VMs, so the
network-attach is rather important.  I don't see an obvious way for us
to get false negatives like this.
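
That said, if we wanted to rule out a silently ignored failure, a minimal
sketch of making a test step fail loudly could look like the following.
This is only an illustration: run_or_fail is a hypothetical helper, I have
not checked how the script handles the exit status today, and true/false
stand in for the real command so the sketch runs anywhere.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the test script): run a command and
# report a non-zero exit status on stderr instead of carrying on silently.
run_or_fail() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "FAIL: '$*' exited with status $rc" >&2
        return "$rc"
    fi
    echo "ok: $*"
}

# In the test, the attach step would be wrapped as
#   run_or_fail xl network-attach ...
# with whatever arguments the script already passes (not reproduced here).

# Stand-ins for a succeeding and a failing command:
run_or_fail true
run_or_fail false || echo "caught failure, status $?"
```

With a wrapper like this, a failing attach would show up in the log at the
point it happens rather than only as a later ping timeout.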

~Andrew
