[+CC pabs@d.o for autopkgtest infrastructure questions]
On Sat, 02 Nov 2024 12:46:29 +0000 Luca Boccassi <bl...@debian.org> wrote:
> Dear maintainer(s),
>
> The netplan.io autopkgtest on riscv64 fail roughly 50% of the runs. As
> per RT, consistently flaky autopkgtest are RC. They seem to be all
> timeouts, so probably due to riscv64 test machines being very slow, but
> this is just a guess.
>
> https://ci.debian.net/packages/n/netplan.io/testing/riscv64/

Hi Luca,

thanks for bringing up the issue!

Looking into those timeouts a bit more closely, I don't think they are caused by the slowness of the riscv64 runners. The Netplan tests seem to fail consistently, but only on very specific hosts, namely "debci-{30,31,32}", while they pass on any other DebCI host/runner. So the green tests just got lucky running on a proper host.

The common theme between debci-30/31/32 seems to be the kernel, which is a bit older and doesn't seem to be an official Debian kernel:

"testbed running kernel: Linux 6.6.52-win2030 #2024.09.27.00.18+24089c696 SMP Fri Sep 27 00:27:03 UTC 2024"

Also, for the "flaky" routing tests I could find additional logs, like:

2334s ======================================================================
2334s FAIL: test_route_type_local_lp1892272 (__main__.TestNetworkd.test_route_type_local_lp1892272)
2334s ----------------------------------------------------------------------
2334s Traceback (most recent call last):
2334s File "/tmp/autopkgtest-lxc.ieq7nj3w/downtmp/build.iS2/src/tests/integration/routing.py", line 433, in test_route_type_local_lp1892272
2334s self.assertIn(b'local default',
2334s AssertionError: b'local default' not found in b''

Or:

2275s test_route_with_policy (__main__.TestNetworkd.test_route_with_policy) ... eth42 ............................................................● 105: eth42
2275s Link File: /usr/lib/systemd/network/99-default.link
2275s Network File: /run/systemd/network/10-netplan-ethbn.network
2275s State: routable (failed)
2275s Online state: online
2275s Type: ether
2275s Kind: veth
2275s Driver: veth
2275s Hardware Address: fa:10:fc:ec:42:2e
2275s MTU: 1500 (min: 68, max: 65535)
2275s QDisc: noqueue
2275s IPv6 Address Generation Mode: eui64
2275s Number of Queues (Tx/Rx): 4/4
2275s Auto negotiation: no
2275s Speed: 10Gbps
2275s Duplex: full
2275s Port: tp
2275s Address: 10.20.10.1
2275s fe80::f810:fcff:feec:422e
2275s Activation Policy: up
2275s Required For Online: yes
2275s DHCPv6 Client DUID: DUID-EN/Vendor:0000ab11e42fd7030b558586
2275s
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Link UP
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Gained carrier
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Could not add routing policy rule: Rule family not supported. Address family not supported by protocol
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Failed
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Trying to reconfigure the interface.
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Configuring with /run/systemd/network/10-netplan-ethbn.network.
2275s Nov 02 13:17:47 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Could not add routing policy rule: Rule family not supported. Address family not supported by protocol
2275s Nov 02 13:17:47 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Failed

This indicates that it's not actually a timeout (due to slowness), but rather that systemd-networkd failed to configure the interface properly. The log message indicates that the kernel (netlink) might not support the required functionality (i.e. routing tables):

"Nov 02 13:17:47 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Could not add routing policy rule: Rule family not supported. Address family not supported by protocol"

Maybe we're lacking something like "CONFIG_IP_MULTIPLE_TABLES" in this specific kernel? That would also explain the empty stdout of the prior "test_route_type_local_lp1892272" test's "ip route show ..." command.

Similarly, the "flaky" tunnels test logs point to a lack of GRE functionality in this kernel (maybe missing modules?).

For NetworkManager:

3368s .Error: Device 'tun0' not found.
3369s .Error: Device 'tun0' not found.
3370s .Error: Device 'tun0' not found.
[...]

For systemd-networkd:

3498s .Interface "tun0" not found.
3499s .Interface "tun0" not found.
3500s .Interface "tun0" not found.
[...]

The Netplan test just seems to be unable to create any GRE or GRE6 interface. Do we have CONFIG_IPV6_GRE and CONFIG_NET_IPGRE on this kernel?

I could probably detect such cases and skip over those tests in Netplan, but I think this would degrade the usefulness of those tests. Autopkgtests of other packages relying on GRE/6 or routing tables will most probably also fail on those DebCI test runners.

IMO we should be able to expect a relatively consistent DebCI environment, independent of the specific runner. As the Netplan tests pass on most other runners (riscv64 and other arches), I think we should rather fix (or disable?) those debci-30/31/32 hosts.

Pabs, what do you think about this? Do you have any insights into that (apparently custom) 6.6.52-win2030 kernel?

Cheers,
Lukas
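
P.S. In case it helps whoever has access to debci-30/31/32: below is a minimal sketch of how the three kernel options mentioned above could be checked. It is an illustration only, not something that exists in Netplan or debci. The helper names (missing_options, read_kernel_config) are made up, and it assumes the kernel config is readable either at /boot/config-<release> or at /proc/config.gz, which may well not be the case inside the LXC testbed, so it is probably best run directly on the host.

```python
import gzip
import os
import platform

# Options named in this mail; whether they are really the missing ones is a guess.
REQUIRED = ("CONFIG_IP_MULTIPLE_TABLES", "CONFIG_NET_IPGRE", "CONFIG_IPV6_GRE")


def missing_options(config_text, required=REQUIRED):
    """Return the options from `required` that are neither =y nor =m."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as "# CONFIG_FOO is not set" and are skipped.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


def read_kernel_config():
    """Return the running kernel's config as text, or None if unavailable."""
    path = "/boot/config-" + platform.release()
    if os.path.exists(path):
        with open(path, encoding="utf-8", errors="replace") as f:
            return f.read()
    if os.path.exists("/proc/config.gz"):
        with gzip.open("/proc/config.gz", "rt", encoding="utf-8", errors="replace") as f:
            return f.read()
    return None


if __name__ == "__main__":
    text = read_kernel_config()
    if text is None:
        print("kernel config not readable on this host")
    else:
        print("missing:", missing_options(text) or "none")
```

Note that an option set to "=m" only tells us the module was built. The GRE modules could still be absent or not loadable on the host, so an empty "missing" list would not fully rule out the tunnel failures.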