[+CC pabs@d.o for autopkgtest infrastructure questions]

On Sat, 02 Nov 2024 12:46:29 +0000 Luca Boccassi <bl...@debian.org> wrote:
> Dear maintainer(s),
>
> The netplan.io autopkgtest on riscv64 fail roughly 50% of the runs. As
> per RT, consistently flaky autopkgtest are RC. They seem to be all
> timeouts, so probably due to riscv64 test machines being very slow, but
> this is just a guess.
>
> https://ci.debian.net/packages/n/netplan.io/testing/riscv64/

Hi Luca, thanks for bringing up the issue!

Looking into those timeouts a bit more closely, I don't think they are
caused by the slowness of the riscv64 runners. The Netplan tests seem to
fail consistently, but only on very specific hosts, namely
"debci-{30,31,32}", while they pass on every other DebCI host/runner.
So the green runs simply got lucky and were scheduled on a working host.

The common theme between debci-30/31/32 seems to be the kernel, which is
a bit older and doesn't seem to be an official Debian kernel:

"testbed running kernel: Linux 6.6.52-win2030 #2024.09.27.00.18+24089c696 SMP Fri Sep 27 00:27:03 UTC 2024"

Also, for the "flaky" routing tests I could find additional logs, like:

2334s ======================================================================
2334s FAIL: test_route_type_local_lp1892272 (__main__.TestNetworkd.test_route_type_local_lp1892272)
2334s ----------------------------------------------------------------------
2334s Traceback (most recent call last):
2334s   File "/tmp/autopkgtest-lxc.ieq7nj3w/downtmp/build.iS2/src/tests/integration/routing.py", line 433, in test_route_type_local_lp1892272
2334s     self.assertIn(b'local default',
2334s AssertionError: b'local default' not found in b''


Or:

2275s test_route_with_policy (__main__.TestNetworkd.test_route_with_policy) ... eth42 ............................................................● 105: eth42
2275s                    Link File: /usr/lib/systemd/network/99-default.link
2275s                 Network File: /run/systemd/network/10-netplan-ethbn.network
2275s                        State: routable (failed)
2275s                 Online state: online
2275s                         Type: ether
2275s                         Kind: veth
2275s                       Driver: veth
2275s             Hardware Address: fa:10:fc:ec:42:2e
2275s                          MTU: 1500 (min: 68, max: 65535)
2275s                        QDisc: noqueue
2275s IPv6 Address Generation Mode: eui64
2275s     Number of Queues (Tx/Rx): 4/4
2275s             Auto negotiation: no
2275s                        Speed: 10Gbps
2275s                       Duplex: full
2275s                         Port: tp
2275s                      Address: 10.20.10.1
2275s                               fe80::f810:fcff:feec:422e
2275s            Activation Policy: up
2275s          Required For Online: yes
2275s           DHCPv6 Client DUID: DUID-EN/Vendor:0000ab11e42fd7030b558586
2275s
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Link UP
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Gained carrier
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Could not add routing policy rule: Rule family not supported. Address family not supported by protocol
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Failed
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Trying to reconfigure the interface.
2275s Nov 02 13:17:46 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Configuring with /run/systemd/network/10-netplan-ethbn.network.
2275s Nov 02 13:17:47 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Could not add routing policy rule: Rule family not supported. Address family not supported by protocol
2275s Nov 02 13:17:47 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Failed


This indicates that it's not actually a timeout (due to slowness), but
rather that systemd-networkd failed to configure the interface properly.
The log message suggests that the kernel (via netlink) might not support
the required functionality (i.e. policy routing with multiple routing
tables):
"Nov 02 13:17:47 ci-307-aa8d17e8 systemd-networkd[6978]: eth42: Could not add routing policy rule: Rule family not supported. Address family not supported by protocol"

Maybe we're lacking something like "CONFIG_IP_MULTIPLE_TABLES" in this
specific kernel? That would also explain the empty stdout of the
"ip route show ..." command in the earlier
"test_route_type_local_lp1892272" test.

Similarly, the "flaky" tunnels test logs point to a lack of GRE
functionality in this kernel (maybe missing modules?).

For NetworkManager:
3368s .Error: Device 'tun0' not found.
3369s .Error: Device 'tun0' not found.
3370s .Error: Device 'tun0' not found.
[...]

For systemd-networkd:
3498s .Interface "tun0" not found.
3499s .Interface "tun0" not found.
3500s .Interface "tun0" not found.
[...]

The Netplan test just seems to be unable to create any GRE or GRE6
interface. Do we have CONFIG_IPV6_GRE and CONFIG_NET_IPGRE on this
kernel?
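
To make that question easy to answer on an affected host, here is a
minimal Python sketch that checks a kernel config for the three options
mentioned above. The function names are my own, and it assumes the config
is readable at /proc/config.gz or /boot/config-<release>, which may not be
the case on the debci hosts (or inside the LXC testbed):

```python
import gzip
import os
import platform

# The three options discussed in this mail.
REQUIRED = ("CONFIG_IP_MULTIPLE_TABLES", "CONFIG_NET_IPGRE", "CONFIG_IPV6_GRE")


def missing_kernel_options(config_text, required=REQUIRED):
    """Return the options from `required` that are not set to y or m."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options show up as "# CONFIG_FOO is not set" comments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


def read_kernel_config():
    """Return the running kernel's config text, or None if unavailable."""
    # Assumed locations; neither is guaranteed to exist.
    candidates = ["/proc/config.gz", "/boot/config-" + platform.release()]
    for path in candidates:
        if not os.path.exists(path):
            continue
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as fh:
            return fh.read()
    return None


if __name__ == "__main__":
    text = read_kernel_config()
    if text is None:
        print("kernel config not readable on this host")
    else:
        print("missing:", missing_kernel_options(text) or "nothing")
```

An option built as a module ("=m") counts as present here, although the
module would still need to be installed and loadable on the host.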

I could probably detect such cases and skip over those tests in
Netplan. But I think this would degrade the usefulness of those tests.
Autopkgtests of other packages relying on GRE/6 or routing tables
will most probably also fail on those DebCI test runners.
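
For reference, such a skip could look roughly like the sketch below. It is
not what the Netplan test suite does today; the helper and class names are
made up, and it rests on the untested assumption that "ip rule show" exits
non-zero on a kernel without policy routing support:

```python
import subprocess
import unittest


def command_succeeds(cmd):
    """Return True if cmd can be started and exits with status 0."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem or a hang: treat as unsupported.
        return False
    return result.returncode == 0


# Assumption (untested): this fails when the kernel lacks policy routing.
HAVE_POLICY_ROUTING = command_succeeds(["ip", "rule", "show"])


class TestRoutingSketch(unittest.TestCase):  # hypothetical stand-in class
    @unittest.skipUnless(HAVE_POLICY_ROUTING, "kernel lacks policy routing")
    def test_route_with_policy(self):
        pass  # the real test body would go here
```

A probe like this would turn the failures on debci-30/31/32 into skips,
which is exactly the loss of coverage I'd like to avoid.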

IMO we should be able to expect a relatively consistent DebCI
environment, independent of the specific runner. As the Netplan
tests pass on most other runners (riscv64 and other arches), I
think we should rather fix (or disable?) those debci-30/31/32 hosts.

Pabs, what do you think about this? Do you have any insights into
that (apparently custom) 6.6.52-win2030 kernel?

Cheers,
  Lukas
