** Description changed:

  [ Impact ]

  Oracle Cloud provides users with baremetal instances, and two types of VM instances (native and paravirtualized). Native VMs and baremetal use ISCSI, while the paravirtualized VMs don't.

  Oracle requires a single image which can run in all instance types, so it's not possible to provide an image with ISCSI enabled only for the instances that boot from it. Our images set ISCSI_AUTO to be compatible with those. Additionally, clouds generally don't specify command line args at boot, so they can't simply enable or disable ISCSI on a per-instance basis.

  Oracle now has IPv6-only instances. On fully virtualized instances there is no IP configuration coming from ibft, so configure_networking() tries to get network information through DHCP in initramfs, starting with IPv4. That generates a significant delay (up to 5 minutes) when booting. Even the IPv6 address the instance gets is not useful, as the network can be configured later through cloud-init.

  The fix here skips configure_networking(), delegating it to cloud-init, and speeding up the boot process on Oracle Cloud instances.

  [ Test Plan ]

  Thanks to Alec Warren for the detailed test plan.

  1. Maintains current behaviour by default when cmdline arg is NOT set
     a. Test setup:
        - Ubuntu image (which uses ISCSI_AUTO mode) containing this change
        - New cmdline arg "iscsi_auto_skip_initramfs_networking" NOT set
        - Instance configurations:
          - non-ISCSI instance on Oracle Cloud (paravirtualized VM)
          - ISCSI instance on Oracle Cloud (native VM and Baremetal instance)
     b. Test Assertions: (edited; see comment #21)
        - Verified that the change does nothing and maintains current behavior
        - The echo call is NOT in the serial console logs during initramfs stage
        - Instance DOES have networking configured during initramfs and ephemeral networking is NOT needed by cloud-init
-         - Verifiable via cloud-init logs (states that network is configured and does not need to set up ephemeral network) (on non-ISCSI instances)
+         - Verifiable via cloud-init logs (states that network is configured and does not need to set up ephemeral network) (on ISCSI instances)
          - Verifiable by presence of /run/net-* files (these are created by configure_networking in initramfs) (on ISCSI instances)
+         - Verifiable via cloud-init logs (states that there is no networking from initramfs and sets up ephemeral network itself) (on non-ISCSI instances)
+         - Verifiable by no /run/net-* files existing (these would be created by configure_networking in initramfs) (on non-ISCSI instances)

  2. Does not break ISCSI use case on ISCSI instances when enabled via cmdline arg
     a. Test setup:
        - Ubuntu image (which uses ISCSI_AUTO mode) containing this change
        - New cmdline arg "iscsi_auto_skip_initramfs_networking" IS set using grub
        - Instance configuration:
          - ISCSI instance on Oracle Cloud (native VM and Baremetal instance)
     b. Test Assertions:
        - The echo call is NOT in the serial console logs during initramfs stage
        - Instance DOES have networking configured during initramfs and ephemeral networking is NOT needed by cloud-init
          - Verifiable via cloud-init logs (states that network is configured and does not need to set up ephemeral network)
          - Verifiable by presence of /run/net-* files (these are created by configure_networking in initramfs)

  3. Skips configuring networking on non-ISCSI instances when enabled via cmdline arg
     a. Test setup:
        - Ubuntu image (which uses ISCSI_AUTO mode) containing this change
        - New cmdline arg "iscsi_auto_skip_initramfs_networking" IS set using grub
        - Instance configuration:
          - non-ISCSI instance on Oracle Cloud (paravirtualized VM)
     b. Test Assertions:
        - The echo call IS present in the serial console logs during initramfs stage
        - Instance does NOT have networking configured during initramfs and ephemeral networking IS needed and set up by cloud-init
          - Verifiable via cloud-init logs (states that there is no networking from initramfs and sets up ephemeral network itself)
          - Verifiable by no /run/net-* files existing (these would be created by configure_networking in initramfs)
        - Boot speed is measurably faster than normal (~10-12s instead of the normal 20s+)

  [ Where problems could occur ]

  Because this change targets a bug in a specific scenario, the check explicitly applies only to instances where the flag is present, ISCSI_AUTO is set, and there is no ibft data in the system. Mistakes in the logic would make this change run in other scenarios, which is not the goal of this fix.

  Any mistake in trying to make this configuration completely opt-in would break existing instances in the sense that configure_networking() may not run when it should. To avoid that we explicitly check for the flag, and don't act if it is not set. The expected behavior can be verified using the test steps above.

  Usage-wise, if there is any mistake in setting the flag, the worst that can happen is that the code won't detect it as it should, the bug triggers, and users experience longer boot times, just as happens now without the change.

  [ Other Info ]

  As explained above, there is a requirement from Oracle Cloud that makes it impossible to just unset ISCSI configuration on the images when spinning non-ISCSI instances. This is the reason an opt-in flag is used to opt out of the network configuration.
  We know it may not be ideal, but this enables our cloud teams to set the flag on Oracle Cloud images without harming other users, who simply don't use it.

  This changeset has been forwarded to Debian, but on their side there were some questions and suggestions to improve the approach taken. If Debian ends up changing the way this situation is handled, we may change it in the development release to eliminate, or at least reduce, the delta which was introduced. However, no new SRUs should happen on this matter, as this change is considered maintainable for the foreseeable future.

  [ Original Description ]

  Cloud instances that configure network over DHCP in initramfs will go through a "for ROUNDTTT in 30 60 90 120" loop inside configure_networking().

  If the DHCP server is only offering an IPv6 address (no IPv4), the instance will take more than 5 minutes to boot, because it will first go through a loop trying to obtain an IPv4 address (dhcpcd -1KL -t $ROUNDTTT -4 ${DEVICE:+"${DEVICE}"}) for 30+60+90+120 seconds (300 seconds in total, i.e. 5 minutes). Each attempt fails and times out, and only then does the boot process resume.

  In https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/2091904 initramfs-tools improved this situation by looking for IPv6 information in /sys/firmware/ibft/ethernet*/ip-addr to decide whether to look for IPv6 or IPv4. However, that assumes that IP information will be available through ibft, which is not always true. If no IP information is available through ibft, we still go through this incorrect loop, delaying the boot process.

  Example from an instance booting through virtual disks, with no ibft, and IPv6-only on Oracle Cloud:

  ```
  [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.12.0-1001-oracle root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0 nvme.shutdown_timeout=10 libiscsi.debug_libiscsi_eh=1 crash_kexec_post_notifiers
  [...]
  Begin: Running /scripts/init-premount ... done.
  Begin: Mounting root file system ... Begin: Running /scripts/local-top ... [ 2.863248] No iBFT detected.
  Could not setup fw entries.
  Begin: Waiting up to 180 secs for any network device to become available ... done.
  dhcpcd-10.1.0 starting
  dev: loaded udev
  [ 2.906793] 8021q: 802.1Q VLAN Support v1.8
  [ 2.917496] 8021q: adding VLAN 0 to HW filter on device enp0s5
  DUID 00:03:00:01:02:00:17:36:95:6d
  enp0s5: IAID 17:36:95:6d
  enp0s5: carrier acquired
  enp0s5: IAID 17:36:95:6d
  [ 2.983134] workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
  enp0s5: soliciting a DHCP lease
  timed out
  exiting due to oneshot
  dhcpcd exited
  Sleeping 0 seconds before retrying getting a DHCP lease
  dhcpcd-10.1.0 starting
  dev: loaded udev
  DUID 00:03:00:01:02:00:17:36:95:6d
  enp0s5: IAID 17:36:95:6d
  enp0s5: soliciting a DHCP lease
  timed out
  exiting due to oneshot
  dhcpcd exited
  Sleeping 0 seconds before retrying getting a DHCP lease
  dhcpcd-10.1.0 starting
  dev: loaded udev
  DUID 00:03:00:01:02:00:17:36:95:6d
  enp0s5: IAID 17:36:95:6d
  enp0s5: soliciting a DHCP lease
  timed out
  exiting due to oneshot
  dhcpcd exited
  Sleeping 0 seconds before retrying getting a DHCP lease
  dhcpcd-10.1.0 starting
  dev: loaded udev
  DUID 00:03:00:01:02:00:17:36:95:6d
  enp0s5: IAID 17:36:95:6d
  enp0s5: soliciting a DHCP lease
  timed out
  exiting due to oneshot
  dhcpcd exited
  Sleeping 0 seconds before retrying getting a DHCP lease
  no search or nameservers found in /run/net-.conf /run/net-*.conf /run/net6-*.conf
  [ 303.057039] Loading iSCSI transport class v2.0-870.
  [ 303.069113] iscsi: registered transport (tcp)
  Could not get boot entry.
  done.
  ```

  Full log: https://pastebin.ubuntu.com/p/Sk5dcvpPyY/

  We can see this loop between lines 1136 and 1176.
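
The opt-in check described under [ Where problems could occur ] could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the actual change: the function name should_skip_networking, its three parameters, and the echoed messages are hypothetical, and treating "no ibft data" as "the ibft directory does not exist" is an assumption. Only the flag name iscsi_auto_skip_initramfs_networking and the /sys/firmware/ibft path come from the description above.

```shell
#!/bin/sh
# Hypothetical helper (NOT the real open-iscsi initramfs code).
# Returns 0 ("skip configure_networking") only when all three
# conditions from the description hold; otherwise returns 1.
should_skip_networking() {
    cmdline="$1"      # kernel command line, e.g. contents of /proc/cmdline
    iscsi_auto="$2"   # value of ISCSI_AUTO (non-empty means enabled; assumption)
    ibft_dir="$3"     # normally /sys/firmware/ibft

    # 1. The opt-in flag must appear as a whole word on the command line.
    case " $cmdline " in
        *" iscsi_auto_skip_initramfs_networking "*) ;;
        *) return 1 ;;
    esac

    # 2. ISCSI_AUTO must be set.
    [ -n "$iscsi_auto" ] || return 1

    # 3. There must be no ibft data (assumed: the directory is absent).
    [ ! -d "$ibft_dir" ] || return 1

    return 0
}

# Example use with the live system values; the messages are placeholders.
if should_skip_networking "$(cat /proc/cmdline 2>/dev/null)" \
        "${ISCSI_AUTO:-}" /sys/firmware/ibft; then
    echo "sketch: would skip configure_networking"
else
    echo "sketch: would run configure_networking"
fi
```

Because every condition falls through to "return 1", any mistake in setting the flag leaves the current behaviour in place, which matches the failure mode described above: configure_networking() still runs and the boot is merely slow.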
--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2098515

Title:
  [SRU] IPv6-only (single stack) instances configuring network over dhcp in initramfs will take a long time to boot due to loop in dhcpcd -4

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/open-iscsi/+bug/2098515/+subscriptions

--
ubuntu-bugs mailing list
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
