Today I wanted to try and instrument the boot process a bit, since we have no serial console in the nitro metal instances.
I was looking for pstore_blk (hoping we could panic_on_warn or panic_on_oops), but it's only available in 5.8+ it seems.) So I decided to start with grub, and keep a progress variable in grubenv, and use grub-reboot to boot 4.15.0-1113-aws _once_ (as it's expected to fail), then (force) stop and start again, and check grubenv in 5.4.0-*-aws (which works.) Interestingly, in one of such attempts 4.15.0-1113-aws WORKED. In another attempt, I could see the progress variable for the 4.15 _and_ 5.14 kernels, so it seems that grub booted 4.15 but it didn't make it to the fully booted system. (i.e., grub seems to be working correctly.) In the other attempts I noticed that once we try to boot 4.15, the system seems to become weird and not react quickly even to the 'Force stop' method (after you try 'Stop' that doesn't work.) ... So, since 4.15 worked/booted once, and the systems seem weird, and Ian just posted that he had a different result/questioned previous result (ie, it might well be a _different_ result), I wonder if somehow this particular instance type is acting up. Given that 4.15 worked/booted ~20 times under kexec, it's not unreasonable to consider there might be something going on in normal boot. I think we should probably engage AWS Support to try and ask for a console log using an internally available method (seen it elsewhere iirc), and also to clarify differences in boot disk among instace types r5.metal (fail), r5d.metal (works), and r5d.24xlarge (works) -- they all have EBS/nvme as '/'. -- You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws in Ubuntu. https://bugs.launchpad.net/bugs/1946149 Title: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal Status in linux-aws package in Ubuntu: New Bug description: When creating an r5.metal instance on AWS, the default kernel is bionic/linux-aws-5.4(5.4.0-1056-aws), when changing to bionic/linux- aws(4.15.0-1113-aws) the machine fails to boot the 4.15 kernel. If I remove these patches the instance correctly boots the 4.15 kernel https://lists.ubuntu.com/archives/kernel- team/2021-September/123963.html With that being said, after successfully updating to the 4.15 without those patches applied, I can then upgrade to a 4.15 kernel with the above patches included, and the instance will boot properly. This problem only appears on metal instances, which uses NVME instead of XVDA devices. AWS instances also use the 'discard' mount option with ext4, thought maybe there could be a race condition between ext4 discard and journal flush. Removed 'discard' from mount options and rebooted 5.4 kernel prior to 4.15 kernel installation, but still wouldn't boot after installing the 4.15 kernel. I have been unable to capture a stack trace using 'aws get-console- output'. After enabling kdump I was unable to replicate the failure. So there must be some sort of race with either ext4 and/or nvme. To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1946149/+subscriptions -- Mailing list: https://launchpad.net/~kernel-packages Post to : kernel-packages@lists.launchpad.net Unsubscribe : https://launchpad.net/~kernel-packages More help : https://help.launchpad.net/ListHelp