This was brought to my attention: http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04805565
While it has no obvious relation to why the failure would be triggered by
the IOMMU (it should isolate accesses, not link them together, right?), it
might be worth doing the FW upgrade to verify whether it fixes the issue.
I'll report back once I have been able to do so.

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1679208

Title:
  Zesty (4.10.0-14) won't boot on HP ProLiant DL360 Gen9 with intel_iommu=on

Status in linux package in Ubuntu:
  Triaged

Bug description:
  TL;DR
  - one of our HP ProLiant DL360 Gen9 systems fails to boot with intel_iommu=on
  - the disk controller fails
  - Xenial seems to work for a while but then fails
  - Zesty crashes on boot 100% of the time
  - an identical system seems to work, so a HW replacement is needed to finally confirm

  After the reboot the HW reports this on boot:
    Embedded RAID : Smart HBA H240ar Controller - Operation Failed
    - 1719-Slot 0 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13)

  I tried several things (in between I always redeployed zesty with MAAS).
  I think my debugging might be helpful, but I wanted to keep the
  documentation in the bug in case you'd go another route or others find
  useful information in here.

  0. I retried what I did twice, fully reproducible. That is:
     0.1 install zesty
     0.2 change grub default cmdline in /etc/default/grub.d/50- to add intel_iommu=on
     0.3 sudo update-grub
     0.4 reboot

  1. I tried a Recovery boot from the boot options in grub.
     => Failed as well

  2. Rebooted through iLO via "request reboot" and also via "full system reset"
     => both Failed

  3. Reboot the system as deployed by MAAS
     # /proc/cmdline before that
     BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
     The orig grub.cfg is like http://paste.ubuntu.com/24305945/
     It reboots as-is.
     => Reboot worked

  4.
     Without a change to anything in /etc, run update-grub:
     $ sudo update-grub
     Generating grub configuration file ...
     Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
     Found linux image: /boot/vmlinuz-4.10.0-14-generic
     Found initrd image: /boot/initrd.img-4.10.0-14-generic
     Adding boot menu entry for EFI firmware configuration
     done
     There was no diff between the new grub.cfg and the one I saved.
     => Reboot worked

  5. Add the intel_iommu=on arg
     $ sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT=""/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"/' /etc/default/grub.d/50-curtin-settings.cfg
     $ sudo update-grub
     # Diff in grub.cfg really only is the iommu setting
     => Reboot Failed

  So this doesn't seem so much of a cloud-init/curtin/maas bug anymore to me
  - maybe intel_iommu behaves differently?
  - Check grub cfg pre/post - no change but the expected one?

  6. Install Xenial and do the same
     => Reboot working

  7. Upgrade to Z
     Since the Xenial system just worked, and one can assume that almost
     only the kernel is active this early in the boot process, I upgraded
     the working system with intel_iommu=on to Zesty.
     That would be 4.4.0-71-generic to 4.10.0-1
     On this upgrade I finally saw my I/O errors again :-/
     Note: these issues are hard to miss as they mount root as read-only.
     I wonder if they only ever appear with intel_iommu=on, as this is the
     only combo in which I ever saw them.

  8. Redeploy and upgrade to Z without intel_iommu=on enabled.
     Then enable intel_iommu=on and reboot
     => Reboot Fail

     From here I rebooted into the Xenial kernel (which was still there,
     since this is an upgrade). Here I saw:
       Loading Linux 4.4.0-71-generic ...
       Loading initial ramdisk ...
       error: invalid video mode specification `text'.
       Booting in blind mode
     Hrm, as outlined above the "blind mode" might be a red herring, but
     since this kernel worked before it might still be a red herring that
     swims in the initrd that got regenerated on the upgrade.
     => Xenial Kernel Reboot - works !!
     So "blind mode" is a red herring of some sort.
     But this might allow finding some logs => No
     It appears that the failing boot never got to the point of actually
     writing anything. In the log I see:
       1. the original xenial
       2. the upgraded zesty
       3. NOT the zesty+iommu
       4. the xenial+iommu

     $ egrep 'kernel:.*(Linux version|Command line)' /var/log/syslog
     Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
     Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
     Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Linux version 4.10.0-14-generic (buildd@lcy01-01) (gcc version 6.3.0 20170221 (Ubuntu 6.3.0-8ubuntu1) ) #16-Ubuntu SMP Fri Mar 17 15:19:26 UTC 2017 (Ubuntu 4.10.0-14.16-generic 4.10.3)
     Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
     Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
     Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro intel_iommu=on

  9. Trying to avoid HW replacement if not needed
     I was afraid I might need the HW to be replaced to be 100% sure, but
     this very much smells broken in SW to me already.
     To avoid an RT ticket for a replacement without real need, I asked to
     have another system freed up. So I could finally free up an identical
     machine. I specifically compared the failing HP Smart Array; it has
     the same Product Version and FW revision.
     There, things seem to work, so I might be down to replacing the HW :-/

  10. Get some messages of the failure
     With the following grub cmdline I got to see the failure:
       GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on --- console=ttyS1,115200"

     It looks just like the one I found on the running system when
     intel_iommu=on is set on the Xenial kernel, where it happens later
     (sometimes minutes, sometimes days, but never without intel_iommu).
     But on zesty it seems to trigger 100% of the time on boot, so the
     system does not even come up.

     I'll attach a few logs of the crashes, but the heads are:
       [ 33.426069] hpsa 0000:03:00.0: Acknowledging event: 0x80000000 (HP SSD Smart Path configuration change)
       [ 618.567636] DMAR: DRHD: handling fault status reg 2
       [ 618.567922] DMAR: DMAR:[DMA Read] Request device [03:00.0] fault addr ffafc000 DMAR:[fault reason 06] PTE Read access is not set
     Or:
       [ 159.779566] hpsa 0000:03:00.0: Command timed out.
       [ 159.801113] hpsa 0000:03:00.0: hpsa_send_abort_ioaccel2: Tag:0x00000000:000000d0: unknown abort service response 0x00

  While it might be a HW issue, I am still filing this so that it is
  "findable" for anyone else in case it eventually turns out not to be the
  HW. But I assign myself for now to close/confirm once I have replaced
  the HW.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1679208/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
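
For anyone who wants to repeat the syslog check from step 8 on another machine, the egrep can be condensed into one line per recorded boot. This is only a rough sketch: list_boots is a hypothetical helper name, and it assumes the syslog format shown in the excerpts above, with each "Command line:" entry following its "Linux version" entry.

```shell
#!/bin/sh
# list_boots FILE
# Print one line per boot recorded in FILE: "<kernel release> iommu=<on|off>".
# list_boots is a hypothetical helper name, not an existing tool.
# Assumption: each "Command line:" entry follows its "Linux version" entry,
# as in the syslog excerpts quoted in the bug description.
list_boots() {
    awk '
        # Remember the kernel release from the "Linux version" banner.
        /kernel:.*Linux version / {
            sub(/.*Linux version /, "")
            version = $1
            next
        }
        # Report whether intel_iommu=on was on the kernel command line.
        /kernel:.*Command line: / {
            state = (index($0, "intel_iommu=on") > 0) ? "on" : "off"
            print version, "iommu=" state
        }
    ' "$1"
}
```

Called as `list_boots /var/log/syslog`, a boot that never got far enough to write to the log (like the zesty+iommu one in this report) simply has no line in the output, which is the gap to look for.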