Public bug reported:
On an Intel platform where VT-d is brought up in scalable mode, the kdump/kexec
crash kernel faults on legitimate device DMA. The Intel IOMMU has no present
translation for the physical addresses devices use for DMA in the crash kernel,
so the affected drivers fail to initialize. This makes kdump unreliable: the
crash kernel cannot complete normal device init or vmcore capture.
The failure is NOT specific to one driver -- two unrelated vendors' drivers fail
identically on the same machine (a storage HBA and a Broadcom bnx2x NIC), both
with the same scalable-mode DMAR fault. Disabling the IOMMU in the crash kernel
(intel_iommu=off) fully resolves it (20/20 kdump iterations pass).
Representative fault (storage HBA at 0000:8b:00.0):
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read NO_PASID] Request device [8b:00.0] fault addr 0xffffe000
[fault reason 0x71] SM: Present bit in first-level paging entry is clear
"SM:" + fault reason 0x71 indicates VT-d scalable mode and a not-present
first-level paging entry for the requested address.
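For triage across many console captures, the fault record above can be pulled apart mechanically. The following is only a sketch: it assumes the log layout matches the lines quoted in this report (other kernel versions may format the record differently), and the function name parse_dmar_fault is hypothetical, not part of any existing tool.

```python
import re

# Assumption: the record looks like the 6.8 log lines quoted in this report.
FAULT_RE = re.compile(
    r"\[DMA (?P<op>Read|Write) (?P<pasid>[^\]]+)\] "
    r"Request device \[(?P<bdf>[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\] "
    r"fault addr (?P<addr>0x[0-9a-fA-F]+) "
    r"\[fault reason (?P<reason>0x[0-9a-fA-F]+)\] (?P<sm>SM:)?"
)


def parse_dmar_fault(log: str):
    """Return a dict describing the first DMAR fault in `log`, or None."""
    # Collapse line wraps so a record split over several lines still matches.
    flat = " ".join(log.split())
    m = FAULT_RE.search(flat)
    if m is None:
        return None
    return {
        "op": m.group("op"),
        "pasid": m.group("pasid"),
        "bdf": m.group("bdf"),
        "addr": int(m.group("addr"), 16),
        "reason": int(m.group("reason"), 16),
        # The "SM:" prefix on the reason text marks a scalable-mode fault.
        "scalable_mode": m.group("sm") is not None,
    }
```

Run against the representative fault, this yields device 8b:00.0, address 0xffffe000, reason 0x71, with the scalable-mode flag set.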
ENVIRONMENT
--------------------------------------------------------------------------------
OS: Ubuntu 24.04.4 LTS (Noble)
Kernel (reproduces): 6.8.0-117-generic #117-Ubuntu SMP PREEMPT_DYNAMIC
Arch: x86_64
System: HPE ProLiant DL380 Gen11 (Intel platform, VT-d)
BIOS: 2.80 (also reproduced after update to 2.82)
IOMMU: Intel VT-d, scalable mode active (see fault reason 0x71 / SM:)
Crash tooling: kdump-tools (kexec crash kernel)
Devices observed failing in crash kernel:
- Storage HBA at PCI 0000:8b:00.0 (vendor driver "smartxlr")
- Broadcom NIC at PCI 0000:23:00.1 (driver "bnx2x")
IMPACT / SEVERITY
--------------------------------------------------------------------------------
High. kdump is effectively non-functional on affected platforms because devices
that need DMA during crash-kernel boot fail. Reproduced with two independent
DMA-capable devices, so the impact is general, not device-specific.
STEPS TO REPRODUCE
--------------------------------------------------------------------------------
On an HPE ProLiant DL380 Gen11 (or any platform that enables Intel VT-d scalable
mode) running Ubuntu 24.04.4 with linux 6.8.0-117-generic and kdump-tools:
1. Configure and enable kdump (kdump-tools).
2. Trigger a kernel crash, e.g.:
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
3. Watch the crash-kernel console (serial/SOL).
Note: an active I/O workload is NOT required. The original report had storage
I/O running at crash time, but the failure also reproduces with no I/O workload.
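Before step 2 it can help to confirm that a crash kernel is actually loaded, so that a failed dump is not mistaken for this bug. A minimal sketch, assuming the standard sysfs flag /sys/kernel/kexec_crash_loaded is available (it reads "1" when a crash kernel is loaded); the helper names are hypothetical.

```python
from pathlib import Path


def parse_flag(text: str) -> bool:
    """Interpret the contents of a sysfs boolean flag file."""
    return text.strip() == "1"


def crash_kernel_loaded(path: str = "/sys/kernel/kexec_crash_loaded") -> bool:
    """Return True if the sysfs flag reports a loaded crash kernel."""
    try:
        return parse_flag(Path(path).read_text())
    except OSError:
        # A missing or unreadable file is treated as "not loaded".
        return False


if __name__ == "__main__":
    print("crash kernel loaded:", crash_kernel_loaded())
```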
EXPECTED RESULT
--------------------------------------------------------------------------------
With Intel VT-d enabled (scalable mode), the kdump crash kernel can perform
device DMA -- either by carrying over the prior kernel's translations or safely
re-establishing them -- so drivers initialize and the vmcore is captured.
ACTUAL RESULT
--------------------------------------------------------------------------------
The crash kernel's IOMMU has no present translation for device DMA buffers
(fault reason 0x71, scalable mode). DMA is blocked and DMA-capable drivers fail.
EVIDENCE
--------------------------------------------------------------------------------
1) The IOMMU fault address equals the exact physical address the driver passed.
A debug build of the storage driver logged the buffer physaddr it hands the
controller for its SIS->PQI init command, immediately before the fault:
XlrInitSisBaseStructAddress(): BufferPhysAddress 0xffffe000 ...
XlrSisSendSyncCmd(): command 0000001b Params[1]=0xffffe000 ...
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read NO_PASID] Request device [8b:00.0] fault addr 0xffffe000
[fault reason 0x71] SM: Present bit in first-level paging entry is clear
Driver passes 0xffffe000; IOMMU faults on 0xffffe000. The address is valid;
the crash kernel simply has no present translation for it.
2) A second, unrelated device (bnx2x NIC) faults identically in the
crash kernel:
DMAR: [DMA Read NO_PASID] Request device [23:00.1] fault addr 0xff6b8000
[fault reason 0x71] SM: Present bit in first-level paging entry is clear
bnx2x: [bnx2x_issue_dmae_with_comp:563(ens2f1)]DMAE timeout!
bnx2x: [bnx2x_write_dmae:611(ens2f1)]DMAE returned failure -1
... (repeats for >90 s) ...
bnx2x: [bnx2x_send_final_clnup:1423(ens2f1)]FW final cleanup did not succeed
bnx2x: [bnx2x_panic_dump:929(ens2f1)]begin crash dump ----------------
Same fault reason (0x71, SM:), different vendor. Rules out a driver-specific
defect; points at platform VT-d translation state in the crash kernel.
3) Disabling the IOMMU in the crash kernel resolves it.
Adding intel_iommu=off to the CRASH KERNEL command line only (production
kernel unchanged): 20 iterations run, 20 passed, 0 failed.
4) Does not reproduce on other kernels / platforms (same driver + same test):
- Lenovo ThinkSystem SR650 V3, Ubuntu 24.04.4, 6.17.0-x-generic (HWE) -- PASS
- HPE ProLiant DL385 Gen10 Plus, Debian 13, 6.12.x -- PASS
- HPE ProLiant DL380 Gen10 Plus, Ubuntu 24.04.4, 6.8.0-124-generic -- PASS
Consistent with a scalable-mode crash-kernel handling gap that depends on the
platform enabling VT-d scalable mode. Notably 6.17 passes, suggesting a fix
may already exist upstream that needs backporting to the Ubuntu 6.8 GA
kernel.
WORKAROUND
--------------------------------------------------------------------------------
Append intel_iommu=off to the CRASH KERNEL command line only (kdump-tools):
edit KDUMP_CMDLINE_APPEND in /etc/default/kdump-tools (or
/etc/default/grub.d/kdump-tools.cfg), then reload:
sudo kdump-config unload && sudo kdump-config load
Confirmed: 20/20 kdump iterations pass. This affects only the crash kernel; the
production kernel keeps VT-d enabled.
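The edit to KDUMP_CMDLINE_APPEND can be scripted so that it is idempotent across test machines. This is a sketch under stated assumptions: it operates on the file contents as a string, it assumes the variable is already set on an uncommented line using double quotes, and it does not handle a commented-out default. The function name append_crash_param and the sample values in the usage note are illustrative only.

```python
import re

# Matches an uncommented, double-quoted assignment on a single line.
ASSIGN_RE = re.compile(
    r'^(KDUMP_CMDLINE_APPEND=")([^"]*)(")[ \t]*$', re.MULTILINE
)


def append_crash_param(config_text: str, param: str = "intel_iommu=off") -> str:
    """Return config_text with `param` present in KDUMP_CMDLINE_APPEND."""

    def _add(m):
        words = m.group(2).split()
        if param not in words:  # keep the edit idempotent
            words.append(param)
        return m.group(1) + " ".join(words) + m.group(3)

    new_text, count = ASSIGN_RE.subn(_add, config_text)
    if count == 0:
        raise ValueError("no uncommented KDUMP_CMDLINE_APPEND assignment found")
    return new_text
```

For example, given a line KDUMP_CMDLINE_APPEND="reset_devices nr_cpus=1" (illustrative values), the result is KDUMP_CMDLINE_APPEND="reset_devices nr_cpus=1 intel_iommu=off"; running it a second time leaves the text unchanged. The kdump-config unload/load step above is still needed afterwards.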
** Affects: linux (Ubuntu)
Importance: Undecided
Status: New
--
https://bugs.launchpad.net/bugs/2157683
Title:
DMAR: [DMA Read NO_PASID] Request device [8b:00.0] fault addr
0xffffe000 [fault reason 0x71] SM: Present bit in first-level paging
entry is clear