Hi,
On 6/27/2025 3:54 PM, Robin Murphy wrote:
> +Vasant
>
> On 2025-06-27 6:39 am, Baochen Qiang wrote:
>> [+ IOMMU list]
>>
>> On 6/27/2025 12:21 AM, Matt Mower wrote:
>>> Dear maintainer,
>>>
>>> I have been experiencing lost network connection with the ath12k_pci driver
>>> in the linux-6.12.y kernel branch. Often, when the issue occurs, the
>>> network does not recover until I reboot the computer. A full report of the
>>> errors I encounter, the symptoms that arise, and several dmesg attachments
>>> are in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1107521 . I have
>>> attached a dmesg from 6.12.34 for convenience. The short summary is:
>>>
>>> 1. I started noticing log lines like the following soon after boot when I
>>> updated from 6.12.22 to 6.12.27. After these events occur, the network goes
>>> down and often does not come back up.
>>> ath12k_pci 0000:c2:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
>>> domain=0x0010 address=0xfea00000 flags=0x0020]
>>> 2. I was able to reproduce this issue very rarely in 6.12.12 and 6.12.22.
>>> The issue always occurs soon after boot in 6.12.27, 6.12.30, 6.12.33, and
>>> 6.12.34.
>>> 3. I have not reproduced the issue in 6.15.2 or 6.15.3.
>>> 4. In some cases, when shutting down the computer, a kernel bug caused my
>>> computer to hang. I haven't determined whether this is related to the issue
>>> above or an independent issue. Search the bug report
>>> for PXL_20250611_140820085.jpg to see a picture of the kernel bug on my
>>> laptop screen.
>>> 5. I have tested two firmware versions:
>>> a. fw_version 0x1108811c fw_build_timestamp 2025-05-17 00:21 fw_build_id
>>> QC_IMAGE_VERSION_STRING=WLAN.HMT.1.1.c5-00284.1-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
>>> b. fw_version 0x100301e1 fw_build_timestamp 2023-12-06 04:05 fw_build_id
>>> QC_IMAGE_VERSION_STRING=WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
>>>
>>> Thanks,
>>> Matt
>>>
>>
>> I had a quick test with 6.12.27 kernel on both my Intel desktop and AMD RD
>> but didn't hit the issue. And I am using
>> WLAN.HMT.1.1.c5-00284.1-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3.
>>
>> As mentioned in the Debian bug report, since reverting ath12k patches does
>> not fix this issue, maybe it comes from the IOMMU subsystem?
>
> Faults are usually still indicative of the client driver/subsystem doing
> something not quite right - racily performing dma_unmap before the device has
> actually finished making accesses; mapping the wrong size such that the device
> accesses off the end of the mapping (this can often run into another valid
> mapping so not necessarily fault); mapping the wrong DMA direction such that
> the device then tries to write to a read-only page. However I suppose it's not
> impossible that some fix to amd-iommu in that period might have changed its
> behaviour in a way that exacerbates things - Vasant, does this strike a chord
> with anything you're aware of?

I looked into the kernel code and the changes between v6.12.9..v6.12.22.
There are only two changes in the AMD IOMMU driver:

40c731472f41 iommu/amd: Expicitly enable CNTRL.EPHEn bit in resume path
  -> This one was needed to fix the suspend/resume issue. It just adjusts a
     control bit after suspend. It does not touch the page table.

6e1e451456e1 iommu/amd: Remove unused amd_iommu_domain_update()
  -> Code cleanup patch.

Looking at the lspci output, only `c2:00.0` is placed in group 15 and domain ID
0x10. I believe there is only one device in this domain.

Interpreting the IO_PAGE_FAULT flags, 0x20 means it was a write request for a
page that was not present.

So at this point I would still suspect the device driver side rather than the
IOMMU side.
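
For anyone who wants to check the flag interpretation themselves, here is a
minimal sketch of the decoding. The bit values are the EVENT_FLAG_* constants
as I recall them from drivers/iommu/amd/amd_iommu_types.h, so treat them as an
assumption and verify against your tree. The helper name
decode_io_page_fault_flags is made up for illustration.

```python
# Assumed bit values (believed to match the EVENT_FLAG_* definitions in
# drivers/iommu/amd/amd_iommu_types.h; verify against your kernel tree).
EVENT_FLAG_I = 0x008   # interrupt request rather than memory request
EVENT_FLAG_PR = 0x010  # page was present (fault is a permission problem)
EVENT_FLAG_RW = 0x020  # access was a write
EVENT_FLAG_TR = 0x100  # transaction was a translation request


def decode_io_page_fault_flags(flags: int) -> dict:
    """Hypothetical helper: split the dmesg 'flags=' value into named bits."""
    return {
        "interrupt": bool(flags & EVENT_FLAG_I),
        "present": bool(flags & EVENT_FLAG_PR),
        "write": bool(flags & EVENT_FLAG_RW),
        "translation": bool(flags & EVENT_FLAG_TR),
    }


if __name__ == "__main__":
    # flags=0x0020 is the value from the reported log line.
    for name, value in decode_io_page_fault_flags(0x0020).items():
        print(f"{name}: {value}")
```

With flags=0x0020 only the write bit is set and the present bit is clear,
which is the "write to a non-present page" reading given above.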
>
> A couple more things I'd try on the ath12k side: firstly, boot with
> "iommu.strict=1" and see if that makes the faults any more
> frequent/reproducible; if a fault is fairly easily reproducible, then use the
> DMA API and/or IOMMU API tracepoints to compare the fault address to prior DMA
> mapping activity - that can usually reveal the nature of the bug enough to
> then know what to go looking for.
>
> I wouldn't put much significance in whatever happens *after* the fault -
> presumably the driver is assuming the blocked DMA write has completed, so then
> goes on to read some incomplete descriptor as if it were valid, and thus may
> fall over in all manner of entertaining ways on bogus data.

Thanks Robin. I'd suggest following these suggestions.

-Vasant
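
As a rough illustration of comparing a fault address against prior mapping
activity, the sketch below replays a list of map/unmap events and reports
which of the failure modes Robin listed the fault most resembles. Everything
here is hypothetical: the (kind, iova, size) event tuples stand in for
whatever you extract by hand from the tracepoint output, the function name
classify_fault is invented, and the 4 KiB "just past the end" window is an
arbitrary choice. It does not parse any real trace format.

```python
from typing import List, Tuple

# Hypothetical event record: (kind, iova, size), kind is "map" or "unmap".
Event = Tuple[str, int, int]


def classify_fault(fault_addr: int, events: List[Event], slack: int = 4096) -> str:
    """Replay events seen before the fault and classify the faulting address."""
    live = {}   # iova -> size for mappings still in place
    freed = []  # (iova, size) for mappings that were already unmapped
    for kind, iova, size in events:
        if kind == "map":
            live[iova] = size
        elif kind == "unmap" and iova in live:
            freed.append((iova, live.pop(iova)))

    # Still mapped: the page exists, so suspect direction/permissions.
    for iova, size in live.items():
        if iova <= fault_addr < iova + size:
            return "inside live mapping (check DMA direction/permissions)"
    # Was mapped, now gone: device may still have been using it.
    for iova, size in freed:
        if iova <= fault_addr < iova + size:
            return "inside unmapped range (possible unmap-before-completion race)"
    # Just beyond a live mapping: device may have run off the end.
    for iova, size in live.items():
        if iova + size <= fault_addr < iova + size + slack:
            return "just past end of live mapping (possible wrong mapping size)"
    return "no matching mapping found"


if __name__ == "__main__":
    # Invented example data, not taken from the reporter's logs.
    trace = [("map", 0x10000, 0x1000), ("unmap", 0x10000, 0x1000)]
    print(classify_fault(0x10800, trace))
```

A non-present write fault like the one reported would be expected to land in
either the second or third category if the driver is at fault, but that is
only a guess until real trace data is compared.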