On Thu, Feb 20, 2025 at 10:31:02AM +0100, Jürgen Groß wrote:
> On 20.02.25 10:16, Roger Pau Monné wrote:
> > On Wed, Feb 19, 2025 at 07:37:47PM +0100, Paweł Srokosz wrote:
> > > Hello,
> > > 
> > > > So the issue doesn't happen on debug=y builds? That's unexpected. I would
> > > > expect the opposite, that some code in Linux assumes that pfn + 1 == mfn + 1,
> > > > and hence breaks when the relation is reversed.
> > > 
> > > It was also surprising to me, but I think the key thing is that debug=y
> > > causes the whole mapping to be reversed, so each PFN lands on a completely
> > > different MFN, e.g. MFN=0x1300000 is mapped to PFN=0x20e50c in ndebug, but
> > > in debug it's mapped to PFN=0x5FFFFF. I guess that's why I can't reproduce
> > > the problem.
> > > 
> > > > Can you see if you can reproduce with dom0-iommu=strict in the Xen command
> > > > line?
> > > 
> > > Unfortunately, it doesn't help. But I have a few more observations.
> > > 
> > > Firstly, I checked the "xen-mfndump dump-m2p" output and found that the
> > > misread blocks are mapped to suspiciously round MFNs. I have different
> > > versions of Xen and the Linux kernel on each machine, and I see the same
> > > pattern on both.
> > > 
> > > I write a few huge files without Xen to ensure that they have been written
> > > correctly (because under Xen both read and writeback are affected). Then I
> > > boot into Xen, memory-map the files and read each page. I see that when a
> > > block is corrupted, it is mapped to a round MFN, e.g.
> > > pfn=0x5095d9/mfn=0x1600000, another at pfn=0x4095d9/mfn=0x1500000, etc.
> > > 
> > > On another machine with different Linux/Xen version these faults appear on
> > > pfn=0x20e50c/mfn=0x1300000, pfn=0x30e50c/mfn=0x1400000 etc.
> > > 
> > > I also noticed that during a read of the page that is mapped to
> > > pfn=0x20e50c/mfn=0x1300000, I'm getting these faults from DMAR:
> > > 
> > > ```
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200000000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200001000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200006000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200008000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 1200009000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000a000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > (XEN) [VT-D]DMAR:[DMA Write] Request device [0000:65:00.0] fault addr 120000c000
> > > (XEN) [VT-D]DMAR: reason 05 - PTE Write access is not set
> > > ```
> > 
> > That's interesting; it seems to me that Linux is assuming that pages
> > at certain boundaries are superpages, and thus that it can just
> > increase the mfn to get the next physical page.
> 
> I'm not sure this is true. See below.
> 
> > > and every time I drop the cache and read this region, I get DMAR faults on
> > > a few random addresses in the 1200000000-120000f000 range (I guess MFNs
> > > 0x1200000-0x120000f). MFNs 0x1200000-0x12000ff are not mapped to any PFN
> > > in Dom0 (based on the xen-mfndump output).
> > 
> > It would be very interesting to figure out where those requests
> > originate, iow: which entity in Linux creates the bios with the
> > faulting address(es).
> 
> I _think_ this is related to the kernel trying to get some contiguous areas
> for the buffers used by the I/Os. As those areas are being given back after
> the I/O, they don't appear in the mfndump.
> 
> > It's a wild guess, but could you try to boot Linux with swiotlb=force
> > on the command line and attempt to trigger the issue?  I wonder
> > whether imposing the usage of the swiotlb will surface the issues as
> > CPU accesses, rather than IOMMU faults, and that could get us a trace
> > inside Linux of how those requests are generated.
> > 
> > > On the other hand, I'm not getting these DMAR faults while reading other
> > > regions. Also, I can't trigger the bug with the reversed Dom0 mapping, even
> > > if I fill the page cache with reads.
> > 
> > There's possibly some condition we are missing that causes a component
> > in Linux to assume the next address is mfn + 1, instead of doing the
> > full address translation from the linear or pfn space.
> 
> My theory is:
> 
> The kernel sees the buffer as one physically contiguous area, so it is
> _not_ using a scatter-gather list (it does in the debug Xen case, which is
> why no errors show up there). Unfortunately the buffer is not aligned to
> its size, so swiotlb-xen will remap the buffer to a suitably aligned one.
> The driver will then use the returned machine address for I/Os to both
> devices of the RAID configuration. When the first I/O is done, the driver
> is probably already calling the DMA unmap or device sync function, causing
> the intermediate contiguous region to be destroyed again (this is when the
> DMAR errors should show up for the 2nd I/O still running).
> 
> So the main issue IMHO is that a DMA buffer mapped for one device is used
> for two devices instead.

But that won't cause IOMMU faults?  Because the memory used by the
bounce buffer would still be owned by dom0 (and thus part of its IOMMU
page-tables), just probably re-written to contain different data.

Or is the swiotlb contiguous region torn down after every operation?
That would seem extremely wasteful to me, I assume the buffer is
allocated during device init, and stays the same until the device is
detached.

Thanks, Roger.
