Hi Simon,

> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> functions of Intel GPUs
>
> Hi,
>
> since I'm late to the party I'll reply to the entire thread in one go.
>
> On Fri, Sep 19, 2025 at 06:22:45AM +0000, Kasireddy, Vivek wrote:
>
> > I think using a PCI BAR Address works just fine in this case because the Xe
> > driver bound to PF on the Host can easily determine that it belongs to one
> > of the VFs and translate it into VRAM Address.
>
> There are PCIe bridges that support address translation, and might apply
> different translations for different PASIDs, so this determination would
> need to walk the device tree on both guest and host in a way that does
> not confer trust to the guest or allow it to gain access to resources
> through race conditions.
>
> The difficulty here is that you are building a communication mechanism
> that bypasses a trust boundary in the virtualization framework, so it
> becomes part of the virtualization framework. I believe we can avoid
> that to some extent by exchanging handles instead of raw pointers.
>
> I can see the point in using the dmabuf API, because it integrates well
> with existing 3D APIs in userspace, although I don't quite understand
> what the VK_KHR_external_memory_dma_buf extension actually does, besides
> defining a flag bit -- it seems the heavy lifting is done by the
> VK_KHR_external_memory_fd extension anyway. But yes, we probably want
> the interface to be compatible with existing sharing APIs on the host
> side at least, to allow the guest's "on-screen" images to be easily
> imported.
>
> There is some potential for a shortcut here as well: giving these
> buffers directly to the host's desktop compositor instead of having an
> application react to updates by copying the data from the area shared
> with the VF to the area shared between the application and the
> compositor -- that would also be a reason to remain close to the
> existing interface.
>
> It's not entirely necessary for this interface to be a dma_buf, as long
> as we have a conversion between a file descriptor and a BO. On the
> other hand, it may be desirable to allow re-exporting it as a dma_buf if
> we want to access it from another device as well.
>
> I'm not sure that is a likely use case, though; even the horrible
> contraption I'm building here that has a Thunderbolt device send data
> directly to VRAM does not require that, because the guest would process
> the data and then send a different buffer to the host. Still, it would
> be nice for completeness.
>
> The other thing that seems to be looming on the horizon is that dma_buf
> is too limited for VRAM buffers, because once it's imported, it is
> pinned as well, but we'd like to keep it moveable (there was another
> thread on the xe mailing list about that). That might even be more
> important if we have limited BAR space, because then we might not want
> to make the memory accessible through the BAR unless it is imported by
> something that needs access through the BAR, which we've established the
> main use case doesn't (because it doesn't even need any kind of access).
>
> I think passing objects between trust domains should take the form of an
> opaque handle that is not predictable, and refers to an internal data
> structure with the actual parameters (so we pass these internally as
> well, and avoid all the awkwardness of host and guest having different
> world views).
> It doesn't matter if that path is slow; it should only be used rather
> seldom (at VM start and when the VM changes screen resolution).
>
> For VM startup, we probably want to provision guest "on-screen" memory
> and semaphores really early -- maybe it makes sense to just give each VF
> a sensible shared mapping like 16 MB (rounded up from 2*1080p*32bit) by
> default, and/or present a ROM with EFI and OpenFirmware drivers -- can
> VFs do that on current hardware?
>
> On Tue, Sep 23, 2025 at 05:53:06AM +0000, Kasireddy, Vivek wrote:
>
> > IIUC, it is a common practice among GPU drivers including Xe and Amdgpu
> > to never expose VRAM Addresses and instead have BAR addresses as DMA
> > addresses when exporting dmabufs to other devices.
>
> Yes, because that is how the other devices access that memory.
>
> > The problem here is that the CPU physical (aka BAR Address) is only
> > usable by the CPU.
>
> The address you receive from mapping a dma_buf for a particular device
> is not a CPU physical address, even if it is identical on pretty much
> all PC hardware, because it is uncommon to configure the root bridge
> with a translation there.
>
> On my POWER9 machine, the situation is a bit different: a range in the
> lower 4 GB is reserved for 32-bit BARs, the memory with those physical
> addresses is remapped so it appears after the end of physical RAM from
> the point of view of PCIe devices, and the 32-bit BARs appear at the
> base of the PCIe bus (after the legacy ports).
>
> So, as an example (reality is a bit more complex :> ), the memory map
> might look like
>
> 0000000000000000..0000001fffffffff  RAM
> 0060000000000000..006001ffffffffff  PCIe domain 1
> 0060020000000000..006003ffffffffff  PCIe domain 2
> ...
>
> and the phys_addr_t I get on the CPU refers to this mapping. However, a
> device attached to PCIe domain 1 would see
>
> 0000000000000000..000000000000ffff  Legacy I/O in PCIe domain 1
> 0000000000010000..00000000000fffff  Legacy VGA mappings
> 0000000000100000..000000007fffffff  32-bit BARs in PCIe domain 1
> 0000000080000000..00000000ffffffff  RAM (accessible to 32-bit devices)
> 0000000100000000..0000001fffffffff  RAM (requires 64-bit addressing)
> 0000002000000000..000000207fffffff  RAM (CPU physical address 0..2 GB)
> 0060000080000000..006001ffffffffff  64-bit BARs in PCIe domain 1
> 0060020000000000..006003ffffffffff  PCIe domain 2
>
> This allows 32-bit devices to access other 32-bit devices on the same
> bus, and (some) physical memory, but we need to sacrifice the 1:1
> mapping for host memory. The actual mapping is a bit more complex,
> because 64-bit BARs get mapped into the "32-bit" space to keep them
> accessible for 32-bit cards in the same domain, and this would also be a
> valid reason not to extend the BAR size even if we can.
>
> The default 256 MB aperture ends up in the "32-bit" range, so unless the
> BAR is resized and reallocated, the CPU and DMA addresses for the
> aperture *will* differ.
>
> So when a DMA buffer is created that ends up in the first 2 GB of RAM,
> the dma_addr_t returned for this device will have 0x2000000000 added to
> it, because that is the address that the device will have to use, and
> DMA buffers for 32-bit devices will be taken from the 2 GB..4 GB range
> because neither the first 2 GB nor anything beyond 4 GB is accessible
> to this device.
>
> If there is a 32-bit BAR at 0x10000000 in domain 1, then the CPU will
> see it at 0x60000010000000, but mapping it from another device in the
> same domain will return a dma_addr_t of 0x10000000 -- because that is
> the address that is routeable in the PCIe fabric, this is the BAR
> address configured into the device so it will actually respond, and the
> TLP will not leave the bus because it is downstream of the root bridge,
> so it does not affect the physical RAM.
>
> Actual numbers will be different to handle even more corner cases and I
> don't remember exactly how many zeroes are in each range, but you get
> the idea -- and this is before we've even started creating virtual
> machines with a different view of physical addresses.

Thank you for taking the time to explain in detail how the memory map
and PCI addressing mechanism works.
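
Just to confirm that I am reading the BAR example correctly, below is a
rough sketch of how I picture an importing device obtaining the bus
address for a region of the exporter's BAR via the DMA mapping API. The
function and parameter names (map_peer_bar, exporter_pdev, importer_pdev,
bar_off, len) are only placeholders for illustration and are not taken
from the actual patches:

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/*
 * Illustrative only: map a region of the exporter's BAR 0 for DMA by the
 * importer.  The dma_addr_t returned is the address that is routeable
 * from the importer's position in the PCIe fabric, which is not
 * necessarily the CPU physical address of the BAR (on the POWER9 layout
 * above, the two differ).
 */
static dma_addr_t map_peer_bar(struct pci_dev *exporter_pdev,
                               struct pci_dev *importer_pdev,
                               resource_size_t bar_off, size_t len)
{
        phys_addr_t bar_phys = pci_resource_start(exporter_pdev, 0) + bar_off;

        return dma_map_resource(&importer_pdev->dev, bar_phys, len,
                                DMA_BIDIRECTIONAL, 0);
}

The caller would check the result with dma_mapping_error() and undo the
mapping with dma_unmap_resource() when the buffer is torn down.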

>
> On Tue, Sep 23, 2025 at 06:01:34AM +0000, Kasireddy, Vivek wrote:
>
> > - The Xe Graphics driver running inside the Linux VM creates a buffer
> > (Gnome Wayland compositor's framebuffer) in the VF's portion (or share)
> > of the VRAM and this buffer is shared with Qemu. Qemu then requests
> > the vfio-pci driver to create a dmabuf associated with this buffer.
>
> That's a bit late. What is EFI supposed to do?
If I understand your question correctly, what happens is that the Guest
VM's EFI/BIOS boot and kernel messages are all displayed via virtio-vga
(which is included by default?), if it is added to the VM. The VF's VRAM
does not get used until the Gnome/Mutter compositor starts, so until
that point all buffers are created from the Guest VM's system memory
only. (A rough sketch of the Qemu configuration I have in mind is
included below, after the quoted text.)

Thanks,
Vivek

>
> Simon
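
P.S. The rough Qemu configuration I have in mind for the above
(virtio-vga for firmware/boot output, the Xe VF passed through with
vfio-pci) would be something like the following. The VF address, OVMF
path and disk image are placeholders only, not taken from an actual
setup:

  # virtio-vga handles the EFI/boot display; the VF shows up as a separate
  # PCI device inside the guest and is only used once the compositor starts
  qemu-system-x86_64 -enable-kvm -m 8G -smp 4 \
      -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
      -device virtio-vga \
      -device vfio-pci,host=0000:03:00.1 \
      -drive file=guest.qcow2,if=virtio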
