Hi Simon,

> Subject: Re: [PATCH v4 1/5] PCI/P2PDMA: Don't enforce ACS check for device
> functions of Intel GPUs
> 
> Hi,
> 
> since I'm late to the party I'll reply to the entire thread in one go.
> 
> On Fri, Sep 19, 2025 at 06:22:45AM +0000, Kasireddy, Vivek wrote:
> 
> > I think using a PCI BAR Address works just fine in this case because the Xe
> > driver bound to the PF on the Host can easily determine that it belongs to
> > one of the VFs and translate it into a VRAM address.
> 
> There are PCIe bridges that support address translation, and might apply
> different translations for different PASIDs, so this determination would
> need to walk the device tree on both guest and host in a way that does
> not confer trust to the guest or allow it to gain access to resources
> through race conditions.
> 
> The difficulty here is that you are building a communication mechanism
> that bypasses a trust boundary in the virtualization framework, so it
> becomes part of the virtualization framework. I believe we can avoid
> that to some extent by exchanging handles instead of raw pointers.
> 
> I can see the point in using the dmabuf API, because it integrates well
> with existing 3D APIs in userspace, although I don't quite understand
> what the VK_EXT_external_memory_dma_buf extension actually does, besides
> defining a flag bit -- it seems the heavy lifting is done by the
> VK_KHR_external_memory_fd extension anyway. But yes, we probably want
> the interface to be compatible with existing sharing APIs on the host side
> at least, to allow the guest's "on-screen" images to be easily imported.
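
Right -- from what I can tell, VK_EXT_external_memory_dma_buf really only adds
the VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT handle type, and the import
itself goes through the generic fd path. Roughly (untested sketch; `device`,
`image`, `dmabuf_fd`, `size` and `memory_type_index` are assumed to have been
set up already, error handling omitted):

	/* memory_type_index would come from vkGetMemoryFdPropertiesKHR()
	 * plus the image's VkMemoryRequirements.
	 */
	VkImportMemoryFdInfoKHR import_info = {
		.sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_FD_INFO_KHR,
		.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT,
		.fd = dmabuf_fd,	/* ownership moves to the driver on success */
	};
	VkMemoryAllocateInfo alloc_info = {
		.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
		.pNext = &import_info,
		.allocationSize = size,
		.memoryTypeIndex = memory_type_index,
	};
	VkDeviceMemory memory;

	vkAllocateMemory(device, &alloc_info, NULL, &memory);
	vkBindImageMemory(device, image, memory, 0);
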
> 
> There is some potential for a shortcut here as well, giving these
> buffers directly to the host's desktop compositor instead of having an
> application react to updates by copying the data from the area shared
> with the VF to the area shared between the application and the
> compositor -- that would also be a reason to remain close to the
> existing interface.
> 
> It's not entirely necessary for this interface to be a dma_buf, as long
> as we have a conversion between a file descriptor and a BO.  On the
> other hand, it may be desirable to allow re-exporting it as a dma_buf if
> we want to access it from another device as well.
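
Agreed. On the userspace side the fd <-> BO conversion already exists via the
PRIME ioctls, so re-exporting should not need anything new. Roughly, with the
libdrm wrappers (sketch; `drm_fd` is an open render node, error handling
omitted):

	#include <stdint.h>
	#include <xf86drm.h>

	/* Import a dma_buf fd as a GEM handle, then re-export it. */
	static int reexport_dmabuf(int drm_fd, int dmabuf_fd)
	{
		uint32_t handle;
		int new_fd;

		drmPrimeFDToHandle(drm_fd, dmabuf_fd, &handle);
		/* ... use the BO through driver-specific ioctls here ... */
		drmPrimeHandleToFD(drm_fd, handle, DRM_CLOEXEC | DRM_RDWR, &new_fd);
		return new_fd;
	}
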
> 
> I'm not sure that is a likely use case though, even the horrible
> contraption I'm building here that has a Thunderbolt device send data
> directly to VRAM does not require that, because the guest would process
> the data and then send a different buffer to the host. Still would be
> nice for completeness.
> 
> The other thing that seems to be looming on the horizon is that dma_buf
> is too limited for VRAM buffers, because once it's imported, it is
> pinned as well, but we'd like to keep it moveable (there was another
> thread on the xe mailing list about that). That might even be more
> important if we have limited BAR space, because then we might not want
> to make the memory accessible through the BAR unless imported by
> something that needs access through the BAR, which we've established the
> main use case doesn't (because it doesn't even need any kind of access).
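
Regarding pinning: if I read the dma-buf code right, the dynamic importer
interface is meant to address exactly this -- an importer that supplies a
move_notify callback does not force the exporter to pin the buffer. A rough
kernel-side sketch (the my_*() names are made up):

	#include <linux/dma-buf.h>
	#include <linux/dma-mapping.h>
	#include <linux/dma-resv.h>
	#include <linux/err.h>

	static void my_invalidate_mappings(void *priv);	/* made-up driver code */

	static void my_move_notify(struct dma_buf_attachment *attach)
	{
		/* Exporter is about to move the buffer: drop our mappings,
		 * they get rebuilt under the reservation lock on next use.
		 */
		my_invalidate_mappings(attach->importer_priv);
	}

	static const struct dma_buf_attach_ops my_attach_ops = {
		.allow_peer2peer = true,	/* we can cope with bus/BAR addresses */
		.move_notify = my_move_notify,
	};

	static int my_import(struct device *dev, struct dma_buf *dmabuf, void *priv)
	{
		struct dma_buf_attachment *attach;
		struct sg_table *sgt;

		attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, priv);
		if (IS_ERR(attach))
			return PTR_ERR(attach);

		/* For dynamic attachments the mapping is created (and
		 * re-created after a move_notify) under the reservation lock.
		 */
		dma_resv_lock(dmabuf->resv, NULL);
		sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
		dma_resv_unlock(dmabuf->resv);

		return IS_ERR(sgt) ? PTR_ERR(sgt) : 0;
	}
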
> 
> I think passing objects between trust domains should take the form of an
> opaque handle that is not predictable, and refers to an internal data
> structure with the actual parameters (so we pass these internally as
> well, and avoid all the awkwardness of host and guest having different
> world views). It doesn't matter if that path is slow, it should only be
> used rather seldom (at VM start and when the VM changes screen
> resolution).
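
That sounds reasonable to me. Just to make sure we mean the same thing, a
sketch of what I imagine on the host side (all names made up): the handle
handed to the guest is an (index, random cookie) pair, and lookups refuse
guesses:

	#include <linux/random.h>
	#include <linux/xarray.h>

	struct shared_buf {
		u64 cookie;	/* random part of the guest-visible handle */
		/* ... actual parameters: size, placement, semaphores, ... */
	};

	static DEFINE_XARRAY_ALLOC(shared_bufs);

	static int shared_buf_publish(struct shared_buf *buf, u32 *id)
	{
		buf->cookie = get_random_u64();
		return xa_alloc(&shared_bufs, id, buf, xa_limit_32b, GFP_KERNEL);
	}

	/* Both the index and the cookie must match. */
	static struct shared_buf *shared_buf_lookup(u32 id, u64 cookie)
	{
		struct shared_buf *buf = xa_load(&shared_bufs, id);

		return (buf && buf->cookie == cookie) ? buf : NULL;
	}
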
> 
> For VM startup, we probably want to provision guest "on-screen" memory
> and semaphores really early -- maybe it makes sense to just give each VF
> a sensible shared mapping like 16 MB (rounded up from 2*1080p*32bit) by
> default, and/or present a ROM with EFI and OpenFirmware drivers -- can
> VFs do that on current hardware?
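
(For reference, 2 buffers x 1920 x 1080 pixels x 4 bytes = 16,588,800 bytes,
i.e. just under 16 MiB, so 16 MB per VF would indeed cover a double-buffered
1080p framebuffer.)
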
> 
> On Tue, Sep 23, 2025 at 05:53:06AM +0000, Kasireddy, Vivek wrote:
> 
> > IIUC, it is a common practice among GPU drivers including Xe and Amdgpu
> > to never expose VRAM Addresses and instead have BAR addresses as DMA
> > addresses when exporting dmabufs to other devices.
> 
> Yes, because that is how the other devices access that memory.
> 
> > The problem here is that the CPU physical address (aka the BAR address) is
> > only usable by the CPU.
> 
> The address you receive from mapping a dma_buf for a particular device
> is not a CPU physical address, even if it is identical on pretty much
> all PC hardware because it is uncommon to configure the root bridge with
> a translation there.
> 
> On my POWER9 machine, the situation is a bit different: a range in the
> lower 4 GB is reserved for 32-bit BARs, the memory with those physical
> addresses is remapped so it appears after the end of physical RAM from
> the point of view of PCIe devices, and the 32 bit BARs appear at the
> base of the PCIe bus (after the legacy ports).
> 
> So, as an example (reality is a bit more complex :> ) the memory map
> might look like
> 
> 0000000000000000..0000001fffffffff    RAM
> 0060000000000000..006001ffffffffff    PCIe domain 1
> 0060020000000000..006003ffffffffff    PCIe domain 2
> ...
> 
> and the phys_addr_t I get on the CPU refers to this mapping. However, a
> device attached to PCIe domain 1 would see
> 
> 0000000000000000..000000000000ffff    Legacy I/O in PCIe domain 1
> 0000000000010000..00000000000fffff    Legacy VGA mappings
> 0000000000100000..000000007fffffff    32-bit BARs in PCIe domain 1
> 0000000080000000..00000000ffffffff    RAM (accessible to 32 bit devices)
> 0000000100000000..0000001fffffffff    RAM (requires 64 bit addressing)
> 0000002000000000..000000207fffffff    RAM (CPU physical address 0..2GB)
> 0060000080000000..006001ffffffffff    64-bit BARs in PCIe domain 1
> 0060020000000000..006003ffffffffff    PCIe domain 2
> 
> This allows 32 bit devices to access other 32 bit devices on the same
> bus, and (some) physical memory, but we need to sacrifice the 1:1
> mapping for host memory. The actual mapping is a bit more complex,
> because 64 bit BARs get mapped into the "32 bit" space to keep them
> accessible for 32 bit cards in the same domain, and this would also be a
> valid reason not to extend the BAR size even if we can.
> 
> The default 256 MB aperture ends up in the "32 bit" range, so unless the
> BAR is resized and reallocated, the CPU and DMA addresses for the
> aperture *will* differ.
> 
> So when a DMA buffer is created that ends up in the first 2 GB of RAM,
> the dma_addr_t returned for this device will have 0x2000000000 added to
> it, because that is the address that the device will have to use, and
> DMA buffers for 32 bit devices will be taken from the 2GB..4GB range
> because neither the first 2 GB nor anything beyond 4 GB are accessible
> to this device.
> 
> If there is a 32 bit BAR at 0x10000000 in domain 1, then the CPU will
> see it at 0x60000010000000, but mapping it from another device in the
> same domain will return a dma_addr_t of 0x10000000 -- because that is
> the address that is routeable in the PCIe fabric, this is the BAR
> address configured into the device so it will actually respond, and the
> TLP will not leave the bus because it is downstream of the root bridge,
> so it does not affect the physical RAM.
> 
> Actual numbers will be different to handle even more corner cases and I
> don't remember exactly how many zeroes are in each range, but you get
> the idea -- and this is before we've even started creating virtual
> machines with a different view of physical addresses.
Thank you for taking the time to explain in detail how the memory map and
PCI addressing mechanism works.
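
To check my own understanding: for host memory the offset comes from the
dma-ranges of the bridge, so the direct-mapping code in effect does something
like the sketch below (illustration only, not the real kernel code; in the
kernel the equivalent lookup is done by phys_to_dma() walking
dev->dma_range_map):

	#include <linux/types.h>

	/* On the POWER9 example above, one entry would map CPU
	 * 0x0..0x7fffffff to bus address 0x2000000000.
	 */
	struct dma_range_example {
		u64 cpu_start;
		u64 dma_start;
		u64 size;
	};

	static u64 example_phys_to_dma(const struct dma_range_example *map,
				       int nr_ranges, u64 paddr)
	{
		int i;

		for (i = 0; i < nr_ranges; i++) {
			if (paddr >= map[i].cpu_start &&
			    paddr - map[i].cpu_start < map[i].size)
				return paddr - map[i].cpu_start + map[i].dma_start;
		}
		return paddr;	/* identity-mapped outside any dma-range */
	}

And for the peer BAR case the same principle applies, just with the bridge
window translation instead of dma-ranges, which is where the 0x10000000 vs
0x60000010000000 difference in your example comes from.
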

> 
> On Tue, Sep 23, 2025 at 06:01:34AM +0000, Kasireddy, Vivek wrote:
> 
> > - The Xe Graphics driver running inside the Linux VM creates a buffer
> > (Gnome Wayland compositor's framebuffer) in the VF's portion (or share)
> > of the VRAM and this buffer is shared with Qemu. Qemu then requests
> > vfio-pci driver to create a dmabuf associated with this buffer.
> 
> That's a bit late. What is EFI supposed to do?
If I understand your question correctly: the Guest VM's EFI/BIOS boot and
kernel messages are all displayed via virtio-vga, provided it is added to the
VM (which I believe is the default). The VF's VRAM does not get used until the
Gnome/Mutter compositor starts, so up to that point all buffers are created
from the Guest VM's system memory only.

Thanks,
Vivek

> 
>    Simon
