Hi,

since I'm late to the party I'll reply to the entire thread in one go.

On Fri, Sep 19, 2025 at 06:22:45AM +0000, Kasireddy, Vivek wrote:

> I think using a PCI BAR Address works just fine in this case because the Xe
> driver bound to PF on the Host can easily determine that it belongs to one
> of the VFs and translate it into VRAM Address.

There are PCIe bridges that support address translation, and might apply
different translations for different PASIDs, so this determination would
need to walk the device tree on both guest and host in a way that does
not confer trust to the guest or allow it to gain access to resources
through race conditions.

The difficulty here is that you are building a communication mechanism
that bypasses a trust boundary in the virtualization framework, so it
becomes part of the virtualization framework. I believe we can avoid
that to some extent by exchanging handles instead of raw pointers.

I can see the point in using the dmabuf API, because it integrates well
with existing 3D APIs in userspace, although I don't quite understand
what the VK_EXT_external_memory_dma_buf extension actually does, besides
defining a flag bit -- it seems the heavy lifting is done by the
VK_KHR_external_memory_fd extension anyway. But yes, we probably want
the interface to be compatible with existing sharing APIs on the host
side at least, to allow the guest's "on-screen" images to be easily
imported.
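
For reference, this is roughly what I think the import looks like on the
host side -- the dma_buf extension really only contributes the handle
type bit, the structures come from VK_KHR_external_memory_fd (sketch
only; error handling and memory-type selection via
vkGetMemoryFdPropertiesKHR are omitted):

/* Sketch: import a dma_buf fd as Vulkan device memory.  The dma_buf
 * extension only provides the handle type bit; the import itself goes
 * through the VK_KHR_external_memory_fd structures. */
#include <vulkan/vulkan.h>

VkDeviceMemory import_dmabuf(VkDevice dev, int dmabuf_fd,
                             VkDeviceSize size, uint32_t mem_type_index)
{
    VkImportMemoryFdInfoKHR import_info = {
        .sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_FD_INFO_KHR,
        .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_DMA_BUF_BIT_EXT,
        .fd = dmabuf_fd,        /* ownership transfers on success */
    };
    VkMemoryAllocateInfo alloc_info = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .pNext = &import_info,
        .allocationSize = size,
        .memoryTypeIndex = mem_type_index,
    };
    VkDeviceMemory mem = VK_NULL_HANDLE;

    if (vkAllocateMemory(dev, &alloc_info, NULL, &mem) != VK_SUCCESS)
        return VK_NULL_HANDLE;
    return mem;
}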

There is some potential for a shortcut here as well, giving these
buffers directly to the host's desktop compositor instead of having an
application react to updates by copying the data from the area shared
with the VF to the area shared between the application and the
compositor -- that would also be a reason to remain close to the
existing interface.

It's not entirely necessary for this interface to be a dma_buf, as long
as we have a conversion between a file descriptor and a BO.  On the
other hand, it may be desirable to allow re-exporting it as a dma_buf if
we want to access it from another device as well.
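
That pair of conversions already exists on the DRM side in the form of
the PRIME calls, so a rough userspace sketch (via libdrm, with an
already-open DRM device node) would be:

/* dma_buf fd <-> GEM/BO handle conversions through PRIME. */
#include <fcntl.h>
#include <stdint.h>
#include <xf86drm.h>

int fd_to_bo(int drm_fd, int prime_fd, uint32_t *handle)
{
    /* dma_buf fd -> driver-local BO handle */
    return drmPrimeFDToHandle(drm_fd, prime_fd, handle);
}

int bo_to_fd(int drm_fd, uint32_t handle, int *prime_fd)
{
    /* re-export the BO as a dma_buf fd, e.g. for another device */
    return drmPrimeHandleToFD(drm_fd, handle, DRM_CLOEXEC | DRM_RDWR,
                              prime_fd);
}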

I'm not sure that is a likely use case though -- even the horrible
contraption I'm building here, which has a Thunderbolt device send data
directly to VRAM, does not require that, because the guest would process
the data and then send a different buffer to the host. It would still be
nice for completeness.

The other thing that seems to be looming on the horizon is that dma_buf
is too limited for VRAM buffers: once a buffer is imported, it is pinned
as well, but we'd like to keep it movable (there was another thread on
the xe mailing list about that). That might be even more important if we
have limited BAR space, because then we might not want to make the
memory accessible through the BAR unless it is imported by something
that needs access through the BAR -- which, as we've established, the
main use case doesn't (because it doesn't even need any kind of access).
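
For completeness, the dynamic importer path (move_notify) is the
existing mechanism for keeping the buffer movable: instead of pinning,
the importer registers a callback and re-maps after the exporter has
moved the buffer. A rough sketch of the importer side, with locking
context and the actual re-mapping policy glossed over:

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/dma-resv.h>

static void my_move_notify(struct dma_buf_attachment *attach)
{
    /* called with the reservation lock held; invalidate our cached
     * mapping and arrange to re-map once the move has finished */
}

static const struct dma_buf_attach_ops my_attach_ops = {
    .allow_peer2peer = true,    /* we can follow the buffer into VRAM */
    .move_notify     = my_move_notify,
};

static int my_import(struct dma_buf *dmabuf, struct device *dev)
{
    struct dma_buf_attachment *attach;
    struct sg_table *sgt;

    attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, NULL);
    if (IS_ERR(attach))
        return PTR_ERR(attach);

    dma_resv_lock(dmabuf->resv, NULL);
    sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
    dma_resv_unlock(dmabuf->resv);
    if (IS_ERR(sgt)) {
        dma_buf_detach(dmabuf, attach);
        return PTR_ERR(sgt);
    }

    /* keep attach/sgt around; move_notify tells us when they go stale */
    return 0;
}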

I think passing objects between trust domains should take the form of an
opaque handle that is not predictable and refers to an internal data
structure with the actual parameters (so we pass these internally as
well, and avoid all the awkwardness of host and guest having different
world views). It doesn't matter if that path is slow; it should only be
needed rather seldom (at VM start and when the VM changes screen
resolution).
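
Purely illustrative (all names made up), the host-side bookkeeping could
look something like this, with the random token being the only thing the
guest ever sees:

/* Hypothetical sketch: track shared buffers by an unpredictable 64-bit
 * token instead of any guest- or host-visible address.  64-bit host
 * assumed for the xarray index. */
#include <linux/random.h>
#include <linux/xarray.h>

struct shared_buf {
    u64 handle;        /* opaque token handed to the guest */
    u64 vram_offset;   /* actual placement, never exposed */
    u64 size;
    u32 vf_id;         /* which VF may refer to it */
};

static DEFINE_XARRAY(shared_bufs);

static int shared_buf_publish(struct shared_buf *buf)
{
    int ret = -EBUSY;

    while (ret == -EBUSY) {
        buf->handle = get_random_u64();
        if (!buf->handle)
            continue;   /* keep 0 reserved as "no handle" */
        /* retried on the (unlikely) collision with an existing token */
        ret = xa_insert(&shared_bufs, buf->handle, buf, GFP_KERNEL);
    }

    return ret;
}

static struct shared_buf *shared_buf_lookup(u64 handle, u32 vf_id)
{
    struct shared_buf *buf = xa_load(&shared_bufs, handle);

    /* the guest only ever presents the token; validation stays host-side */
    return (buf && buf->vf_id == vf_id) ? buf : NULL;
}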

For VM startup, we probably want to provision guest "on-screen" memory
and semaphores really early -- maybe it makes sense to just give each VF
a sensible shared mapping like 16 MB (rounded up from 2*1080p*32bit) by
default, and/or present a ROM with EFI and OpenFirmware drivers -- can
VFs do that on current hardware?

On Tue, Sep 23, 2025 at 05:53:06AM +0000, Kasireddy, Vivek wrote:

> IIUC, it is a common practice among GPU drivers including Xe and Amdgpu
> to never expose VRAM Addresses and instead have BAR addresses as DMA
> addresses when exporting dmabufs to other devices.

Yes, because that is how the other devices access that memory.

> The problem here is that the CPU physical (aka BAR Address) is only
> usable by the CPU.

The address you receive from mapping a dma_buf for a particular device
is not a CPU physical address, even if it is identical on pretty much
all PC hardware because it is uncommon to configure the root bridge with
a translation there.

On my POWER9 machine, the situation is a bit different: a range in the
lower 4 GB is reserved for 32-bit BARs, the memory with those physical
addresses is remapped so it appears after the end of physical RAM from
the point of view of PCIe devices, and the 32 bit BARs appear at the
base of the PCIe bus (after the legacy ports).

So, as an example (reality is a bit more complex :> ) the memory map
might look like

0000000000000000..0000001fffffffff    RAM
0060000000000000..006001ffffffffff    PCIe domain 1
0060020000000000..006003ffffffffff    PCIe domain 2
...

and the phys_addr_t I get on the CPU refers to this mapping. However, a
device attached to PCIe domain 1 would see

0000000000000000..000000000000ffff    Legacy I/O in PCIe domain 1
0000000000010000..00000000000fffff    Legacy VGA mappings
0000000000100000..000000007fffffff    32-bit BARs in PCIe domain 1
0000000080000000..00000000ffffffff    RAM (accessible to 32 bit devices)
0000000100000000..0000001fffffffff    RAM (requires 64 bit addressing)
0000002000000000..000000207fffffff    RAM (CPU physical address 0..2GB)
0060000080000000..006001ffffffffff    64-bit BARs in PCIe domain 1
0060020000000000..006003ffffffffff    PCIe domain 2

This allows 32 bit devices to access other 32 bit devices on the same
bus, and (some) physical memory, but we need to sacrifice the 1:1
mapping for host memory. The actual mapping is a bit more complex,
because 64 bit BARs get mapped into the "32 bit" space to keep them
accessible for 32 bit cards in the same domain, and this would also be a
valid reason not to extend the BAR size even if we can.

The default 256 MB aperture ends up in the "32 bit" range, so unless the
BAR is resized and reallocated, the CPU and DMA addresses for the
aperture *will* differ.

So when a DMA buffer is created that ends up in the first 2 GB of RAM,
the dma_addr_t returned for this device will have 0x2000000000 added to
it, because that is the address that the device will have to use, and
DMA buffers for 32 bit devices will be taken from the 2GB..4GB range
because neither the first 2 GB nor anything beyond 4 GB are accessible
to this device.
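
In code terms that just means trusting whatever dma_addr_t the DMA API
hands back for the device and never substituting the CPU physical
address, e.g. (sketch):

#include <linux/dma-mapping.h>

static dma_addr_t example_map(struct device *dev, void *buf, size_t size)
{
    /* virt_to_phys(buf) and the returned handle may differ; only the
     * returned handle is meaningful on the device's side of the bus */
    dma_addr_t dma_handle = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

    if (dma_mapping_error(dev, dma_handle))
        return DMA_MAPPING_ERROR;

    return dma_handle;
}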

If there is a 32 bit BAR at 0x10000000 in domain 1, then the CPU will
see it at 0x60000010000000, but mapping it from another device in the
same domain will return a dma_addr_t of 0x10000000 -- because that is
the address that is routeable in the PCIe fabric, this is the BAR
address configured into the device so it will actually respond, and the
TLP will not leave the bus because it is downstream of the root bridge,
so it does not affect the physical RAM.

Actual numbers will be different to handle even more corner cases and I
don't remember exactly how many zeroes are in each range, but you get
the idea -- and this is before we've even started creating virtual
machines with a different view of physical addresses.

On Tue, Sep 23, 2025 at 06:01:34AM +0000, Kasireddy, Vivek wrote:

> - The Xe Graphics driver running inside the Linux VM creates a buffer
> (Gnome Wayland compositor's framebuffer) in the VF's portion (or share)
> of the VRAM and this buffer is shared with Qemu. Qemu then requests
> vfio-pci driver to create a dmabuf associated with this buffer.

That's a bit late. What is EFI supposed to do?

   Simon
