On Tue, Jul 15, 2025 at 10:53:50AM +0000, Duan, Zhenzhong wrote:
> 
> 
> >-----Original Message-----
> >From: Shameer Kolothum <shameerali.kolothum.th...@huawei.com>
> >Subject: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated
> >SMMUv3 to vfio-pci endpoints with iommufd
> >
> >Accelerated SMMUv3 is only useful when the device can take advantage of
> >the host's SMMUv3 in nested mode. To keep things simple and correct, we
> >only allow this feature for vfio-pci endpoint devices that use the iommufd
> >backend. We also allow non-endpoint emulated devices like PCI bridges and
> >root ports, so that users can plug in these vfio-pci devices.
> >
> >Another reason for this limit is to avoid problems with IOTLB
> >invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> >SID, making it difficult to trace the originating device. If we allowed
> >emulated endpoint devices, QEMU would have to invalidate both its own
> >software IOTLB and the host's hardware IOTLB, which could slow things
> >down.
> >
> >Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> >translation (S1+S2), their get_address_space() callback must return the
> >system address space to enable correct S2 mappings of guest RAM.
> >
> >So in short:
> > - vfio-pci devices return the system address space
> > - bridges and root ports return the IOMMU address space
> >
> >Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
> 
> So the translation result is a doorbell addr(gpa) for guest?
> IIUC, there should be a mapping between guest doorbell addr(gpa) to host
> doorbell addr(hpa) in stage2 page table? Where is this mapping setup?

Yes and yes.

On ARM, MSI writes go through the IOMMU. When 2-stage translation
is enabled, they are translated by both stages, as you understood.

There are a few ways to implement this, though the current kernel
only supports one solution, which is a hard-coded RMR (reserved
memory region).

The solution sets up an RMR region in the ACPI IORT table, which
maps stage-1 linearly, i.e. gIOVA == gPA.

The gPA=>hPA mappings in stage-2 are set up by the kernel, which
picks up the IOMMU_RESV_SW_MSI region defined in the kernel driver.
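
To put the two stages side by side, here is a minimal sketch (not
actual QEMU or kernel code; the window addresses below are made-up
placeholders, purely for illustration):

#include <stdint.h>

/* Stage-1 (guest-owned): the RMR in the IORT identity-maps this
 * range, so the gIOVA the device emits equals the gPA. */
static uint64_t stage1_msi(uint64_t giova)
{
    return giova;                   /* gIOVA == gPA inside the RMR */
}

/* Stage-2 (host-owned): the kernel maps the IOMMU_RESV_SW_MSI window
 * gPA => hPA, so the write lands on the real GIC ITS doorbell page. */
static uint64_t stage2_msi(uint64_t gpa, uint64_t sw_msi_gpa,
                           uint64_t sw_msi_hpa)
{
    return sw_msi_hpa + (gpa - sw_msi_gpa);
}

static uint64_t translate_msi_doorbell(uint64_t giova)
{
    /* Placeholder window addresses, not a real memory layout. */
    const uint64_t sw_msi_gpa = 0x8000000ULL;
    const uint64_t sw_msi_hpa = 0x10000000ULL;

    return stage2_msi(stage1_msi(giova), sw_msi_gpa, sw_msi_hpa);
}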

It's not the ideal solution, but it's the simplest to implement.

There are other ways to support this, such as a true 2-stage
mapping, but they are still on the way.

For more details, please refer to this:
https://lore.kernel.org/all/cover.1740014950.git.nicol...@nvidia.com/

> >+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> >+{
> >+
> >+    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
> >+        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
> >+        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
> >+        return true;
> >+    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
> >+        object_property_find(OBJECT(pdev), "iommufd"))) {
> 
> Will this always return true?

It won't if a vfio-pci device doesn't have the "iommufd" property?

> >+        *vfio_pci = true;
> >+        return true;
> >+    }
> >+    return false;

Then, it returns "false" here.

> > static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void
> >*opaque,
> >                                               int devfn)
> > {
> >+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> >     SMMUState *bs = opaque;
> >+    bool vfio_pci = false;
> >     SMMUPciBus *sbus;
> >     SMMUv3AccelDevice *accel_dev;
> >     SMMUDevice *sdev;
> >
> >+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> >+        error_report("Device(%s) not allowed. Only PCIe root complex
> >devices "
> >+                     "or PCI bridge devices or vfio-pci endpoint devices
> >with "
> >+                     "iommufd as backend is allowed with
> >arm-smmuv3,accel=on",
> >+                     pdev->name);
> >+        exit(1);
> 
> Seems aggressive for a hotplug, could we fail hotplug instead of kill QEMU?

Hotplug is unlikely to be supported well, as it would introduce
too much complication.

With iommufd, a vIOMMU object is allocated per device (vfio). If
the device fd (cdev) is not yet given to QEMU, it isn't able to
allocate a vIOMMU object when creating the VM.
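
As a rough sketch of why the cdev has to be there up front -- the
struct and ioctl names below are from my reading of the iommufd
uAPI headers and may not be exact, so treat them as assumptions
rather than a reference:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

/* Sketch: allocating a vIOMMU object takes the iommufd device id
 * that the VFIO cdev was bound to, so it cannot be done before the
 * device itself is handed to QEMU. */
static int viommu_alloc(int iommufd, uint32_t dev_id,
                        uint32_t s2_hwpt_id, uint32_t *out_viommu_id)
{
    struct iommu_viommu_alloc alloc = {
        .size = sizeof(alloc),
        .type = IOMMU_VIOMMU_TYPE_ARM_SMMUV3, /* assumed enum name */
        .dev_id = dev_id,       /* from the cold-plugged vfio cdev */
        .hwpt_id = s2_hwpt_id,  /* nesting parent (stage-2) HWPT */
    };

    if (ioctl(iommufd, IOMMU_VIOMMU_ALLOC, &alloc)) {
        return -1;
    }
    *out_viommu_id = alloc.out_viommu_id;
    return 0;
}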

A vIOMMU object could be allocated at a later stage once the device
is hot-plugged, but things like the IORT mappings can't be refreshed
since the guest OS has likely already booted. Even an IOMMU
capability sync via the hw_info ioctl would be difficult to do at
runtime, after the guest IOMMU driver has been initialized.

I am not 100% sure, but I think the Intel model could have a similar
problem if the guest boots with zero cold-plugged devices and then
hot-plugs a PASID-capable device at runtime, when the guest-level
IOMMU driver is already initialized?

FWIW, Shameer's cover-letter has the following line:
 "At least one vfio-pci device must currently be cold-plugged to
  a PCIe root complex associated with arm-smmuv3,accel=on."

Perhaps there should be a similar highlight in this smmuv3-accel
file as well (@Shameer).
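
Just as a suggestion for the wording (not part of the patch),
something like this near the top of the smmuv3-accel file would do:

/*
 * Note: accel=on requires at least one cold-plugged vfio-pci device
 * (iommufd backend) on the PCIe root complex associated with this
 * SMMUv3. Hotplugging the first such device is not supported, since
 * the vIOMMU object and the IORT/RMR layout are fixed at VM creation.
 */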

Nicolin
