[Qemu-devel] [PATCH] pci-devfn: check that device/slot number is within range
Need to check that the guest slot/device number is not > 31, otherwise we walk off the end of the devfn table when checking whether a devfn is available in a guest. Before this fix, passing in addr=abc or addr=34 could crash qemu, or sometimes fail gracefully if the data past the end of the devfn table happened to fail the availability test. With this fix, we get a clean error:

  Property 'pci-assign.addr' doesn't take value '34'

Also tested with no addr= parameter passed for the guest (pcicfg) address, and that worked as well.

Signed-off-by: Don Dutile
---
 hw/qdev-properties.c | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/hw/qdev-properties.c b/hw/qdev-properties.c
index 7ce95b6..e0e54aa 100644
--- a/hw/qdev-properties.c
+++ b/hw/qdev-properties.c
@@ -524,6 +524,8 @@ static int parse_pci_devfn(DeviceState *dev, Property *prop, const char *str)
         return -EINVAL;
     if (fn > 7)
         return -EINVAL;
+    if (slot > 31)
+        return -EINVAL;
     *ptr = slot << 3 | fn;
     return 0;
 }
--
1.7.1
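For context, the parsing rule the fix enforces can be sketched standalone (a simplified sketch, not the exact QEMU parse_pci_devfn() code): the addr property is parsed as hex "<slot>" or "<slot>.<fn>", the function number must fit in 3 bits, the slot in 5 bits, and the two are packed into a single devfn byte. With the added slot check, addr=34 (hex 0x34 = 52) and addr=abc (hex 0xabc) are rejected up front instead of indexing past the 32-entry devfn table.

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Simplified sketch of the devfn parsing rule; not the QEMU code itself. */
    static int parse_devfn(const char *str, uint8_t *out)
    {
        unsigned int slot, fn = 0;
        int n = 0;

        if (sscanf(str, "%x.%x%n", &slot, &fn, &n) != 2 || str[n] != '\0') {
            fn = 0;
            if (sscanf(str, "%x%n", &slot, &n) != 1 || str[n] != '\0') {
                return -EINVAL;          /* not a hex number at all */
            }
        }
        if (fn > 7 || slot > 31) {       /* the range check this patch adds */
            return -EINVAL;              /* addr=34 -> slot 0x34 = 52; addr=abc -> 0xabc */
        }
        *out = slot << 3 | fn;           /* devfn = slot[7:3] | fn[2:0] */
        return 0;
    }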
Re: [PATCH v3] hw/arm/virt: Avoid unexpected warning from Linux guest on host with Fujitsu CPUs
On 6/11/24 10:05 PM, Zhenyu Zhang wrote:

Multiple warning messages and corresponding backtraces are observed when a Linux guest is booted on a host with Fujitsu CPUs. One of them is shown below.

[0.032443] [ cut here ]
[0.032446] uart-pl011 900.pl011: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (128 < 256)
[0.032454] WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:54 arch_setup_dma_ops+0xbc/0xcc
[0.032470] Modules linked in:
[0.032475] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-452.el9.aarch64
[0.032481] Hardware name: linux,dummy-virt (DT)
[0.032484] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[0.032490] pc : arch_setup_dma_ops+0xbc/0xcc
[0.032496] lr : arch_setup_dma_ops+0xbc/0xcc
[0.032501] sp : 80008003b860
[0.032503] x29: 80008003b860 x28: x27: aae4b949049c
[0.032510] x26: x25: x24:
[0.032517] x23: 0100 x22: x21:
[0.032523] x20: 0001 x19: 2f06c02ea400 x18:
[0.032529] x17: 208a5f76 x16: 6589dbcb x15: aae4ba071c89
[0.032535] x14: x13: aae4ba071c84 x12: 455f525443206e61
[0.032541] x11: 68742072656c6c61 x10: 0029 x9 : aae4b7d21da4
[0.032547] x8 : 0029 x7 : 4c414e494d5f414d x6 : 0029
[0.032553] x5 : 000f x4 : aae4b9617a00 x3 : 0001
[0.032558] x2 : x1 : x0 : 2f06c029be40
[0.032564] Call trace:
[0.032566] arch_setup_dma_ops+0xbc/0xcc
[0.032572] of_dma_configure_id+0x138/0x300
[0.032591] amba_dma_configure+0x34/0xc0
[0.032600] really_probe+0x78/0x3dc
[0.032614] __driver_probe_device+0x108/0x160
[0.032619] driver_probe_device+0x44/0x114
[0.032624] __device_attach_driver+0xb8/0x14c
[0.032629] bus_for_each_drv+0x88/0xe4
[0.032634] __device_attach+0xb0/0x1e0
[0.032638] device_initial_probe+0x18/0x20
[0.032643] bus_probe_device+0xa8/0xb0
[0.032648] device_add+0x4b4/0x6c0
[0.032652] amba_device_try_add.part.0+0x48/0x360
[0.032657] amba_device_add+0x104/0x144
[0.032662] of_amba_device_create.isra.0+0x100/0x1c4
[0.032666] of_platform_bus_create+0x294/0x35c
[0.032669] of_platform_populate+0x5c/0x150
[0.032672] of_platform_default_populate_init+0xd0/0xec
[0.032697] do_one_initcall+0x4c/0x2e0
[0.032701] do_initcalls+0x100/0x13c
[0.032707] kernel_init_freeable+0x1c8/0x21c
[0.032712] kernel_init+0x28/0x140
[0.032731] ret_from_fork+0x10/0x20
[0.032735] ---[ end trace ]---

In Linux, a check is applied to every device exposed through a device-tree node. The warning is raised when the device isn't DMA coherent and the cache line size is larger than ARCH_DMA_MINALIGN (128 bytes). The cache line size is derived from CTR_EL0[CWG], which corresponds to 256 bytes on the guest CPUs. The DMA coherent capability is claimed through 'dma-coherent' in the devices' device-tree nodes or their parent nodes.

Fix the issue by adding a 'dma-coherent' property to the device-tree root node, meaning all devices are DMA coherent by default.

Signed-off-by: Zhenyu Zhang
---
v3: Add comments explaining why we add 'dma-coherent' property (Peter)
---
 hw/arm/virt.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 3c93c0c0a6..3cefac6d43 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -271,6 +271,17 @@ static void create_fdt(VirtMachineState *vms)
     qemu_fdt_setprop_cell(fdt, "/", "#size-cells", 0x2);
     qemu_fdt_setprop_string(fdt, "/", "model", "linux,dummy-virt");
 
+    /*
+     * For QEMU, all DMA is coherent. Advertising this in the root node
+     * has two benefits:
+     *
+     * - It avoids potential bugs where we forget to mark a DMA
+     *   capable device as being dma-coherent
+     * - It avoids spurious warnings from the Linux kernel about
+     *   devices which can't do DMA at all
+     */
+    qemu_fdt_setprop(fdt, "/", "dma-coherent", NULL, 0);
+
     /* /chosen must exist for load_dtb to fill in necessary properties later */
     qemu_fdt_add_subnode(fdt, "/chosen");
     if (vms->dtb_randomness) {

+1 to Peter's suggested comment; otherwise, unless privy to this thread, one would wonder how/why.

Reviewed-by: Donald Dutile
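As an aside, the guest-side check that produces the warning above can be sketched as follows (a minimal sketch of the logic the commit message describes, not the kernel's exact code from arch/arm64/mm/dma-mapping.c): the cache writeback granule is decoded from CTR_EL0.CWG and compared against ARCH_DMA_MINALIGN for any device not marked dma-coherent. With the root-node 'dma-coherent' property in place, every DT device is treated as coherent and the warning path is never reached.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ARCH_DMA_MINALIGN 128            /* bytes, arm64 default */

    /* CTR_EL0.CWG (bits [27:24]) encodes the cache writeback granule as
     * log2(words); 4 << CWG gives it in bytes (0 means "not reported"). */
    static unsigned int cwg_bytes(uint64_t ctr_el0)
    {
        unsigned int cwg = (ctr_el0 >> 24) & 0xf;
        return cwg ? 4U << cwg : ARCH_DMA_MINALIGN;
    }

    int main(void)
    {
        uint64_t ctr_el0 = 6ULL << 24;       /* CWG = 6 -> 256-byte granule, as on these hosts */
        bool dma_coherent = false;           /* true once "dma-coherent" is in the DT */

        if (!dma_coherent && cwg_bytes(ctr_el0) > ARCH_DMA_MINALIGN) {
            printf("ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (%d < %u)\n",
                   ARCH_DMA_MINALIGN, cwg_bytes(ctr_el0));
        }
        return 0;
    }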
Re: nested-smmuv3 topic for QEMU/libvirt, Nov 2024
On 11/7/24 3:31 PM, Nicolin Chen wrote: Hi Eric, On Thu, Nov 07, 2024 at 12:11:05PM +0100, Eric Auger wrote: On 11/1/24 05:09, Nicolin Chen wrote: Hi, This is a continued discussion following previous month's: https://lore.kernel.org/qemu-devel/Zvr%2Fbf7KgLN1cjOl@Asurada-Nvidia/ Kernel changes are getting closer to merge, as Jason's planning to take vIOMMU series and smmuv3_nesting series into the iommufd tree: https://lore.kernel.org/all/cover.1730313494.git.nicol...@nvidia.com/ https://lore.kernel.org/all/cover.1730313494.git.nicol...@nvidia.com/ https://lore.kernel.org/all/0-v4-9e99b76f3518+3a8-smmuv3_nesting_...@nvidia.com/ So, I think it's probably a good time to align with each others and talk about kicking off some QEMU upstream work in the month ahead. @Shameer, Do you have some update on the pluggable smmuv3 module? Updates on my side: 1) I have kept uAPI updated to the latest version and verified too. There should be some polishing changes depending on how the basic nesting infrastructure would look like from Intel/Duan's work. 2) I got some help from NVIDIA folks for the libvirt task. And they have done some drafting and are now verifying the PCI topology with "iommu=none". Once the pluggable smmuv3 module is ready to test, we will make some change to libvirt for that and drop the auto-assigning patches from the VIRT code, so as to converge for a libvirt+QEMU test. FWIW, Robin requested a different solution for MSI mapping [1], v.s. the RMR one that we have been using since Eric's work. I drafted a few VFIO/IOMMUFD patches for that, yet paused for getting the vIOMMU series merged to this kernel cycle. I plan to continue in Nov/Dec. So, for the near term we will continue with the RMR solution, until we have something solid later. [1] https://lore.kernel.org/linux-iommu/ZrVN05VylFq8lK4q@Asurada-Nvidia/ At Red Hat we may find some cycles to resume working on the QEMU integration. Please can you sketch some tasks we could carry out in coordination with you and Shameer? Adding Don in the loop. That is great! So, given that Shameer is working on pluggable module part and we have folks working on libvirt. I think the only big thing here is the SMMUv3 series itself. Please refer to the changes in the link: - cover-letter: Add HW accelerated nesting support for arm SMMUv3 - https://github.com/nicolinc/qemu/commits/wip/for_smmuv3_nesting-v4/ I expect the IOMMU_HWPT_ALLOC (backend APIs) will go with Intel's series once their current "emulated devices" one gets merged. And I think I can prepare IOMMU_VIOMMU_ALLOC patches for backend APIs aligning with HWPT's. That being said, one thing I am not sure is how we should bridge between these backend APIs and a virtual IOMMUs (vSMMU/intel). I think it'd be better for you and Red Hat to provide some insight, if calling the backend APIs directly from a viommu module isn't a preferable way. We also need your comments on vSMMU module patches that are still roughly a draft requiring a rework at some small details I think. So, if your (and Don's) bandwidth allows, perhaps you could take over the vSMMU patches? Otherwise, we can also share the task for reworking. Thanks! Nicolin Nicolin, Hi! Sounds like a plan. Glad to join the effort. - Don
Re: [RFC PATCH 2/5] hw/arm/smmuv3: Add initial support for SMMUv3 Nested device
On 11/27/24 11:00 AM, Jason Gunthorpe wrote:

On Wed, Nov 27, 2024 at 10:21:24AM +, Shameerali Kolothum Thodi wrote:

For SMMUv3, NVIDIA-specific vCMDQ, it needs a parameter to state that specifically, since I'm concluding from reading the SMMUv3 version G.a spec that ECMDQ was added to be able to assign an ECMDQ to a VM,

Not sure the intention of ECMDQ as per that specification is to assign it to a VM. I think the main idea behind it is to have one Command Queue per host CPU to eliminate lock contention while submitting commands to the SMMU.

Right

AFAIK it is not safe to assign one of the ECMDQs to a guest yet. I think there is no way you can associate a VMID with an ECMDQ. So there is no plan to support ARM ECMDQ now.

Yep

'Yet' ... The association would be done via the VMM -- no different than what associates an assigned device to a VM today -- no hw-level (VM-)ID needed; it's a matter of exposing it to the VM, or not, or of mapping the (virtual) CMDQ to the mapped/associated ECMDQ. They are purposely mapped 64K apart from each other, enabling page-level protection, which I doubt is a per-CPU requirement for lock contention avoidance (large-cache-block spacing would be sufficient, even 4K; it's 64K spacing between ECMDQ regs, the largest ARM page size). Summary: let's not assume this can't happen, and that the chosen cmdline prevents it.

NVIDIA VCMDQ is a completely vendor-specific one. Perhaps ARM may come up with an assignable CMDQ in future though.

Yes, it is easy to imagine an ECMDQ extension that provides the same HW features that VCMDQ has in future. I hope ARM will develop one.

Right, so how do we know what op is being "accel"'d wrt smmuv3?

... and it all needs to be per-instance ... libvirt (or any other VMM orchestrator) will need to determine compatibility for live migration. e.g., can one live migrate an accel=nv-vcmdq-based VM to a host with accel=ecmdq support? only nv-vcmdq? what if there are version diffs of nv-vcmdq over time? -- apologies, but I don't know the minute details of nv-vcmdq to determine if that's unlikely or not.

Yes. This requires more thought. But our first aim is to get the basic smmuv3-accel support in.

Yeah, there is no live migration support yet in the SMMU qemu driver, AFAIK? When it gets done the supported options will have to be considered

Jason
Re: [RFC PATCH 2/5] hw/arm/smmuv3: Add initial support for SMMUv3 Nested device
On 11/27/24 5:21 AM, Shameerali Kolothum Thodi wrote: -Original Message- From: Donald Dutile Sent: Tuesday, November 26, 2024 6:29 PM To: Nicolin Chen ; Eric Auger Cc: Shameerali Kolothum Thodi ; qemu-...@nongnu.org; qemu-devel@nongnu.org; peter.mayd...@linaro.org; j...@nvidia.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org Subject: Re: [RFC PATCH 2/5] hw/arm/smmuv3: Add initial support for SMMUv3 Nested device On 11/13/24 1:05 PM, Nicolin Chen wrote: Hi Eric, On Wed, Nov 13, 2024 at 06:12:15PM +0100, Eric Auger wrote: On 11/8/24 13:52, Shameer Kolothum wrote: @@ -181,6 +181,7 @@ static const MemMapEntry base_memmap[] = { [VIRT_PVTIME] = { 0x090a, 0x0001 }, [VIRT_SECURE_GPIO] ={ 0x090b, 0x1000 }, [VIRT_MMIO] = { 0x0a00, 0x0200 }, +[VIRT_SMMU_NESTED] ={ 0x0b00, 0x0100 }, I agree with Mostafa that the _NESTED terminology may not be the best choice. The motivation behind that multi-instance attempt, as introduced in https://lore.kernel.org/all/ZEcT%2F7erkhHDaNvD@Asurada-Nvidia/ was: - SMMUs with different feature bits - support of VCMDQ HW extension for SMMU CMDQ - need for separate S1 invalidation paths If I understand correctly this is mostly wanted for VCMDQ handling? if this is correct we may indicate that somehow in the terminology. If I understand correctly VCMDQ terminology is NVidia specific while ECMDQ is the baseline (?). VCMDQ makes a multi-vSMMU-instance design a hard requirement, yet the point (3) for separate invalidation paths also matters. Jason suggested VMM in base case to create multi vSMMU instances as the kernel doc mentioned here: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux- next.git/tree/Documentation/userspace-api/iommufd.rst#n84 W.r.t naming, maybe something related to "hardware-accelerated"? Given that 'accel' has been used for hw-acceleration elsewhere, that seems like a reasonable 'mode'. But, it needs a paramater to state was is being accelerated. i.e., the more global 'accel=kvm' has 'kvm'. I was thinking more like calling this hw accelerated nested SMMUv3 emulation as 'smmuv3-accel'. This avoids confusion with the already existing 'iommu=smmuv3' that also has a nested emulation support. ie, -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1 \ I -think- you are saying below, that we have to think a bit more about this device tagging. I'm thinking more like - device arm-smmuv3,accel=,id=smmu1,bus=pcie.1 \ For SMMUv3, NVIDIA-specific vCMDQ, it needs a parameter to state that specifically, since I'm concluding from reading the SMMUv3 version G.a spec, that ECMDQ was added to be able to assign an ECMDQ to a VM, Not sure the intention of ECMDQ as per that specification is to assign it to a VM. I think the main idea behind it is to have one Command Queue per host CPU to eliminate lock contention while submitting commands to SMMU. AFAIK it is not safe to assign one of the ECMDQ to guest yet. I think there is no way you can associate a VMID with ECMDQ. So there is no plan to support ARM ECMDQ now. NVIDIA VCMDQ is a completely vendor specific one. Perhaps ARM may come up with an assignable CMDQ in future though. and let the VM do CMDQ driven invalidations via a similar mechanism as assigned PCI-device mmio space in a VM. So, how should the QEMU invocation select what parts to 'accel' in the vSMMUv3 given to the VM? ... and given the history of hw-based, virt-acceleration, I can only guess more SMMUv3 accel tweaks will be found/desired/implemented. 
So, given there is an NVIDIA-specific/like ECMDQ, but different, the accel parameter chosen has to consider 'name-space collision', i.e., accel=nv-vcmdq and accel=ecmdq, unless sw can be made to smartly probe and determine the underlying diffs, and have equivalent functionality, in which case, a simpler 'accel=vcmdq' could be used. Yep. Probably we could abstract that from the user and handle it within Qemu when the kernel reports the capability based on physical SMMUv3. Finally, wrt libvirt, how does it know/tell what can and should be used? For ECMDQ, something under sysfs for an SMMUv3 could expose its presence/capability/availability (tag for use/alloc'd for a VM), or an ioctl/cdev i/f to the SMMUv3. But how does one know today that there's NVIDIA-vCMDQ support on its SMMUv3? -- is it exposed in sysfs, ioctl, cdev? I think the capability will be reported through a IOCTL. Nicolin ? ... and all needs to be per-instance ... libvirt (or any other VMM orchestrator) will need to determine compatibility for live migration. e.g., can one live migrate an accel=nv-vcmdq-based VM to a host with accel=ecmdq support? only nv-vcmdq? what if there are version diffs of nv-vcmdq over time? -- apo
Re: [RFC PATCH 2/5] hw/arm/smmuv3: Add initial support for SMMUv3 Nested device
On 11/13/24 1:05 PM, Nicolin Chen wrote:

Hi Eric,

On Wed, Nov 13, 2024 at 06:12:15PM +0100, Eric Auger wrote:

On 11/8/24 13:52, Shameer Kolothum wrote:

@@ -181,6 +181,7 @@ static const MemMapEntry base_memmap[] = {
     [VIRT_PVTIME] =         { 0x090a, 0x0001 },
     [VIRT_SECURE_GPIO] =    { 0x090b, 0x1000 },
     [VIRT_MMIO] =           { 0x0a00, 0x0200 },
+    [VIRT_SMMU_NESTED] =    { 0x0b00, 0x0100 },

I agree with Mostafa that the _NESTED terminology may not be the best choice.

The motivation behind that multi-instance attempt, as introduced in https://lore.kernel.org/all/ZEcT%2F7erkhHDaNvD@Asurada-Nvidia/ was:
- SMMUs with different feature bits
- support of VCMDQ HW extension for SMMU CMDQ
- need for separate S1 invalidation paths

If I understand correctly this is mostly wanted for VCMDQ handling? if this is correct we may indicate that somehow in the terminology. If I understand correctly VCMDQ terminology is NVidia specific while ECMDQ is the baseline (?).

VCMDQ makes a multi-vSMMU-instance design a hard requirement, yet the point (3) for separate invalidation paths also matters. Jason suggested the VMM in the base case create multi vSMMU instances, as the kernel doc mentions here:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/userspace-api/iommufd.rst#n84

W.r.t naming, maybe something related to "hardware-accelerated"?

Given that 'accel' has been used for hw-acceleration elsewhere, that seems like a reasonable 'mode'. But it needs a parameter to state what is being accelerated. i.e., the more global 'accel=kvm' has 'kvm'.

For SMMUv3, NVIDIA-specific vCMDQ, it needs a parameter to state that specifically, since I'm concluding from reading the SMMUv3 version G.a spec that ECMDQ was added to be able to assign an ECMDQ to a VM, and let the VM do CMDQ-driven invalidations via a similar mechanism as assigned PCI-device mmio space in a VM. So, how should the QEMU invocation select what parts to 'accel' in the vSMMUv3 given to the VM? ... and given the history of hw-based virt-acceleration, I can only guess more SMMUv3 accel tweaks will be found/desired/implemented.

So, given there is an NVIDIA-specific/like ECMDQ, but different, the accel parameter chosen has to consider 'name-space collision', i.e., accel=nv-vcmdq and accel=ecmdq, unless sw can be made to smartly probe and determine the underlying diffs, and have equivalent functionality, in which case a simpler 'accel=vcmdq' could be used.

Finally, wrt libvirt, how does it know/tell what can and should be used? For ECMDQ, something under sysfs for an SMMUv3 could expose its presence/capability/availability (tag for use/alloc'd for a VM), or an ioctl/cdev i/f to the SMMUv3. But how does one know today that there's NVIDIA-vCMDQ support on its SMMUv3? -- is it exposed in sysfs, ioctl, cdev?

... and it all needs to be per-instance ... libvirt (or any other VMM orchestrator) will need to determine compatibility for live migration. e.g., can one live migrate an accel=nv-vcmdq-based VM to a host with accel=ecmdq support? only nv-vcmdq? what if there are version diffs of nv-vcmdq over time? -- apologies, but I don't know the minute details of nv-vcmdq to determine if that's unlikely or not.

Once the qemu-smmuv3-api is defined, with the recognition of what libvirt (or any other VMM) needs to probe/check/use for hw-accelerated features, I think it'll be more straightforward to implement, and (clearly) understand from a qemu command line. :)

Thanks,
- Don

Thanks
Nicolin
Re: [RFC PATCH 2/5] hw/arm/smmuv3: Add initial support for SMMUv3 Nested device
On 11/28/24 7:54 AM, Jason Gunthorpe wrote: On Wed, Nov 27, 2024 at 08:44:47PM -0800, Nicolin Chen wrote: On Wed, Nov 27, 2024 at 11:29:06PM -0500, Donald Dutile wrote: On 11/27/24 5:21 AM, Shameerali Kolothum Thodi wrote: W.r.t naming, maybe something related to "hardware-accelerated"? Given that 'accel' has been used for hw-acceleration elsewhere, that seems like a reasonable 'mode'. But, it needs a paramater to state was is being accelerated. i.e., the more global 'accel=kvm' has 'kvm'. I was thinking more like calling this hw accelerated nested SMMUv3 emulation as 'smmuv3-accel'. This avoids confusion with the already existing 'iommu=smmuv3' that also has a nested emulation support. ie, -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1 \ .. I -think- you are saying below, that we have to think a bit more about this device tagging. I'm thinking more like - device arm-smmuv3,accel=,id=smmu1,bus=pcie.1 \ I wonder if we really need a "vcmdq" enabling/disabling option? Jason's suggested approach for a vIOMMU allocation is to retry- on-failure, so my draft patches allocate a TEGRA241_CMDQV type of vIOMMU first, and then fall back to a regular SMMUV3 type if it fails. So, a host that doesn't have a VCMDQ capability could still work with the fallback/default pathway. It needs to be configurable so the VM can be configured in a consistent way across nodes Yes. To expound further, one wants to be able to define an 'acceptable' VM criteria, so libvirt (or OpenStack?) can find and generate the list of 'acceptable nodes', priori typically, that can be a match for the acceptance criteria. Conversely, if one specifies a set of systems that one wants to be able to migrate across, then libvirt needs to find and select/set the features|attributes that enable the VM to migrate in a compatible way. autodetection of host features is nice for experimenting but scale deployments should precisely specify every detail about the VM and not rely on host detection. Otherwise the VM instance type will be ill defined.. Jason
Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable nested SMMUv3
On 12/13/24 8:19 AM, Daniel P. Berrangé wrote:

On Fri, Dec 13, 2024 at 08:46:42AM -0400, Jason Gunthorpe wrote:

On Fri, Dec 13, 2024 at 12:00:43PM +, Daniel P. Berrangé wrote:

On Fri, Nov 08, 2024 at 12:52:37PM +, Shameer Kolothum via wrote:

Hi,

This series adds initial support for a user-creatable "arm-smmuv3-nested" device to Qemu. At present the Qemu ARM SMMUv3 emulation is per machine and cannot support multiple SMMUv3s. In order to support vfio-pci dev assignment with vSMMUv3, the physical SMMUv3 has to be configured in nested mode. Having a pluggable "arm-smmuv3-nested" device enables us to have multiple vSMMUv3s for Guests running on a host with multiple physical SMMUv3s. A few benefits of doing this are,

I'm not very familiar with arm, but from this description I'm not really seeing how "nesting" is involved here. You're only talking about the host and 1 L1 guest, no L2 guest.

nesting is the term the iommu side is using to refer to the 2 dimensional paging, ie a guest page table on top of a hypervisor page table. Nothing to do with vm nesting.

Ok, that naming is destined to cause confusion for many, given the commonly understood use of 'nesting' in the context of VMs...

Also what is the relation between the physical SMMUv3 and the guest SMMUv3 that's referenced ? Is this in fact some form of host device passthrough rather than nesting ?

It is an acceleration feature: the iommu HW does more work instead of the software emulating things. Similar to how the 2d paging option in KVM is an acceleration feature. All of the iommu series on vfio are creating paravirtualized iommu models inside the VM. They access various levels of HW acceleration to speed up the paravirtualization.

... describing it as a HW accelerated iommu makes it significantly clearer to me what this proposal is about. Perhaps the device is better named as "arm-smmuv3-accel" ?

I'm having deja-vu! ;-) Thanks for echoing my earlier statements in this patch series about the use of 'nested' and the better use of 'accel' in these circumstances.

Even 'accel' on an 'arm-smmuv3' is a bit of a hammer, as there can be multiple accel features and/or implementations... I would like to see 'accel' as a parameter to 'arm-smmuv3', and not a complete name-space unto itself, so we can do things like 'accel=cmdvq', accel='2-level', ... and, for libvirt's sanity, a way to get those hw features from sysfs for (possible) migration-compatibility testing.

With regards, Daniel
Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable nested SMMUv3
Nicolin, Hi! On 1/8/25 11:45 PM, Nicolin Chen wrote: On Mon, Dec 16, 2024 at 10:01:29AM +, Shameerali Kolothum Thodi wrote: And patches prior to this commit adds that support: 4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm SMMUv3") Nicolin is soon going to send out those for review. Or I can include those in this series so that it gives a complete picture. Nicolin? Just found that I forgot to reply this one...sorry I asked Don/Eric to take over that vSMMU series: https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/ (The majority of my effort has been still on the kernel side: previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..) Don/Eric, is there any update from your side? Apologies for delayed response, been at customer site, and haven't been keeping up w/biz email. Eric is probably waiting for me to get back and chat as well. Will look to reply early next week. - Don I think it's also a good time to align with each other so we can take our next step in the new year :) Thanks Nicolin
Re: [RFC PATCH v2 13/20] hw/arm/smmuv3-accel: Introduce helpers to batch and issue cache invalidations
On 3/19/25 5:48 AM, Shameerali Kolothum Thodi wrote:

-----Original Message-----
From: Donald Dutile
Sent: Wednesday, March 19, 2025 1:31 AM
To: Shameerali Kolothum Thodi ; qemu-...@nongnu.org; qemu-devel@nongnu.org
Cc: eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com; nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com; mo...@nvidia.com; smost...@google.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org
Subject: Re: [RFC PATCH v2 13/20] hw/arm/smmuv3-accel: Introduce helpers to batch and issue cache invalidations

Shameer,

Hi,

On 3/11/25 10:10 AM, Shameer Kolothum wrote:

From: Nicolin Chen

Inroduce an SMMUCommandBatch and some helpers to batch and issue the commands.

Introduce

Currently separate out TLBI commands and device cache commands to avoid some errata on certain versions of SMMUs. Later it should check the IIDR register to detect if the underlying SMMU hw has such an erratum.

Where is all this info about 'certain versions of SMMUs', and how does 'check IIDR register' have something to do with 'underlying SMMU hw has such an erratum'? -- which IIDR (& bits)? Or are we talking about rsvd SMMU_IDR<> registers?

I guess the batching has constraints on some platforms. IIRC, this was discussed somewhere in a kernel thread. Nicolin, could you please provide some background on this. A lore link if it's discussed upstream, thanks.

And can't these helpers be used for emulated smmuv3 as well as accelerated?

Could be, I guess. But no benefit in terms of performance. Maybe it will make the code look nicer. I will take a look if it doesn't mean too many changes in the emulated path.

Thanks for looking into it. The push is to use common code path(s) to invoke (or not) an accel path.

Thanks,
Shameer
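For readers without the series in front of them, the batching idea being reviewed here can be sketched as below. The names (SMMUCommandBatch, batch_add) follow the patch description but are illustrative assumptions, not the series' actual code, and the host-issue step is stubbed out; the point is simply that TLBI and device-cache (CFGI) commands are collected into separate batches and never issued to the host in one mixed array.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* An SMMUv3 command is 16 bytes (4 words). */
    typedef struct { uint32_t word[4]; } Cmd;

    #define BATCH_MAX 32

    typedef struct SMMUCommandBatch {
        Cmd cmds[BATCH_MAX];
        uint32_t ncmds;
        bool is_tlbi;   /* batch holds TLBI commands (true) or CFGI commands (false) */
    } SMMUCommandBatch;

    /* Stand-in for handing the queued commands to the host SMMU
     * (an iommufd invalidation call in the real series). */
    static void issue_to_host(SMMUCommandBatch *b)
    {
        if (b->ncmds) {
            printf("issuing %u %s command(s)\n", b->ncmds, b->is_tlbi ? "TLBI" : "CFGI");
            b->ncmds = 0;
        }
    }

    /* Queue one command; flush first if the batch is full or if the new
     * command is of the other kind, so TLBI and CFGI are never mixed. */
    static void batch_add(SMMUCommandBatch *b, Cmd cmd, bool is_tlbi)
    {
        if (b->ncmds == BATCH_MAX || (b->ncmds && b->is_tlbi != is_tlbi)) {
            issue_to_host(b);
        }
        b->is_tlbi = is_tlbi;
        b->cmds[b->ncmds++] = cmd;
    }

    int main(void)
    {
        SMMUCommandBatch b = {0};
        Cmd tlbi = {{0}}, cfgi = {{0}};

        batch_add(&b, tlbi, true);
        batch_add(&b, tlbi, true);
        batch_add(&b, cfgi, false);   /* forces the pending TLBI batch out first */
        issue_to_host(&b);            /* flush the remaining CFGI batch */
        return 0;
    }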
Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device
On 3/18/25 3:13 PM, Nicolin Chen wrote: On Tue, Mar 18, 2025 at 07:31:36PM +0100, Eric Auger wrote: On 3/17/25 9:19 PM, Nicolin Chen wrote: On Mon, Mar 17, 2025 at 04:24:53PM -0300, Jason Gunthorpe wrote: On Mon, Mar 17, 2025 at 12:10:19PM -0700, Nicolin Chen wrote: Another question: how does an emulated device work with a vSMMUv3? I could imagine that all the accel steps would be bypassed since !sdev->idev. Yet, the emulated iotlb should cache its translation so we will need to flush the iotlb, which will increase complexity as the TLBI command dispatching function will need to be aware what ASID is for emulated device and what is for vfio device.. I think you should block it. We already expect different vSMMU's depending on the physical SMMU under the PCI device, it makes sense that a SW VFIO device would have it's own, non-accelerated, vSMMU model in the guest. Yea, I agree and it'd be cleaner for an implementation separating them. In my mind, the general idea of "accel=on" is also to keep things in a more efficient way: passthrough devices go to HW-accelerated vSMMUs (separated PCIE buses), while emulated ones go to a vSMMU- bypassed (PCIE0). Originally a specific SMMU device was needed to opt in for MSI reserved region ACPI IORT description which are not needed if you don't rely on S1+S2. However if we don't rely on this trick this was not even needed with legacy integration (https://patchwork.kernel.org/project/qemu-devel/cover/20180921081819.9203-1-eric.au...@redhat.com/). Nevertheless I don't think anything prevents the acceleration granted device from also working with virtio/vhost devices for instance unless you unplug the existing infra. The translation and invalidation just should use different control paths (explicit translation requests, invalidation notifications towards vhost, ...). smmuv3_translate() is per sdev, so it's easy. Invalidation is done via commands, which could be tricky: a) Broadcast command b) ASID validation -- we'll need to keep track of a list of ASIDs for vfio device to compare the ASID in each per-ASID command, potentially by trapping all CFGI_CD(_ALL) commands? Note that each vfio device may have multiple ASIDs (for multiple CDs). Either a or b above will have some validation efficiency impact. Again, what does legitimate to have different qemu devices for the same IP? I understand that it simplifies the implementation but I am not sure this is a good reason. Nevertheless it worth challenging. What is the plan for intel iommu? Will we have 2 devices, the legacy device and one for nested? Hmm, it seems that there are two different topics: 1. Use one SMMU device model (source code file; "iommu=" string) for both an emulated vSMMU and an HW-accelerated vSMMU. 2. Allow one vSMMU instance to work with both an emulated device and a passthrough device. And I get that you want both 1 and 2. I'm totally okay with 1, yet see no compelling benefit from 2 for the increased complexity in the invalidation routine. And another question about the mixed device attachment. Let's say we have in the host: VFIO passthrough dev0 -> pSMMU0 VFIO passthrough dev1 -> pSMMU1 Should we allow emulated devices to be flexibly plugged? dev0 -> vSMMU0 /* Hard requirement */ dev1 -> vSMMU1 /* Hard requirement */ emu0 -> vSMMU0 /* Soft requirement; can be vSMMU1 also */ emu1 -> vSMMU1 /* Soft requirement; can be vSMMU0 also */ Thanks Nicolin I agree w/Jason & Nicolin: different vSMMUs for pass-through devices than emulated, & vice-versa. Not mixing... because... 
of the next agreement: I agree with Eric that 'accel' isn't needed -- this should be ascertained from the pSMMU that a physical device is attached to.

Now... how does the vfio(?; why not qemu?) layer determine that? -- where are SMMUv3 'accel' features exposed: either (a) in the device struct (for the smmuv3) or (b) somewhere under sysfs? ... I couldn't find anything under either on my g-h system, but would appreciate a ptr if there is.

And like Eric, although 'accel' is better than the original 'nested', it's non-obvious what accel feature(s) are being turned on, or not. In fact, if broken accel hw occurs ('if' -> 'when'), how should it be turned off? ... If the info is in the kernel, a kernel boot-param will be needed; if it's in sysfs, writing 0 to an enable to disable it may be an alternative as well. Bottom line: we need a way to (a) ascertain the accel feature and (b) disable it when it is broken, so qemu's smmuv3 spec will 'just work'. [This may also help when migrating from a machine that has accel working to one that does not.]

... and when an emulated device is assigned a vSMMU, there are no accel features ... unless we have tunables like batch iotlb invalidation for perf reasons, which can be viewed as an 'accel' option.
Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device
On 3/17/25 3:24 PM, Jason Gunthorpe wrote: On Mon, Mar 17, 2025 at 12:10:19PM -0700, Nicolin Chen wrote: Another question: how does an emulated device work with a vSMMUv3? I could imagine that all the accel steps would be bypassed since !sdev->idev. Yet, the emulated iotlb should cache its translation so we will need to flush the iotlb, which will increase complexity as the TLBI command dispatching function will need to be aware what ASID is for emulated device and what is for vfio device.. I think you should block it. We already expect different vSMMU's ... and when you say 'block', you mean qemu prints out a helpful message like "Mixing emulate/virtual devices and physical devices on a single SMMUv3 is not allowed. Specify separate smmuv3 objects for each type of device; multiple smmuv3 objects may be required for each physical device if they are attached to different smmuv3's in the host system." Or would that be an allowed qemu machine definition, but the 'block' would be a warning like: "Mixing emulated/virtual devices and physical devices on a single SMMUv3 is not recommended for performance reasons. To yield optimal performance, place physical devices on separate SMMUv3 objects than emulated/virtual device SMMUv3 objects." ... and in this case, the physical devices would not use the accel features of an smmuv3, but still be 'functional'. This may be desired for a machine definition that wants to be used on different hosts that may not have the (same) accel feature(s). depending on the physical SMMU under the PCI device, it makes sense that a SW VFIO device would have it's own, non-accelerated, vSMMU model in the guest. Jason
Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb-pcie bus
Shameer,

Hi!

On 3/11/25 10:10 AM, Shameer Kolothum wrote:

The user must associate a pxb-pcie root bus with smmuv3-accel, and that is set as the primary-bus for the smmu dev.

Signed-off-by: Shameer Kolothum
---
 hw/arm/smmuv3-accel.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index c327661636..1471b65374 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -9,6 +9,21 @@
 #include "qemu/osdep.h"
 #include "hw/arm/smmuv3-accel.h"
+#include "hw/pci/pci_bridge.h"
+
+static int smmuv3_accel_pxb_pcie_bus(Object *obj, void *opaque)
+{
+    DeviceState *d = opaque;
+
+    if (object_dynamic_cast(obj, "pxb-pcie-bus")) {
+        PCIBus *bus = PCI_HOST_BRIDGE(obj->parent)->bus;
+        if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus->name)) {
+            object_property_set_link(OBJECT(d), "primary-bus", OBJECT(bus),
+                                     &error_abort);
+        }
+    }
+    return 0;
+}
 
 static void smmu_accel_realize(DeviceState *d, Error **errp)
 {
@@ -17,6 +32,9 @@ static void smmu_accel_realize(DeviceState *d, Error **errp)
     SysBusDevice *dev = SYS_BUS_DEVICE(d);
     Error *local_err = NULL;
 
+    object_child_foreach_recursive(object_get_root(),
+                                   smmuv3_accel_pxb_pcie_bus, d);
+
     object_property_set_bool(OBJECT(dev), "accel", true, &error_abort);
     c->parent_realize(d, &local_err);
     if (local_err) {
@@ -33,6 +51,7 @@ static void smmuv3_accel_class_init(ObjectClass *klass, void *data)
     device_class_set_parent_realize(dc, smmu_accel_realize,
                                     &c->parent_realize);
     dc->hotpluggable = false;
+    dc->bus_type = TYPE_PCIE_BUS;
 }
 
 static const TypeInfo smmuv3_accel_type_info = {

I am not seeing the need for a pxb-pcie bus (switch) introduced for each 'accel'. Isn't the IORT able to define different SMMUs for different RIDs? If so, isn't that sufficient to associate (define) an SMMU<->RID association without introducing a pxb-pcie? And again, I'm not sure how that improves/enables the device<->SMMU associativity? Feel free to enlighten me where I may have mis-read/interpreted the IORT & SMMUv3 specs.

Thanks,
- Don
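For reference, the association being discussed looks roughly like this on the QEMU command line, based on the examples earlier in this thread (the iommufd wiring, bus numbers and host BDF are illustrative assumptions; the final syntax may differ):

  -object iommufd,id=iommufd0 \
  -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \
  -device arm-smmuv3-accel,id=smmu1,bus=pcie.1 \
  -device pcie-root-port,id=rp1,bus=pcie.1,chassis=1,slot=1 \
  -device vfio-pci,host=0000:7d:02.1,bus=rp1,iommufd=iommufd0 \

With this layout, as I read the patch, the smmuv3-accel instance walks the QOM tree at realize time, finds the pxb-pcie-bus named by its bus= property and links it as primary-bus; devices behind that extra root complex are then the ones the IORT maps to this vSMMU.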
Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device
Jason, Hey! On 3/18/25 8:23 PM, Jason Gunthorpe wrote: On Tue, Mar 18, 2025 at 05:22:51PM -0400, Donald Dutile wrote: I agree with Eric that 'accel' isn't needed -- this should be ascertained from the pSMMU that a physical device is attached to. I seem to remember the point was made that we don't actually know if accel is possible, or desired, especially in the case of hotplug. In the case of hw-passthrough hot-plug, what isn't known?: a) domain:b:d.f is known b) thus its hierarchy and SMMUv3 association in the host is known c) thus, if the (accel) features of the SMMUv3 were exposed (known), then the proper setup (separate SMMUv3 vs system-wide-emulated SMMUv3; association of (allocated/configured) vSMMUv3 to pSMMUv3 would be known/made What else is missing? The accelerated mode has a number of limitations that the software mode does not have. I think it does make sense that the user would deliberately choose to use a more restrictive operating mode and then would have to meet the requirements - eg by creating the required number and configuration of vSMMUs. At a qemu-cmd level, the right number & config of smmuv3's, but libvirt, if it had the above info, could auto-generate the right number of smmuv3's (stages, accel-features, etc.) ... just as it does today in creating the right number of pcie bus's, RPs, etc. from simple(r) device specs into more complete, qemu configs. Now... how does vfio(?; why not qemu?) layer determine that? -- where are SMMUv3 'accel' features exposed either: a) in the device struct (for the smmuv3) or (b) somewhere under sysfs? ... I couldn't find anything under either on my g-h system, but would appreciate a ptr if there is. I think it is not discoverable yet other thatn through try-and-fail. Discoverability would probably be some bits in an iommufd GET_INFO ioctl or something like that. I don't see how iommufd would 'get-info' the needed info any better than any other interface/subsystem. ... and like Eric, although 'accel' is better than the original 'nested', it's non-obvious what accel feature(s) are being turned on, or not. There are really only one accel feature - direct HW usage of the IO Page table in the guest (no shadowing). A secondary addon would be direct HW usage of an invalidation queue in the guest. and, if architected correctly, even in (device-specific) sw-provided tables, it could be 'formatted' in a way that it was discoverable by the appropriate layers (libvirt, qemu). Once discoverable, this whole separate accel device -- which is really an attribute of an SMMUv3 -- can be generalized, and reduced, to a much smaller, simpler, sw footprint, with the concept of callbacks (as the series uses) to enable hw accelerators to perform the shadow-ops that fully-emulated smmuv3 would have to do. kernel boot-param will be needed; if in sysfs, a write to 0 an enable(disable) it maybe an alternative as well. Bottom line: we need a way to (a) ascertain the accel feature (b) a way to disable it when it is broken, so qemu's smmuv3 spec will 'just work'. You'd turned it off by not asking qemu to use it, that is sort of the reasoning behind the command line opt in for accel or not. It would make machine-level definitions far more portable if the working/non-working, and the one-accel, or two-accel, or three-accel, or ... features were dynamically determined vs a static (qemu) machine config, that would have to be manipulated each time it ran on a different machine. e.g., cluster sw scans servers for machines with device-X. 
create VMs, assigning some/all of device-X to a VM via its own smmuv3. done. Now, if the smmuv3 features were exposed all the way up to userspace, then one could argue the cluster sw could scan for those features and add it to the accel=x,y,z option of the smmuv3 associated with an assigned device. potato/po-tah-toe cluster sw or libvirt or qemu or scans/reads ... discoverability of the features has to be done by (a) a computer, or (b) an error-prone human. ... all that AI gone to waste ... ;-) - Don Jason
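On the discoverability question raised above: what the RFC itself relies on (see the "Read host SMMUv3 device info" patch later in this thread) is querying the host SMMU's ID registers through iommufd rather than sysfs. A rough sketch of that query path, modeled on the iommufd_backend_get_device_info() call quoted in that patch; the function name, types and exact signature here are assumptions taken from that quote, not verified against the final code, and the QEMU iommufd backend declarations are assumed to be in scope:

    /* Sketch only: mirrors the call used in the RFC's smmuv3-accel code. */
    static bool smmu_accel_host_has_smmuv3(SMMUv3AccelDevice *accel_dev)
    {
        struct iommu_hw_info_arm_smmuv3 info = {};   /* kernel uAPI struct; has idr[6] */
        uint32_t data_type;
        uint64_t caps;

        if (!accel_dev || !accel_dev->idev) {
            return false;                            /* no vfio-pci device bound yet */
        }
        if (!iommufd_backend_get_device_info(accel_dev->idev->iommufd,
                                             accel_dev->idev->devid,
                                             &data_type, &info, sizeof(info),
                                             &caps, NULL)) {
            return false;                            /* query failed */
        }
        /* info.idr[0..5] now mirror the host SMMU_IDR* registers; a feature
         * like vCMDQ would still need its own capability bit (e.g. in caps)
         * before libvirt/QEMU could key off it. */
        return data_type == IOMMU_HW_INFO_TYPE_ARM_SMMUV3;
    }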
Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device
On 3/19/25 2:09 PM, Eric Auger wrote: Hi Nicolin, On 3/19/25 6:14 PM, Nicolin Chen wrote: On Wed, Mar 19, 2025 at 05:45:51PM +0100, Eric Auger wrote: On 3/17/25 8:10 PM, Nicolin Chen wrote: On Mon, Mar 17, 2025 at 07:07:52PM +0100, Eric Auger wrote: On 3/17/25 6:54 PM, Nicolin Chen wrote: On Wed, Mar 12, 2025 at 04:15:10PM +0100, Eric Auger wrote: On 3/11/25 3:10 PM, Shameer Kolothum wrote: Based on SMMUv3 as a parent device, add a user-creatable smmuv3-accel device. In order to support vfio-pci dev assignment with a Guest guest SMMUv3, the physical SMMUv3 has to be configured in nested(S1+s2) nested (s1+s2) mode, with Guest owning the S1 page tables. Subsequent patches will the guest add support for smmuv3-accel to provide this. Can't this -accel smmu also works with emulated devices? Do we want an exclusive usage? Is there any benefit from emulated devices working in the HW- accelerated nested translation mode? Not really but do we have any justification for using different device name in accel mode? I am not even sure that accel option is really needed. Ideally the qemu device should be able to detect it is protecting a VFIO device, in which case it shall check whether nested is supported by host SMMU and then automatically turn accel mode? I gave the example of the vfio device which has different class implementration depending on the iommufd option being set or not. Do you mean that we should just create a regular smmuv3 device and let a VFIO device to turn on this smmuv3's accel mode depending on its LEGACY/IOMMUFD class? no this is not what I meant. I gave an example where depending on an option passed to thye VFIO device you choose one class implement or the other. Option means something like this: -device smmuv3,accel=on instead of -device "smmuv3-accel" ? Yea, I think that's good. Yeah actually that's a big debate for not much. From an implementation pov that shall not change much. The only doubt I have is if we need to conditionnaly expose the MSI RESV regions it is easier to do if we detect we have a smmuv3-accel. what the option allows is the auto mode. Another question: how does an emulated device work with a vSMMUv3? I don't get your question. vSMMUv3 currently only works with emulated devices. Did you mean accelerated SMMUv3? Yea. If "accel=on", how does an emulated device work with that? I could imagine that all the accel steps would be bypassed since !sdev->idev. Yet, the emulated iotlb should cache its translation so we will need to flush the iotlb, which will increase complexity as the TLBI command dispatching function will need to be aware what ASID is for emulated device and what is for vfio device.. I don't get the issue. For emulated device you go through the usual translate path which indeed caches configs and translations. In case the guest invalidates something, you know the SID and you find the entries in the cache that are tagged by this SID. In case you have an accelerated device (indeed if sdev->idev) you don't exercise that path. On invalidation you detect the SID matches a VFIO devoce, propagate the invalidations to the host instead. on the invalidation you should be able to detect pretty easily if you need to flush the emulated caches or propagate the invalidations. Do I miss some extra problematic? I do not say we should support emulated devices and VFIO devices in the same guest iommu group. But I don't see why we couldn't easily plug the accelerated logic in the current logical for emulation/vhost and do not require a different qemu device. 
Hmm, feels like I fundamentally misunderstood your point. a) We implement the device model with the same piece of code but only provide an option "accel=on/off" to switch mode. And both passthrough devices and emulated devices can attach to the same "accel=on" device. I think we all agree we don't want that use case in general. However effectively I was questioning why it couldn't work maybe at the expense of some perf degration. b) We implement the device model with the same piece of code but only provide an option "accel=on/off" to switch mode. Then, an passthrough device can attach to an "accel=on" device, but an emulated device can only attach to an "accel=off" SMMU device. I was thinking that you want case (a). But actually you were just talking about case (b)? I think (b) is totally fine. We certainly can't do case (a): not all TLBI commands gives an "SID" field (so would have to broadcast, i.e. underlying SMMU HW would run commands that were supposed for emulated devices only); in case of vCMDQ, commands for emulated devices would be issued to real HW and I am still confused about that. For instance if the guest sends an NH_ASID, NH_VA invalidation and it happens both the emulated device and VFIO-device share the same cd.asid (same guest iommu domain, which practically should not happen) why shouldn't we propagate the it can't ...
Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device
On 3/19/25 1:04 PM, Eric Auger wrote: On 3/18/25 10:22 PM, Donald Dutile wrote: On 3/18/25 3:13 PM, Nicolin Chen wrote: On Tue, Mar 18, 2025 at 07:31:36PM +0100, Eric Auger wrote: On 3/17/25 9:19 PM, Nicolin Chen wrote: On Mon, Mar 17, 2025 at 04:24:53PM -0300, Jason Gunthorpe wrote: On Mon, Mar 17, 2025 at 12:10:19PM -0700, Nicolin Chen wrote: Another question: how does an emulated device work with a vSMMUv3? I could imagine that all the accel steps would be bypassed since !sdev->idev. Yet, the emulated iotlb should cache its translation so we will need to flush the iotlb, which will increase complexity as the TLBI command dispatching function will need to be aware what ASID is for emulated device and what is for vfio device.. I think you should block it. We already expect different vSMMU's depending on the physical SMMU under the PCI device, it makes sense that a SW VFIO device would have it's own, non-accelerated, vSMMU model in the guest. Yea, I agree and it'd be cleaner for an implementation separating them. In my mind, the general idea of "accel=on" is also to keep things in a more efficient way: passthrough devices go to HW-accelerated vSMMUs (separated PCIE buses), while emulated ones go to a vSMMU- bypassed (PCIE0). Originally a specific SMMU device was needed to opt in for MSI reserved region ACPI IORT description which are not needed if you don't rely on S1+S2. However if we don't rely on this trick this was not even needed with legacy integration (https://patchwork.kernel.org/project/qemu-devel/cover/20180921081819.9203-1-eric.au...@redhat.com/). Nevertheless I don't think anything prevents the acceleration granted device from also working with virtio/vhost devices for instance unless you unplug the existing infra. The translation and invalidation just should use different control paths (explicit translation requests, invalidation notifications towards vhost, ...). smmuv3_translate() is per sdev, so it's easy. Invalidation is done via commands, which could be tricky: a) Broadcast command b) ASID validation -- we'll need to keep track of a list of ASIDs for vfio device to compare the ASID in each per-ASID command, potentially by trapping all CFGI_CD(_ALL) commands? Note that each vfio device may have multiple ASIDs (for multiple CDs). Either a or b above will have some validation efficiency impact. Again, what does legitimate to have different qemu devices for the same IP? I understand that it simplifies the implementation but I am not sure this is a good reason. Nevertheless it worth challenging. What is the plan for intel iommu? Will we have 2 devices, the legacy device and one for nested? Hmm, it seems that there are two different topics: 1. Use one SMMU device model (source code file; "iommu=" string) for both an emulated vSMMU and an HW-accelerated vSMMU. 2. Allow one vSMMU instance to work with both an emulated device and a passthrough device. And I get that you want both 1 and 2. I'm totally okay with 1, yet see no compelling benefit from 2 for the increased complexity in the invalidation routine. And another question about the mixed device attachment. Let's say we have in the host: VFIO passthrough dev0 -> pSMMU0 VFIO passthrough dev1 -> pSMMU1 Should we allow emulated devices to be flexibly plugged? 
dev0 -> vSMMU0 /* Hard requirement */ dev1 -> vSMMU1 /* Hard requirement */ emu0 -> vSMMU0 /* Soft requirement; can be vSMMU1 also */ emu1 -> vSMMU1 /* Soft requirement; can be vSMMU0 also */ Thanks Nicolin I agree w/Jason & Nicolin: different vSMMUs for pass-through devices than emulated, & vice-versa. Not mixing... because... of the next agreement: you need to clarify what you mean by different vSMMUs: are you taking about different instances or different qemu device types? Both. a device needed to use hw-accel feature has to use an smmu that has that feature; an emulated device can use such an smmu, but as mentioned in other threads, if you start with all emulated in one smmu, if you hot-plug a (assigned) device, it needs another smmu that has hw-accel features. Keeping them split makes it easier at config time, and it may enable the code to be simpler... but the other half of my brain wants common code paths with accel/emulate branches but a different smmu instance will like simplify the smmu-(accel-)specific lookups. I agree with Eric that 'accel' isn't needed -- this should be ascertained from the pSMMU that a physical device is attached to. we can simply use an AUTO_ON_OFF property and by default choose AUTO value. That would close the debate ;-) Preaching to the choir... yes. Eric Now... how does vfio(?; why not qemu?) layer determine that? -- where are SMMUv3 'accel' features exposed either: a) in the device struct (for the smmuv3) or (b) somewhere under sysfs?
Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb-pcie bus
On 3/19/25 2:21 PM, Eric Auger wrote: Hi Don, On 3/19/25 5:21 PM, Donald Dutile wrote: On 3/19/25 5:26 AM, Shameerali Kolothum Thodi wrote: Hi Don, Hey! -Original Message- From: Donald Dutile Sent: Tuesday, March 18, 2025 10:12 PM To: Shameerali Kolothum Thodi ; qemu-...@nongnu.org; qemu-devel@nongnu.org Cc: eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com; nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com; mo...@nvidia.com; smost...@google.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org Subject: Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb- pcie bus Shameer, Hi! On 3/11/25 10:10 AM, Shameer Kolothum wrote: User must associate a pxb-pcie root bus to smmuv3-accel and that is set as the primary-bus for the smmu dev. Signed-off-by: Shameer Kolothum --- hw/arm/smmuv3-accel.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index c327661636..1471b65374 100644 --- a/hw/arm/smmuv3-accel.c +++ b/hw/arm/smmuv3-accel.c @@ -9,6 +9,21 @@ #include "qemu/osdep.h" #include "hw/arm/smmuv3-accel.h" +#include "hw/pci/pci_bridge.h" + +static int smmuv3_accel_pxb_pcie_bus(Object *obj, void *opaque) +{ + DeviceState *d = opaque; + + if (object_dynamic_cast(obj, "pxb-pcie-bus")) { + PCIBus *bus = PCI_HOST_BRIDGE(obj->parent)->bus; + if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus- name)) { + object_property_set_link(OBJECT(d), "primary-bus", OBJECT(bus), + &error_abort); + } + } + return 0; +} static void smmu_accel_realize(DeviceState *d, Error **errp) { @@ -17,6 +32,9 @@ static void smmu_accel_realize(DeviceState *d, Error **errp) SysBusDevice *dev = SYS_BUS_DEVICE(d); Error *local_err = NULL; + object_child_foreach_recursive(object_get_root(), + smmuv3_accel_pxb_pcie_bus, d); + object_property_set_bool(OBJECT(dev), "accel", true, &error_abort); c->parent_realize(d, &local_err); if (local_err) { @@ -33,6 +51,7 @@ static void smmuv3_accel_class_init(ObjectClass *klass, void *data) device_class_set_parent_realize(dc, smmu_accel_realize, &c->parent_realize); dc->hotpluggable = false; + dc->bus_type = TYPE_PCIE_BUS; } static const TypeInfo smmuv3_accel_type_info = { I am not seeing the need for a pxb-pcie bus(switch) introduced for each 'accel'. Isn't the IORT able to define different SMMUs for different RIDs? if so, itsn't that sufficient to associate (define) an SMMU<->RID association without introducing a pxb-pcie? and again, I'm not sure how that improves/enables the device<->SMMU associativity? Thanks for taking a look at the series. As discussed elsewhere in this thread(with Eric), normally in physical world (or atleast in the most common cases) SMMUv3 is attached to PCIe Root Complex and if you take a look at the IORT spec, it describes association of ID mappings between a RC node and SMMUV3 node. And if my understanding is correct, in Qemu, only pxb-pcie allows you to add extra root complexes even though it is still plugged to parent(pcie.0). ie, for all devices downstream it acts as a root complex but still plugged into a parent pcie.0. This allows us to add/describe multiple "smmuv3-accel" each associated with a RC. I find the qemu statements a bit unclear here as well. I looked at the hot plug statement(s) in docs/pcie.txt, as I figured that's where dynamic IORT changes would be needed as well. There, it says you can hot-add PCIe devices to RPs, one has to define/add RP's to the machine model for that plug-in. 
Using libvirt, it could auto-add the needed RPs to do dynmaic smmuv3 additions, I am not sure I understand your statement here. we don't want "dynamic" SMMUv3 instantiation. SMMUv3 is a platform device which is supposed to be coldplugged on a pre-existing PCIe hierarchy. The SMMUv3 device is not something that is meant to be hotplugged or hotunplugged. To me we hijack the bus= property to provide information about the IORT IDMAP Dynamic in the sense that if one adds smmuv3 for multiple devices, libvirt will dynamically figure out how to instantiate one, two, three... smmu's in the machine at cold boot. If you want a machine to be able to hot-plug a device that would require another smmu, than the config, and smmu, would have to be explicilty stated; as is done today for hot-plug PCIe if the simple machine that libvirt would make is not sufficient to hot-add a PCIe device. Thanks Eric if I understand how libvirt does that today for pcie devices now (/me looks at d
Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb-pcie bus
On 3/24/25 10:56 AM, Eric Auger wrote: On 3/21/25 1:59 AM, Donald Dutile wrote: On 3/19/25 2:21 PM, Eric Auger wrote: Hi Don, On 3/19/25 5:21 PM, Donald Dutile wrote: On 3/19/25 5:26 AM, Shameerali Kolothum Thodi wrote: Hi Don, Hey! -Original Message- From: Donald Dutile Sent: Tuesday, March 18, 2025 10:12 PM To: Shameerali Kolothum Thodi ; qemu-...@nongnu.org; qemu-devel@nongnu.org Cc: eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com; nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com; mo...@nvidia.com; smost...@google.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org Subject: Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb- pcie bus Shameer, Hi! On 3/11/25 10:10 AM, Shameer Kolothum wrote: User must associate a pxb-pcie root bus to smmuv3-accel and that is set as the primary-bus for the smmu dev. Signed-off-by: Shameer Kolothum --- hw/arm/smmuv3-accel.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index c327661636..1471b65374 100644 --- a/hw/arm/smmuv3-accel.c +++ b/hw/arm/smmuv3-accel.c @@ -9,6 +9,21 @@ #include "qemu/osdep.h" #include "hw/arm/smmuv3-accel.h" +#include "hw/pci/pci_bridge.h" + +static int smmuv3_accel_pxb_pcie_bus(Object *obj, void *opaque) +{ + DeviceState *d = opaque; + + if (object_dynamic_cast(obj, "pxb-pcie-bus")) { + PCIBus *bus = PCI_HOST_BRIDGE(obj->parent)->bus; + if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus- name)) { + object_property_set_link(OBJECT(d), "primary-bus", OBJECT(bus), + &error_abort); + } + } + return 0; +} static void smmu_accel_realize(DeviceState *d, Error **errp) { @@ -17,6 +32,9 @@ static void smmu_accel_realize(DeviceState *d, Error **errp) SysBusDevice *dev = SYS_BUS_DEVICE(d); Error *local_err = NULL; + object_child_foreach_recursive(object_get_root(), + smmuv3_accel_pxb_pcie_bus, d); + object_property_set_bool(OBJECT(dev), "accel", true, &error_abort); c->parent_realize(d, &local_err); if (local_err) { @@ -33,6 +51,7 @@ static void smmuv3_accel_class_init(ObjectClass *klass, void *data) device_class_set_parent_realize(dc, smmu_accel_realize, &c->parent_realize); dc->hotpluggable = false; + dc->bus_type = TYPE_PCIE_BUS; } static const TypeInfo smmuv3_accel_type_info = { I am not seeing the need for a pxb-pcie bus(switch) introduced for each 'accel'. Isn't the IORT able to define different SMMUs for different RIDs? if so, itsn't that sufficient to associate (define) an SMMU<->RID association without introducing a pxb-pcie? and again, I'm not sure how that improves/enables the device<->SMMU associativity? Thanks for taking a look at the series. As discussed elsewhere in this thread(with Eric), normally in physical world (or atleast in the most common cases) SMMUv3 is attached to PCIe Root Complex and if you take a look at the IORT spec, it describes association of ID mappings between a RC node and SMMUV3 node. And if my understanding is correct, in Qemu, only pxb-pcie allows you to add extra root complexes even though it is still plugged to parent(pcie.0). ie, for all devices downstream it acts as a root complex but still plugged into a parent pcie.0. This allows us to add/describe multiple "smmuv3-accel" each associated with a RC. I find the qemu statements a bit unclear here as well. I looked at the hot plug statement(s) in docs/pcie.txt, as I figured that's where dynamic IORT changes would be needed as well. 
There, it says you can hot-add PCIe devices to RPs, one has to define/add RP's to the machine model for that plug-in. Using libvirt, it could auto-add the needed RPs to do dynamic smmuv3 additions, I am not sure I understand your statement here. We don't want "dynamic" SMMUv3 instantiation. SMMUv3 is a platform device which is supposed to be coldplugged on a pre-existing PCIe hierarchy. The SMMUv3 device is not something that is meant to be hotplugged or hotunplugged. To me we hijack the bus= property to provide information about the IORT IDMAP Dynamic in the sense that if one adds smmuv3 for multiple devices, libvirt will dynamically figure out how to instantiate one, two, three... smmu's in the machine at cold boot. If you want a machine to be able to hot-plug a device that would require another smmu, then the config, and smmu, would have to be explicitly stated; as is done today for hot-plug PCIe if the simple machine that libvirt would make is not sufficient to hot-add a PCIe device.
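For instance (illustrative only -- option names as used in this RFC, not a tested command line, and the vfio host BDF is made up), an explicitly stated cold-boot config that leaves room for such a hot-add might look like:

  -device pxb-pcie,id=pcie.1,bus_nr=2 \
  -device arm-smmuv3-accel,bus=pcie.1 \
  -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1

  ... pcie.port1 left empty at cold boot, then later from the monitor:
  (qemu) device_add vfio-pci,host=0000:75:00.1,bus=pcie.port1

i.e. the extra root complex, its SMMU and a spare root port are all declared up front, and only the endpoint is hot-added later.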
Re: [RFC PATCH v2 16/20] hw/arm/smmuv3-accel: Read host SMMUv3 device info
Shameer, Hey, On 3/11/25 10:10 AM, Shameer Kolothum wrote: From: Nicolin Chen Read the underlying SMMUv3 device info and set corresponding IDR bits. We need at least one cold-plugged vfio-pci dev associated with the smmuv3-accel instance to do this now. Hence fail if it is not available. ToDo: The above requirement will be relaxed in future when we add support in the kernel. Signed-off-by: Nicolin Chen Signed-off-by: Shameer Kolothum --- hw/arm/smmuv3-accel.c | 104 ++ hw/arm/trace-events | 1 + include/hw/arm/smmuv3-accel.h | 2 + 3 files changed, 107 insertions(+) diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index 09be838d22..fb08e1d66b 100644 --- a/hw/arm/smmuv3-accel.c +++ b/hw/arm/smmuv3-accel.c @@ -15,6 +15,96 @@ #include "smmuv3-internal.h" +static int +smmuv3_accel_dev_get_info(SMMUv3AccelDevice *accel_dev, uint32_t *data_type, + uint32_t data_len, void *data) +{ +uint64_t caps; + +if (!accel_dev || !accel_dev->idev) { +return -ENOENT; +} + +return !iommufd_backend_get_device_info(accel_dev->idev->iommufd, +accel_dev->idev->devid, +data_type, data, +data_len, &caps, NULL); +} + +static void smmuv3_accel_init_regs(SMMUv3AccelState *s_accel) +{ +SMMUv3State *s = ARM_SMMUV3(s_accel); +SMMUv3AccelDevice *accel_dev; +uint32_t data_type; +uint32_t val; +int ret; + +if (!s_accel->viommu || QLIST_EMPTY(&s_accel->viommu->device_list)) { +error_report("At least one cold-plugged vfio-pci is required for smmuv3-accel!"); +exit(1); +} + +accel_dev = QLIST_FIRST(&s_accel->viommu->device_list); +if (accel_dev->info.idr[0]) { +info_report("reusing the previous hw_info"); +goto out; +} + +ret = smmuv3_accel_dev_get_info(accel_dev, &data_type, +sizeof(accel_dev->info), &accel_dev->info); +if (ret) { +error_report("failed to get SMMU device info"); +return; +} + +if (data_type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) { +error_report("Wrong data type (%d)!", data_type); +return; +} + +out: +trace_smmuv3_accel_get_device_info(accel_dev->info.idr[0], + accel_dev->info.idr[1], + accel_dev->info.idr[3], + accel_dev->info.idr[5]); + +val = FIELD_EX32(accel_dev->info.idr[0], IDR0, BTM); +s->idr[0] = FIELD_DP32(s->idr[0], IDR0, BTM, val); +val = FIELD_EX32(accel_dev->info.idr[0], IDR0, ATS); +s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ATS, val); +val = FIELD_EX32(accel_dev->info.idr[0], IDR0, ASID16); +s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ASID16, val); +val = FIELD_EX32(accel_dev->info.idr[0], IDR0, TERM_MODEL); +s->idr[0] = FIELD_DP32(s->idr[0], IDR0, TERM_MODEL, val); +val = FIELD_EX32(accel_dev->info.idr[0], IDR0, STALL_MODEL); +s->idr[0] = FIELD_DP32(s->idr[0], IDR0, STALL_MODEL, val); +val = FIELD_EX32(accel_dev->info.idr[0], IDR0, STLEVEL); +s->idr[0] = FIELD_DP32(s->idr[0], IDR0, STLEVEL, val); + +val = FIELD_EX32(accel_dev->info.idr[1], IDR1, SIDSIZE); +s->idr[1] = FIELD_DP32(s->idr[1], IDR1, SIDSIZE, val); +val = FIELD_EX32(accel_dev->info.idr[1], IDR1, SSIDSIZE); +s->idr[1] = FIELD_DP32(s->idr[1], IDR1, SSIDSIZE, val); + +val = FIELD_EX32(accel_dev->info.idr[3], IDR3, HAD); +s->idr[3] = FIELD_DP32(s->idr[3], IDR3, HAD, val); +val = FIELD_EX32(accel_dev->info.idr[3], IDR3, RIL); +s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, val); +val = FIELD_EX32(accel_dev->info.idr[3], IDR3, BBML); +s->idr[3] = FIELD_DP32(s->idr[3], IDR3, BBML, val); + +val = FIELD_EX32(accel_dev->info.idr[5], IDR5, GRAN4K); +s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, val); +val = FIELD_EX32(accel_dev->info.idr[5], IDR5, GRAN16K); +s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, val); +val = 
FIELD_EX32(accel_dev->info.idr[5], IDR5, GRAN64K); +s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, val); +val = FIELD_EX32(accel_dev->info.idr[5], IDR5, OAS); +s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, val); + +/* FIXME check iidr and aidr registrs too */ +} + static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *s, SMMUPciBus *sbus, PCIBus *bus, int devfn) { @@ -484,11 +574,25 @@ static void smmu_accel_realize(DeviceState *d, Error **errp) bs->unset_iommu_device = smmuv3_accel_unset_iommu_device; } +static void smmuv3_accel_reset_hold(Object *obj, ResetType type) +{ +SMMUv3AccelState *s = ARM_SMMUV3_ACCEL(obj); +SMMUv3AccelClass *c = ARM_SMMUV3_ACCEL_GET_CLASS(s); + +if (c->parent_phases.hold) { +c->
Re: [RFC PATCH v2 18/20] hw/arm/smmu-common: Bypass emulated IOTLB for a accel SMMUv3
On 3/11/25 10:10 AM, Shameer Kolothum wrote: From: Nicolin Chen If a vSMMU is configured as a accelerated one, HW IOTLB will be used and all cache invalidation should be done to the HW IOTLB too, v.s. the emulated iotlb. In this case, an iommu notifier isn't registered, as the devices behind a SMMUv3-accel would stay in the system address space for stage-2 mappings. However, the KVM code still requests an iommu address space to translate an MSI doorbell gIOVA via get_msi_address_space() and translate(). Since a SMMUv3-accel doesn't register an iommu notifier to flush emulated iotlb, bypass the emulated IOTLB and always walk through the guest-level IO page table. Signed-off-by: Nicolin Chen Signed-off-by: Shameer Kolothum --- hw/arm/smmu-common.c | 21 + 1 file changed, 21 insertions(+) diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c index 9fd455baa0..fd10df8866 100644 --- a/hw/arm/smmu-common.c +++ b/hw/arm/smmu-common.c @@ -77,6 +77,17 @@ static SMMUTLBEntry *smmu_iotlb_lookup_all_levels(SMMUState *bs, uint8_t level = 4 - (inputsize - 4) / stride; SMMUTLBEntry *entry = NULL; +/* + * Stage-1 translation with a accel SMMU in general uses HW IOTLB. However, + * KVM still requests for an iommu address space for an MSI fixup by looking + * up stage-1 page table. Make sure we don't go through the emulated pathway + * so that the emulated iotlb will not need any invalidation. + */ + +if (bs->accel) { +return NULL; +} + while (level <= 3) { uint64_t subpage_size = 1ULL << level_shift(level, tt->granule_sz); uint64_t mask = subpage_size - 1; @@ -142,6 +153,16 @@ void smmu_iotlb_insert(SMMUState *bs, SMMUTransCfg *cfg, SMMUTLBEntry *new) SMMUIOTLBKey *key = g_new0(SMMUIOTLBKey, 1); uint8_t tg = (new->granule - 10) / 2; +/* + * Stage-1 translation with a accel SMMU in general uses HW IOTLB. However, + * KVM still requests for an iommu address space for an MSI fixup by looking + * up stage-1 page table. Make sure we don't go through the emulated pathway + * so that the emulated iotlb will not need any invalidation. + */ +if (bs->accel) { +return; +} + if (g_hash_table_size(bs->iotlb) >= SMMU_IOTLB_MAX_SIZE) { smmu_iotlb_inv_all(bs); } Ah! ... if 'accel', skip emulated code since hw handling it... in common smmu code... I like it! :) - Don
Re: [RFC PATCH v2 13/20] hw/arm/smmuv3-accel: Introduce helpers to batch and issue cache invalidations
Shameer, Hi, On 3/11/25 10:10 AM, Shameer Kolothum wrote: From: Nicolin Chen Inroduce an SMMUCommandBatch and some helpers to batch and issue the Introduce commands. Currently separate out TLBI commands and device cache commands to avoid some errata on certain versions of SMMUs. Later it should check IIDR register to detect if underlying SMMU hw has such an erratum. Where is all this info about 'certain versions of SMMUs' and 'check IIDR register' has something to do with 'underlying SMMU hw such an erratum', -- which IIDR (& bits)? or are we talking about rsvd SMMU_IDR<> registers? And can't these helpers be used for emulated smmuv3 as well as accelerated? Thanks, - Don Signed-off-by: Nicolin Chen Signed-off-by: Shameer Kolothum --- hw/arm/smmuv3-accel.c| 69 hw/arm/smmuv3-internal.h | 29 + 2 files changed, 98 insertions(+) diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index 76134d106a..09be838d22 100644 --- a/hw/arm/smmuv3-accel.c +++ b/hw/arm/smmuv3-accel.c @@ -160,6 +160,75 @@ void smmuv3_accel_install_nested_ste(SMMUDevice *sdev, int sid) nested_data.ste[0]); } +/* Update batch->ncmds to the number of execute cmds */ +int smmuv3_accel_issue_cmd_batch(SMMUState *bs, SMMUCommandBatch *batch) +{ +SMMUv3AccelState *s_accel = ARM_SMMUV3_ACCEL(bs); +uint32_t total = batch->ncmds; +IOMMUFDViommu *viommu_core; +int ret; + +if (!bs->accel) { +return 0; +} + +if (!s_accel->viommu) { +return 0; +} +viommu_core = &s_accel->viommu->core; +ret = iommufd_backend_invalidate_cache(viommu_core->iommufd, + viommu_core->viommu_id, + IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3, + sizeof(Cmd), &batch->ncmds, + batch->cmds); +if (total != batch->ncmds) { +error_report("%s failed: ret=%d, total=%d, done=%d", + __func__, ret, total, batch->ncmds); +return ret; +} + +batch->ncmds = 0; +batch->dev_cache = false; +return ret; +} + +int smmuv3_accel_batch_cmds(SMMUState *bs, SMMUDevice *sdev, +SMMUCommandBatch *batch, Cmd *cmd, +uint32_t *cons, bool dev_cache) +{ +int ret; + +if (!bs->accel) { +return 0; +} + +if (sdev) { +SMMUv3AccelDevice *accel_dev; +accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev); +if (!accel_dev->s1_hwpt) { +return 0; +} +} + +/* + * Currently separate out dev_cache and hwpt for safety, which might + * not be necessary if underlying HW SMMU does not have the errata. + * + * TODO check IIDR register values read from hw_info. 
+ */ +if (batch->ncmds && (dev_cache != batch->dev_cache)) { +ret = smmuv3_accel_issue_cmd_batch(bs, batch); +if (ret) { +*cons = batch->cons[batch->ncmds]; +return ret; +} +} +batch->dev_cache = dev_cache; +batch->cmds[batch->ncmds] = *cmd; +batch->cons[batch->ncmds++] = *cons; +return 0; +} + static bool smmuv3_accel_dev_attach_viommu(SMMUv3AccelDevice *accel_dev, HostIOMMUDeviceIOMMUFD *idev, Error **errp) diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h index 46c8bcae14..4602ae6728 100644 --- a/hw/arm/smmuv3-internal.h +++ b/hw/arm/smmuv3-internal.h @@ -549,13 +549,42 @@ typedef struct CD { uint32_t word[16]; } CD; +/** + * SMMUCommandBatch - batch of invalidation commands for smmuv3-accel + * @cmds: Pointer to list of commands + * @cons: Pointer to list of CONS corresponding to the commands + * @ncmds: Total ncmds in the batch + * @dev_cache: Issue to a device cache + */ +typedef struct SMMUCommandBatch { +Cmd *cmds; +uint32_t *cons; +uint32_t ncmds; +bool dev_cache; +} SMMUCommandBatch; + int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste, SMMUEventInfo *event); void smmuv3_flush_config(SMMUDevice *sdev); #if defined(CONFIG_ARM_SMMUV3_ACCEL) && defined(CONFIG_IOMMUFD) +int smmuv3_accel_issue_cmd_batch(SMMUState *bs, SMMUCommandBatch *batch); +int smmuv3_accel_batch_cmds(SMMUState *bs, SMMUDevice *sdev, +SMMUCommandBatch *batch, Cmd *cmd, +uint32_t *cons, bool dev_cache); void smmuv3_accel_install_nested_ste(SMMUDevice *sdev, int sid); #else +static inline int smmuv3_accel_issue_cmd_batch(SMMUState *bs, + SMMUCommandBatch *batch) +{ +return 0; +} +static inline int smmuv3_accel_batch_cmds(SMMUState *bs, SMMUDevice *sdev, + SMMUCommandBatch *batch, Cmd *cmd, +
Re: [RFC PATCH v2 11/20] hw/arm/smmuv3-accel: Allocate a vDEVICE object for device
Shameer, Hi, On 3/11/25 10:10 AM, Shameer Kolothum wrote: From: Nicolin Chen Allocate and associate a vDEVICE object for the Guest device with the vIOMMU. This will help the kernel to do the vSID --> sid translation whenever required (eg: device specific invalidations). Signed-off-by: Nicolin Chen Signed-off-by: Shameer Kolothum --- hw/arm/smmuv3-accel.c | 22 ++ include/hw/arm/smmuv3-accel.h | 6 ++ 2 files changed, 28 insertions(+) diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index d3a5cf9551..056bd23b2e 100644 --- a/hw/arm/smmuv3-accel.c +++ b/hw/arm/smmuv3-accel.c @@ -109,6 +109,20 @@ void smmuv3_accel_install_nested_ste(SMMUDevice *sdev, int sid) return; } +if (!accel_dev->vdev && accel_dev->idev) { +SMMUVdev *vdev; +uint32_t vdev_id; +SMMUViommu *viommu = accel_dev->viommu; + +iommufd_backend_alloc_vdev(viommu->core.iommufd, accel_dev->idev->devid, + viommu->core.viommu_id, sid, &vdev_id, + &error_abort); +vdev = g_new0(SMMUVdev, 1); +vdev->vdev_id = vdev_id; +vdev->sid = sid; +accel_dev->vdev = vdev; +} + ret = smmu_find_ste(sdev->smmu, sid, &ste, &event); if (ret) { /* @@ -283,6 +297,7 @@ static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn, static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque, int devfn) { +SMMUVdev *vdev; SMMUDevice *sdev; SMMUv3AccelDevice *accel_dev; SMMUViommu *viommu; @@ -312,6 +327,13 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque, trace_smmuv3_accel_unset_iommu_device(devfn, smmu_get_sid(sdev)); viommu = s_accel->viommu; +vdev = accel_dev->vdev; +if (vdev) { +iommufd_backend_free_id(viommu->iommufd, vdev->vdev_id); +g_free(vdev); +accel_dev->vdev = NULL; +} + if (QLIST_EMPTY(&viommu->device_list)) { iommufd_backend_free_id(viommu->iommufd, viommu->bypass_hwpt_id); iommufd_backend_free_id(viommu->iommufd, viommu->abort_hwpt_id); diff --git a/include/hw/arm/smmuv3-accel.h b/include/hw/arm/smmuv3-accel.h index d6b0b1ca30..54b217ab4f 100644 --- a/include/hw/arm/smmuv3-accel.h +++ b/include/hw/arm/smmuv3-accel.h @@ -35,6 +35,11 @@ typedef struct SMMUViommu { QLIST_ENTRY(SMMUViommu) next; } SMMUViommu; +typedef struct SMMUVdev { +uint32_t vdev_id; +uint32_t sid; +} SMMUVdev; + Shouldn't this be 'IOMMUFDVdev' ... it's not an SMMU (v)dev , it's an IOMMUFD/vIOMMU vDEVICE for this SMMU typedef struct SMMUS1Hwpt { IOMMUFDBackend *iommufd; uint32_t hwpt_id; @@ -45,6 +50,7 @@ typedef struct SMMUv3AccelDevice { HostIOMMUDeviceIOMMUFD *idev; SMMUS1Hwpt *s1_hwpt; SMMUViommu *viommu; +SMMUVdev *vdev; QLIST_ENTRY(SMMUv3AccelDevice) next; } SMMUv3AccelDevice;
Re: [RFC PATCH v2 04/20] hw/arm/virt: Add support for smmuv3-accel
Doesn't this commit become moot, if accel becomes an smmuv3 option vs separate device object altogether, dynamically added if a pdev is attached to a host SMMUv3 that has accel feature(s)? Blocking w/virtio-iommu falls under the same situation mentioned in 03/20 wrt mixing emulated & physical devices on the same smmuv3. - Don On 3/11/25 10:10 AM, Shameer Kolothum wrote: Allow cold-plug smmuv3-accel to virt If the machine wide smmuv3 is not specified. No FDT support is added for now. Signed-off-by: Shameer Kolothum --- hw/arm/virt.c | 12 hw/core/sysbus-fdt.c | 1 + include/hw/arm/virt.h | 1 + 3 files changed, 14 insertions(+) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 4a5a9666e9..84a323da55 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -73,6 +73,7 @@ #include "qobject/qlist.h" #include "standard-headers/linux/input.h" #include "hw/arm/smmuv3.h" +#include "hw/arm/smmuv3-accel.h" #include "hw/acpi/acpi.h" #include "target/arm/cpu-qom.h" #include "target/arm/internals.h" @@ -2911,6 +2912,16 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev, platform_bus_link_device(PLATFORM_BUS_DEVICE(vms->platform_bus_dev), SYS_BUS_DEVICE(dev)); } +if (object_dynamic_cast(OBJECT(dev), TYPE_ARM_SMMUV3_ACCEL)) { +if (vms->iommu == VIRT_IOMMU_SMMUV3) { +error_setg(errp, + "iommu=smmuv3 is already specified. can't create smmuv3-accel dev"); +return; +} +if (vms->iommu != VIRT_IOMMU_SMMUV3_ACCEL) { +vms->iommu = VIRT_IOMMU_SMMUV3_ACCEL; +} +} } if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) { @@ -3120,6 +3131,7 @@ static void virt_machine_class_init(ObjectClass *oc, void *data) machine_class_allow_dynamic_sysbus_dev(mc, TYPE_VFIO_AMD_XGBE); machine_class_allow_dynamic_sysbus_dev(mc, TYPE_RAMFB_DEVICE); machine_class_allow_dynamic_sysbus_dev(mc, TYPE_VFIO_PLATFORM); +machine_class_allow_dynamic_sysbus_dev(mc, TYPE_ARM_SMMUV3_ACCEL); #ifdef CONFIG_TPM machine_class_allow_dynamic_sysbus_dev(mc, TYPE_TPM_TIS_SYSBUS); #endif diff --git a/hw/core/sysbus-fdt.c b/hw/core/sysbus-fdt.c index 774c0aed41..c8502ad830 100644 --- a/hw/core/sysbus-fdt.c +++ b/hw/core/sysbus-fdt.c @@ -489,6 +489,7 @@ static const BindingEntry bindings[] = { #ifdef CONFIG_LINUX TYPE_BINDING(TYPE_VFIO_CALXEDA_XGMAC, add_calxeda_midway_xgmac_fdt_node), TYPE_BINDING(TYPE_VFIO_AMD_XGBE, add_amd_xgbe_fdt_node), +TYPE_BINDING("arm-smmuv3-accel", no_fdt_node), VFIO_PLATFORM_BINDING("amd,xgbe-seattle-v1a", add_amd_xgbe_fdt_node), #endif #ifdef CONFIG_TPM diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h index c8e94e6aed..849d1cd5b5 100644 --- a/include/hw/arm/virt.h +++ b/include/hw/arm/virt.h @@ -92,6 +92,7 @@ enum { typedef enum VirtIOMMUType { VIRT_IOMMU_NONE, VIRT_IOMMU_SMMUV3, +VIRT_IOMMU_SMMUV3_ACCEL, VIRT_IOMMU_VIRTIO, } VirtIOMMUType;
Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb-pcie bus
On 3/19/25 5:26 AM, Shameerali Kolothum Thodi wrote: Hi Don, Hey! -Original Message- From: Donald Dutile Sent: Tuesday, March 18, 2025 10:12 PM To: Shameerali Kolothum Thodi ; qemu-...@nongnu.org; qemu-devel@nongnu.org Cc: eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com; nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com; mo...@nvidia.com; smost...@google.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org Subject: Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb- pcie bus Shameer, Hi! On 3/11/25 10:10 AM, Shameer Kolothum wrote: User must associate a pxb-pcie root bus to smmuv3-accel and that is set as the primary-bus for the smmu dev. Signed-off-by: Shameer Kolothum --- hw/arm/smmuv3-accel.c | 19 +++ 1 file changed, 19 insertions(+) diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index c327661636..1471b65374 100644 --- a/hw/arm/smmuv3-accel.c +++ b/hw/arm/smmuv3-accel.c @@ -9,6 +9,21 @@ #include "qemu/osdep.h" #include "hw/arm/smmuv3-accel.h" +#include "hw/pci/pci_bridge.h" + +static int smmuv3_accel_pxb_pcie_bus(Object *obj, void *opaque) +{ +DeviceState *d = opaque; + +if (object_dynamic_cast(obj, "pxb-pcie-bus")) { +PCIBus *bus = PCI_HOST_BRIDGE(obj->parent)->bus; +if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus- name)) { +object_property_set_link(OBJECT(d), "primary-bus", OBJECT(bus), + &error_abort); +} +} +return 0; +} static void smmu_accel_realize(DeviceState *d, Error **errp) { @@ -17,6 +32,9 @@ static void smmu_accel_realize(DeviceState *d, Error **errp) SysBusDevice *dev = SYS_BUS_DEVICE(d); Error *local_err = NULL; +object_child_foreach_recursive(object_get_root(), + smmuv3_accel_pxb_pcie_bus, d); + object_property_set_bool(OBJECT(dev), "accel", true, &error_abort); c->parent_realize(d, &local_err); if (local_err) { @@ -33,6 +51,7 @@ static void smmuv3_accel_class_init(ObjectClass *klass, void *data) device_class_set_parent_realize(dc, smmu_accel_realize, &c->parent_realize); dc->hotpluggable = false; +dc->bus_type = TYPE_PCIE_BUS; } static const TypeInfo smmuv3_accel_type_info = { I am not seeing the need for a pxb-pcie bus(switch) introduced for each 'accel'. Isn't the IORT able to define different SMMUs for different RIDs? if so, itsn't that sufficient to associate (define) an SMMU<->RID association without introducing a pxb-pcie? and again, I'm not sure how that improves/enables the device<->SMMU associativity? Thanks for taking a look at the series. As discussed elsewhere in this thread(with Eric), normally in physical world (or atleast in the most common cases) SMMUv3 is attached to PCIe Root Complex and if you take a look at the IORT spec, it describes association of ID mappings between a RC node and SMMUV3 node. And if my understanding is correct, in Qemu, only pxb-pcie allows you to add extra root complexes even though it is still plugged to parent(pcie.0). ie, for all devices downstream it acts as a root complex but still plugged into a parent pcie.0. This allows us to add/describe multiple "smmuv3-accel" each associated with a RC. I find the qemu statements a bit unclear here as well. I looked at the hot plug statement(s) in docs/pcie.txt, as I figured that's where dynamic IORT changes would be needed as well. There, it says you can hot-add PCIe devices to RPs, one has to define/add RP's to the machine model for that plug-in. 
Using libvirt, it could auto-add the needed RPs to do dynmaic smmuv3 additions, if I understand how libvirt does that today for pcie devices now (/me looks at danpb for feedback). Having said that, current code only allows pxb-pcie root complexes avoiding the pcie.0. The idea behind this was, user can use pcie.0 with a non accel SMMUv3 for any emulated devices avoiding the performance bottlenecks we are discussing for emulated dev+smmuv3-accel cases. But based on the feedback from Eric and Daniel I will relax that restriction and will allow association with pcie.0. So, I think this isn't a restriction that this smmuv3 feature should enforce; lack of a proper RP or pxb-pcie will yield an invalid config issue/error, and the machine definition will be modified to meet the needs for IORT. Thanks, Shameer to root complexes. Feel free to enlighten me where I may have mis-read/interpreted the IORT & SMMUv3 specs. Thanks, - Don
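PS: by 'invalid config issue/error' above I mean something as simple as the following rough sketch (not code from this series) in the accel realize path, so a missing RP/pxb-pcie association fails cleanly instead of producing a broken IORT:

static void smmu_accel_realize(DeviceState *d, Error **errp)
{
    /* reuse the bus lookup already added by this patch */
    object_child_foreach_recursive(object_get_root(),
                                   smmuv3_accel_pxb_pcie_bus, d);

    /* sketch: no root complex was associated via bus= -> clean error */
    if (!object_property_get_link(OBJECT(d), "primary-bus", NULL)) {
        error_setg(errp, "smmuv3-accel: no PCIe root complex associated; "
                   "specify bus=<pcie root complex>");
        return;
    }
    /* ... continue with parent_realize() as in the original patch ... */
}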
Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
Zhenzhong, Hi! Eric asked me to review this series. Since it's rather late since you posted will summarize review feedback below/bottom. - Don On 2/19/25 3:22 AM, Zhenzhong Duan wrote: Hi, Per Jason Wang's suggestion, iommufd nesting series[1] is split into "Enable stage-1 translation for emulated device" series and "Enable stage-1 translation for passthrough device" series. This series is 2nd part focusing on passthrough device. We don't do shadowing of guest page table for passthrough device but pass stage-1 page table to host side to construct a nested domain. There was some effort to enable this feature in old days, see [2] for details. The key design is to utilize the dual-stage IOMMU translation (also known as IOMMU nested translation) capability in host IOMMU. As the below diagram shows, guest I/O page table pointer in GPA (guest physical address) is passed to host and be used to perform the stage-1 address translation. Along with it, modifications to present mappings in the guest I/O page table should be followed with an IOTLB invalidation. .-. .---. | vIOMMU| | Guest I/O page table | | | '---' ./ | PASID Entry |--- PASID cache flush --+ '-'| | |V | | I/O page table pointer in GPA '-' Guest --| Shadow |---| vv v Host .-. .. | pIOMMU| | FS for GIOVA->GPA | | | '' ./ | | PASID Entry | V (Nested xlate) '\.--. | | | SS for GPA->HPA, unmanaged domain| | | '--' '-' Where: - FS = First stage page tables - SS = Second stage page tables I'd prefer the use of 's1' for stage1/First stage, and 's2' for stage2/second stage. We don't need different terms for the same technology in the iommu/iommufd space(s). There are some interactions between VFIO and vIOMMU * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI subsystem. VFIO calls them to register/unregister HostIOMMUDevice instance to vIOMMU at vfio device realize stage. * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt to bind/unbind device to IOMMUFD backed domains, either nested domain or not. See below diagram: VFIO Device Intel IOMMU .-. .---. | | | | | .-|PCIIOMMUOps |.-.| | | IOMMUFD |(set_iommu_device) || Host IOMMU || | | Device |>|| Device list || | .-|(unset_iommu_device) |.-.| | | | | | | | | V | | .-| HostIOMMUDeviceIOMMUFD | .-. | | | IOMMUFD |(attach_hwpt)| | Host IOMMU | | | | link|<| | Device| | | .-|(detach_hwpt)| .-. | | | | | | | | | ... | .-. .---. Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt whenever possible and create new one on demand, also supports multiple iommufd objects and ERRATA_772415. E.g., Stage-2 page table could be shared by different devices if there is no conflict and devices link to same iommufd object, i.e. devices under same host IOMMU can share same stage-2 page table. If there is and 'devices under the same guest'. Different guests cant be sharing the same stage-2 page tables. conflict, i.e. there is one device under non cache coherency mode which is different from others, it requires a separate stage-2 page table in non-CC mode. SPR platform has ERRATA_772415 which requires no readonly mappings in stage-2 page table. This series supports creating VTDIOASContainer with no readonly mappings. If there is a rare case that some IOMMUs on a multiple IOMMU host have ERRATA_772415 and others not, this design can still survive. See below example diagram for a full view: IntelIOMMUState | V .--..--..---.
Re: [PATCH 2/5] vfio: Move realize() after attach_device()
On 4/11/25 6:17 AM, Zhenzhong Duan wrote: Previously device attaching depends on realize() getting host iommu capabilities to check dirty tracking support. Now we save a caps copy in VFIODevice and check that copy for dirty tracking support, there is no dependency any more, move realize() call after attach_device() call in vfio_device_attach(). Drop vfio_device_hiod_realize() which looks redundant now. Suggested-by: Cédric Le Goater Suggested-by: Donald Dutile Signed-off-by: Zhenzhong Duan Reviewed-by: Donald Dutile
Re: [PATCH 0/5] Add support for user creatable SMMUv3 device
Shameer, Hi! First off, like the partitioning of these pieces. On 4/15/25 4:10 AM, Shameer Kolothum wrote: Hi All, This patch series introduces support for a user-creatable SMMUv3 device (-device arm-smmuv3-dev) in QEMU. Can we drop the '-dev', as 'arm-smmu' is sufficient, as is '-device'? I know, internal to QEMU, there's already an ARM_SMMU, but as suggested later, if it uses the same struct, the qemu cmdline syntax is a bit less redundant. The implementation is based on feedback received from the RFCv2[0]: "hw/arm/virt: Add support for user-creatable accelerated SMMUv3" Currently, QEMU's SMMUv3 emulation (iommu=smmuv3) is tied to the machine and does not support instantiating multiple SMMUv3 devices—each associated with a separate PCIe root complex. In contrast, real-world ARM systems often include multiple SMMUv3 instances, each bound to a different PCIe root complex. This also lays the groundwork for supporting accelerated SMMUv3, as proposed in the aforementioned RFC. Please note, the accelerated SMMUv3 support is not part of this series and will be sent out as a separate series later on top of this one. Summary of changes: -Introduces support for multiple -device arm-smmuv3-dev,bus=pcie.x instances. Example usage: -device arm-smmuv3-dev,bus=pcie.0 -device virtio-net-pci,bus=pcie.0 -device pxb-pcie,id=pcie.1,bus_nr=2 -device arm-smmuv3-dev,bus=pcie.1 -device pcie-root-port,id=pcie.port1,bus=pcie.1 -device virtio-net-pci,bus=pcie.port1 -Supports either the legacy iommu=smmuv3 option or the new "-device arm-smmuv3-dev" model. nice! :) Can it support both? i.e., some devices using a system-wide, legacy smmuv3, and some pcie devices using the -device arm-smmuv3 ? If not, then add a check to min-warn or max-fail, that config. I can see old machines being converted/upgraded to new machines, and they will want to switch from legacy->device smmuv3, and catching an edit/update to a machine change (or enabling automated conversion) would be helpful. -Adds device tree bindings for the new SMMUv3 device on the arm/virt machine only, and only for the default pcie.0 root complex. (Note: pxb-pcie root complexes are currently not supported with the device tree due to known limitations[1].) -Restricts usage of the new SMMUv3 device to virt machine versions > 9.2. This is to avoid creating new smmuv3 device for old virt machine versions and accidently breaking the migration support. Please take a look and let me know your feedback. Thanks, Shameer [0]:https://lore.kernel.org/qemu-devel/20250311141045.66620-1-shameerali.kolothum.th...@huawei.com/ [1]:https://lore.kernel.org/qemu-devel/20230421165037.2506-1-jonathan.came...@huawei.com/ Shameer Kolothum (5): hw/arm/smmuv3: Introduce SMMUv3 device hw/arm/virt-acpi-build: Update IORT for multiple smmuv3 devices hw/arm/virt: Factor out common SMMUV3 dt bindings code hw/arm/virt: Add support for smmuv3 device hw/arm/smmuv3: Enable smmuv3 device creation hw/arm/smmuv3.c | 55 ++ hw/arm/virt-acpi-build.c | 119 +++ hw/arm/virt.c| 109 ++- hw/core/sysbus-fdt.c | 3 + include/hw/arm/smmuv3.h | 16 ++ include/hw/arm/virt.h| 2 + 6 files changed, 266 insertions(+), 38 deletions(-)
Re: [PATCH 2/5] hw/arm/virt-acpi-build: Update IORT for multiple smmuv3 devices
On 4/16/25 1:38 AM, Shameerali Kolothum Thodi wrote: -Original Message- From: Nicolin Chen Sent: Wednesday, April 16, 2025 5:19 AM To: Shameerali Kolothum Thodi Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com; ddut...@redhat.com; berra...@redhat.com; nath...@nvidia.com; mo...@nvidia.com; smost...@google.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org Subject: Re: [PATCH 2/5] hw/arm/virt-acpi-build: Update IORT for multiple smmuv3 devices On Tue, Apr 15, 2025 at 09:11:01AM +0100, Shameer Kolothum wrote: +static int get_smmuv3_device(Object *obj, void *opaque) +{ +GArray *sdev_blob = opaque; +int min_bus, max_bus; +VirtMachineState *vms; +PlatformBusDevice *pbus; +SysBusDevice *sbdev; +SMMUv3Device sdev; +hwaddr base; +int irq; +PCIBus *bus; In a reverse christmas tree order? Or we could at least group those similar types together. Yeah. Reverse will look better I guess. +vms = VIRT_MACHINE(qdev_get_machine()); +pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev); +sbdev = SYS_BUS_DEVICE(obj); +base = platform_bus_get_mmio_addr(pbus, sbdev, 0); +irq = platform_bus_get_irqn(pbus, sbdev, 0); + +base += vms->memmap[VIRT_PLATFORM_BUS].base; +irq += vms->irqmap[VIRT_PLATFORM_BUS]; + +pci_bus_range(bus, &min_bus, &max_bus); +sdev.smmu_idmap.input_base = min_bus << 8; +sdev.smmu_idmap.id_count = (max_bus - min_bus + 1) << 8; +sdev.base = base; +sdev.irq = irq + ARM_SPI_BASE; Hmm, these base/irq local variables don't look necessary. +g_array_append_val(sdev_blob, sdev); +return 0; +} + /* * Input Output Remapping Table (IORT) * Conforms to "IO Remapping Table System Software on ARM Platforms", @@ -275,25 +330,42 @@ static void build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) { int i, nb_nodes, rc_mapping_count; -size_t node_size, smmu_offset = 0; +size_t node_size, *smmu_offset = NULL; AcpiIortIdMapping *idmap; +int num_smmus = 0; uint32_t id = 0; GArray *smmu_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping)); GArray *its_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping)); +GArray *smmuv3_devices = g_array_new(false, true, sizeof(SMMUv3Device)); AcpiTable table = { .sig = "IORT", .rev = 3, .oem_id = vms->oem_id, .oem_table_id = vms->oem_table_id }; /* Table 2 The IORT */ acpi_table_begin(&table, table_data); -if (vms->iommu == VIRT_IOMMU_SMMUV3) { -AcpiIortIdMapping next_range = {0}; - +nb_nodes = 2; /* RC, ITS */ +if (vms->iommu == VIRT_IOMMU_SMMUV3_DEV) { +object_child_foreach_recursive(object_get_root(), + get_smmuv3_device, smmuv3_devices); + /* Sort the smmuv3-devices by smmu idmap input_base */ +g_array_sort(smmuv3_devices, smmuv3_dev_idmap_compare); +/* Fill smmu idmap from sorted smmuv3 devices array */ +for (i = 0; i < smmuv3_devices->len; i++) { +SMMUv3Device *s = &g_array_index(smmuv3_devices, SMMUv3Device, i); +g_array_append_val(smmu_idmaps, s->smmu_idmap); +} +num_smmus = smmuv3_devices->len; +} else if (vms->iommu == VIRT_IOMMU_SMMUV3) { object_child_foreach_recursive(object_get_root(), iort_host_bridges, smmu_idmaps); /* Sort the smmu idmap by input_base */ g_array_sort(smmu_idmaps, iort_idmap_compare); VIRT_IOMMU_SMMUV3 seems to fit the struct SMMUv3Device very well, given that it has base, irq, and smmu_idmaps too? One difference though is the legacy VIRT_IOMMU_SMMUV3 one is a global Machine wide one nad can be associated with multiple PCIe root complexes which will result in smmu_idmaps array. See the iort_host_bridges() fn. 
Didn't quite follow this logic; shouldn't the iort be built for devices tagged for a virt-iommu or dependent on a pcie rp (bus num/bus-num range of an RP)? Thus, it's just an IORT with one or more IORT nodes? I could see how the current iort_host_bridge() may need a revision to work with -device arm-smmu configs. Which would bring me back to my earlier question, and it's likely tied to this IORT construction: -- can you have both types defined in a machine model? Maybe we could generalize the struct SMMUv3Device to store both cases. Perhaps just SMMUv3AcpiInfo? And then .. I didn't follow 'SMMUv3AcpiInfo' since I don't see that in current qemu tree; or is the statement trying to say its a 'mere matter of the proper acpi info generation'? Could do. But then you have to have a smmu_idmaps array and then check the length of it to cover legacy SMMUv3 case I guess. Aren't they going to be different idmaps for different IORT nodes? > > @@ -341,10 +413,20 @@ build_
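For reference, the shape being discussed (a rough sketch reusing the fields already in this patch; the name is the open question, IortSMMUv3Node or similar) would be one such container per SMMU instance, whether it is the legacy machine-wide one or a -device one:

typedef struct IortSMMUv3Node {
    int irq;
    hwaddr base;
    GArray *smmu_idmaps;   /* AcpiIortIdMapping entries feeding this SMMU */
    size_t offset;         /* offset of this SMMU node within the IORT */
} IortSMMUv3Node;

Each instance then carries its own idmap array, so different IORT nodes naturally end up with different idmaps.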
Re: [PATCH v2 0/6] Add support for user creatable SMMUv3 device
On 5/2/25 6:27 AM, Shameer Kolothum wrote: Hi All, Changes from v1: https://lore.kernel.org/qemu-devel/20250415081104.71708-1-shameerali.kolothum.th...@huawei.com/ Addressed feedback on v1. Thanks to all. 1. Retained the same name as the legacy SMMUv3(arm-smmuv3) for new device type as well (instead of arm-smmuv3-dev type usage in v1). 2. Changes to ACPI IORT to use the same struct SMMUv3Device for both legacy SMMUv3 and the new SMMUV3 device 3. Removed the restriction of creating SMMUv3 dev if virt ver > 9.2. Cover letter from v1: This patch series introduces support for a user-creatable SMMUv3 device (-device arm-smmuv3) in QEMU. The implementation is based on feedback received from the RFCv2[0]: "hw/arm/virt: Add support for user-creatable accelerated SMMUv3" Currently, QEMU's SMMUv3 emulation (iommu=smmuv3) is tied to the machine should it be clarified as 'to the machine's sysbus' ? and does not support instantiating multiple SMMUv3 devices—each associated with a separate PCIe root complex. In contrast, real-world ARM systems often include multiple SMMUv3 instances, each bound to a different PCIe root complex. This also lays the groundwork for supporting accelerated SMMUv3, as proposed in the aforementioned RFC. Please note, the accelerated SMMUv3 support is not part of this series and will be sent out as a separate series later on top of this one. Summary of changes: -Introduces support for multiple -device arm-smmuv3-dev,bus=pcie.x Drop, as you've done elsewhere. instances. Example usage: -device arm-smmuv3,bus=pcie.0 -device virtio-net-pci,bus=pcie.0 -device pxb-pcie,id=pcie.1,bus_nr=2 -device arm-smmuv3,bus=pcie.1 -device pcie-root-port,id=pcie.port1,bus=pcie.1 -device virtio-net-pci,bus=pcie.port1 -Supports either the legacy iommu=smmuv3 option or the new "-device arm-smmuv3" model.> -Adds device tree bindings for the new SMMUv3 device on the arm/virt machine only, and only for the default pcie.0 root complex. (Note: pxb-pcie root complexes are currently not supported with the device tree due to known limitations[1].) Please take a look and let me know your feedback. Thanks, Shameer [0]:https://lore.kernel.org/qemu-devel/20250311141045.66620-1-shameerali.kolothum.th...@huawei.com/ [1]:https://lore.kernel.org/qemu-devel/20230421165037.2506-1-jonathan.came...@huawei.com/ Nicolin Chen (1): hw/arm/virt: Add an SMMU_IO_LEN macro Shameer Kolothum (5): hw/arm/smmuv3: Add support to associate a PCIe RC hw/arm/virt-acpi-build: Update IORT for multiple smmuv3 devices hw/arm/virt: Factor out common SMMUV3 dt bindings code hw/arm/virt: Add support for smmuv3 device hw/arm/smmuv3: Enable smmuv3 device creation hw/arm/smmuv3.c | 27 +++ hw/arm/virt-acpi-build.c | 162 +++ hw/arm/virt.c| 112 --- hw/core/sysbus-fdt.c | 3 + include/hw/arm/virt.h| 1 + 5 files changed, 244 insertions(+), 61 deletions(-)
Re: [PATCH v2 4/6] hw/arm/virt: Add an SMMU_IO_LEN macro
On 5/2/25 6:27 AM, Shameer Kolothum wrote: From: Nicolin Chen This is useful as the subsequent support for new SMMUv3 dev will also use the same. Signed-off-by: Nicolin Chen Signed-off-by: Shameer Kolothum --- hw/arm/virt.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 464e84ae67..e178282d71 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -147,6 +147,9 @@ static void arm_virt_compat_set(MachineClass *mc) #define LEGACY_RAMLIMIT_GB 255 #define LEGACY_RAMLIMIT_BYTES (LEGACY_RAMLIMIT_GB * GiB) +/* MMIO region size for SMMUv3 */ +#define SMMU_IO_LEN 0x2 + /* Addresses and sizes of our components. * 0..128MB is space for a flash device so we can run bootrom code such as UEFI. * 128MB..256MB is used for miscellaneous device I/O. @@ -178,7 +181,7 @@ static const MemMapEntry base_memmap[] = { [VIRT_FW_CFG] = { 0x0902, 0x0018 }, [VIRT_GPIO] = { 0x0903, 0x1000 }, [VIRT_UART1] = { 0x0904, 0x1000 }, -[VIRT_SMMU] = { 0x0905, 0x0002 }, +[VIRT_SMMU] = { 0x0905, SMMU_IO_LEN }, [VIRT_PCDIMM_ACPI] ={ 0x0907, MEMORY_HOTPLUG_IO_LEN }, [VIRT_ACPI_GED] = { 0x0908, ACPI_GED_EVT_SEL_LEN }, [VIRT_NVDIMM_ACPI] ={ 0x0909, NVDIMM_ACPI_IO_LEN}, @@ -1453,7 +1456,6 @@ static void create_smmu(const VirtMachineState *vms, int irq = vms->irqmap[VIRT_SMMU]; int i; hwaddr base = vms->memmap[VIRT_SMMU].base; -hwaddr size = vms->memmap[VIRT_SMMU].size; DeviceState *dev; if (vms->iommu != VIRT_IOMMU_SMMUV3 || !vms->iommu_phandle) { @@ -1473,7 +1475,7 @@ static void create_smmu(const VirtMachineState *vms, sysbus_connect_irq(SYS_BUS_DEVICE(dev), i, qdev_get_gpio_in(vms->gic, irq + i)); } -create_smmuv3_dt_bindings(vms, base, size, irq); +create_smmuv3_dt_bindings(vms, base, SMMU_IO_LEN, irq); } static void create_virtio_iommu_dt_bindings(VirtMachineState *vms) Reviewed-by: Donald Dutile
Re: [PATCH v2 1/6] hw/arm/smmuv3: Add support to associate a PCIe RC
On 5/2/25 6:27 AM, Shameer Kolothum wrote: Although this change does not affect functionality at present, it lays the groundwork for enabling user-created SMMUv3 devices in future patches Signed-off-by: Shameer Kolothum --- hw/arm/smmuv3.c | 26 ++ hw/arm/virt.c | 3 ++- 2 files changed, 28 insertions(+), 1 deletion(-) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index ab67972353..605de9b721 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -24,6 +24,7 @@ #include "hw/qdev-properties.h" #include "hw/qdev-core.h" #include "hw/pci/pci.h" +#include "hw/pci/pci_bridge.h" #include "cpu.h" #include "exec/target_page.h" #include "trace.h" @@ -1874,6 +1875,25 @@ static void smmu_reset_exit(Object *obj, ResetType type) smmuv3_init_regs(s); } +static int smmuv3_pcie_bus(Object *obj, void *opaque) +{ +DeviceState *d = opaque; +PCIBus *bus; + +if (!object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) { +return 0; +} + +bus = PCI_HOST_BRIDGE(obj)->bus; +if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus->name)) { +object_property_set_link(OBJECT(d), "primary-bus", OBJECT(bus), + &error_abort); +/* Return non-zero as we got the bus and don't need further iteration.*/ +return 1; +} +return 0; +} + static void smmu_realize(DeviceState *d, Error **errp) { SMMUState *sys = ARM_SMMU(d); @@ -1882,6 +1902,10 @@ static void smmu_realize(DeviceState *d, Error **errp) SysBusDevice *dev = SYS_BUS_DEVICE(d); Error *local_err = NULL; +if (!object_property_get_link(OBJECT(d), "primary-bus", &error_abort)) { +object_child_foreach_recursive(object_get_root(), smmuv3_pcie_bus, d); +} + c->parent_realize(d, &local_err); if (local_err) { error_propagate(errp, local_err); @@ -1996,6 +2020,8 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data) device_class_set_parent_realize(dc, smmu_realize, &c->parent_realize); device_class_set_props(dc, smmuv3_properties); +dc->hotpluggable = false; +dc->bus_type = TYPE_PCIE_BUS; Does this force legacy SMMUv3 to be tied to a PCIe bus now? if so, will that break some existing legacy smmuv3 configs?, i.e., virtio-scsi attached to a legacy smmuv3. } static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu, diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 177f3dd22c..3bae4e374f 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -56,6 +56,7 @@ #include "qemu/cutils.h" #include "qemu/error-report.h" #include "qemu/module.h" +#include "hw/pci/pci_bus.h" #include "hw/pci-host/gpex.h" #include "hw/virtio/virtio-pci.h" #include "hw/core/sysbus-fdt.h" @@ -1442,7 +1443,7 @@ static void create_smmu(const VirtMachineState *vms, } object_property_set_link(OBJECT(dev), "primary-bus", OBJECT(bus), &error_abort); -sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), &error_fatal); +qdev_realize_and_unref(dev, &bus->qbus, &error_fatal); sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, base); for (i = 0; i < NUM_SMMU_IRQS; i++) { sysbus_connect_irq(SYS_BUS_DEVICE(dev), i,
Re: [PATCH v2 2/6] hw/arm/virt-acpi-build: Update IORT for multiple smmuv3 devices
On 5/2/25 1:13 PM, Nicolin Chen wrote: On Fri, May 02, 2025 at 11:27:03AM +0100, Shameer Kolothum wrote: @@ -43,6 +43,7 @@ #include "hw/acpi/generic_event_device.h" #include "hw/acpi/tpm.h" #include "hw/acpi/hmat.h" +#include "hw/arm/smmuv3.h" #include "hw/pci/pcie_host.h" #include "hw/pci/pci.h" #include "hw/pci/pci_bus.h" @@ -266,6 +267,75 @@ static int iort_idmap_compare(gconstpointer a, gconstpointer b) return idmap_a->input_base - idmap_b->input_base; } +struct SMMUv3Device { +int irq; +hwaddr base; +GArray *smmu_idmaps; +size_t offset; +}; +typedef struct SMMUv3Device SMMUv3Device; "SMMUv3Device" sounds too general, like coming from smmuv3.h :-/ Given this describes SMMUv3's IORT node, I still think we should name it something like "IortSMMUv3Node" or so. +1.. the more generic name had me thinking it was broader than IORT.. the IORT-related naming is an improvement. +static int smmuv3_dev_idmap_compare(gconstpointer a, gconstpointer b) +{ +SMMUv3Device *sdev_a = (SMMUv3Device *)a; +SMMUv3Device *sdev_b = (SMMUv3Device *)b; +AcpiIortIdMapping *map_a = &g_array_index(sdev_a->smmu_idmaps, + AcpiIortIdMapping, 0); +AcpiIortIdMapping *map_b = &g_array_index(sdev_b->smmu_idmaps, + AcpiIortIdMapping, 0); +return map_a->input_base - map_b->input_base; +} + +static void +get_smmuv3_legacy_dev(VirtMachineState *vms, GArray * smmuv3_devices) GArray *smmuv3_devices Or maybe align with the non-legacy path, i.e. "sdev_blob"? Or the other way around. Otherwise, lgtm Reviewed-by: Nicolin Chen
Re: [PATCH 0/5] Add support for user creatable SMMUv3 device
On 4/22/25 4:57 AM, Shameerali Kolothum Thodi wrote: -Original Message- From: Donald Dutile Sent: Friday, April 18, 2025 9:34 PM To: Shameerali Kolothum Thodi ; qemu-...@nongnu.org; qemu-devel@nongnu.org Cc: eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com; nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com; mo...@nvidia.com; smost...@google.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org Subject: Re: [PATCH 0/5] Add support for user creatable SMMUv3 device Shameer, Hi! First off, like the partitioning of these pieces. On 4/15/25 4:10 AM, Shameer Kolothum wrote: Hi All, This patch series introduces support for a user-creatable SMMUv3 device (-device arm-smmuv3-dev) in QEMU. Can we drop the '-dev', as 'arm-smmu' is sufficient, as is '-device'? I know, internal to QEMU, there's already an ARM_SMMU, but as suggested later, if it uses the same struct, the qemu cmdline syntax is a bit less redundant. Trust me I tried that😊. The problem is that, the legacy one doesn't have a bus associated with it and the moment we change that and have bus specified for the "-device arm-smmuv3, bus=pcie.0" the legacy smmuv3 creation in virt, create_smmu() --> qdev_new(TYPE_ARM_SMMUV3) will fails as it expects the bus type to be specified for it. I couldn't find a way to solve that. I guessed that's why there was new syntax for what is basically an extension of previous syntax. I'll dig more into it and see if there's a way to handle it. qemu does handle varying added params (id=blah, no id=blah), so I'm researching how that's done, to figure out how the legacy, smmu-for-all, and the smmu-for-specific pcie can both be supported. Given your stated trials and tribulations, this should be fun. ;-) - Don The implementation is based on feedback received from the RFCv2[0]: "hw/arm/virt: Add support for user-creatable accelerated SMMUv3" Currently, QEMU's SMMUv3 emulation (iommu=smmuv3) is tied to the machine and does not support instantiating multiple SMMUv3 devices—each associated with a separate PCIe root complex. In contrast, real-world ARM systems often include multiple SMMUv3 instances, each bound to a different PCIe root complex. This also lays the groundwork for supporting accelerated SMMUv3, as proposed in the aforementioned RFC. Please note, the accelerated SMMUv3 support is not part of this series and will be sent out as a separate series later on top of this one. Summary of changes: -Introduces support for multiple -device arm-smmuv3-dev,bus=pcie.x instances. Example usage: -device arm-smmuv3-dev,bus=pcie.0 -device virtio-net-pci,bus=pcie.0 -device pxb-pcie,id=pcie.1,bus_nr=2 -device arm-smmuv3-dev,bus=pcie.1 -device pcie-root-port,id=pcie.port1,bus=pcie.1 -device virtio-net-pci,bus=pcie.port1 -Supports either the legacy iommu=smmuv3 option or the new "-device arm-smmuv3-dev" model. nice! :) Can it support both? i.e., some devices using a system-wide, legacy smmuv3, and some pcie devices using the -device arm-smmuv3 ? Please see my reply to patch #2. In short No, it doesn't support both. Also I think we should slowly deprecate the use of legacy SMMUv3 going forward unless there is a strong use case/reason to support it. If not, then add a check to min-warn or max-fail, that config. I can see old machines being converted/upgraded to new machines, and they will want to switch from legacy->device smmuv3, and catching an edit/update to a machine change (or enabling automated conversion) would be helpful. Please see Patch #4. 
It already checks for, and prevents, incompatible SMMUv3 types being specified together. Or is the suggestion here to add something extra? Please let me know. Thanks, Shameer
Re: [PATCH] pci-ids.rst: Add Red Hat pci-id for AMD IOMMU device
Hi Suravee! Not your issue, but wondering if others know: Why isn't this an issue for Intel-vtd-iommu & ARM-SMMUV3 ? Are they instantiated as non-PCI-id (platform) devices, but AMD puts their IOMMU in PCI space? Adv. thanks for the info. - Don On 3/4/25 1:37 PM, Suravee Suthikulpanit wrote: The QEMU-emulated AMD IOMMU PCI device is implemented based on the AMD I/O Virtualization Technology (IOMMU) Specification [1]. The PCI id for this device is platform-specific. Currently, the QEMU-emulated AMD IOMMU device is using AMD vendor id and undefined device id. Therefore, change the vendor id to Red Hat and request a new QEMU-specific device id. [1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf Cc: Gerd Hoffmann Signed-off-by: Suravee Suthikulpanit --- docs/specs/pci-ids.rst | 2 ++ hw/i386/amd_iommu.c| 3 ++- include/hw/pci/pci.h | 1 + 3 files changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/specs/pci-ids.rst b/docs/specs/pci-ids.rst index 261b0f359f..2416a70a2d 100644 --- a/docs/specs/pci-ids.rst +++ b/docs/specs/pci-ids.rst @@ -100,6 +100,8 @@ PCI devices (other than virtio): PCI UFS device (``-device ufs``) 1b36:0014 PCI RISC-V IOMMU device +1b36:0015 + PCI AMD IOMMU device (``-device amd-iommu``) All these devices are documented in :doc:`index`. diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c index dda1a5781f..4d8564249c 100644 --- a/hw/i386/amd_iommu.c +++ b/hw/i386/amd_iommu.c @@ -1766,7 +1766,8 @@ static void amdvi_pci_class_init(ObjectClass *klass, void *data) DeviceClass *dc = DEVICE_CLASS(klass); PCIDeviceClass *k = PCI_DEVICE_CLASS(klass); -k->vendor_id = PCI_VENDOR_ID_AMD; +k->vendor_id = PCI_VENDOR_ID_REDHAT; +k->device_id = PCI_DEVICE_ID_REDHAT_AMD_IOMMU; k->class_id = 0x0806; k->realize = amdvi_pci_realize; diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h index 4002bbeebd..da44e6673d 100644 --- a/include/hw/pci/pci.h +++ b/include/hw/pci/pci.h @@ -117,6 +117,7 @@ extern bool pci_available; #define PCI_DEVICE_ID_REDHAT_ACPI_ERST 0x0012 #define PCI_DEVICE_ID_REDHAT_UFS 0x0013 #define PCI_DEVICE_ID_REDHAT_RISCV_IOMMU 0x0014 +#define PCI_DEVICE_ID_REDHAT_AMD_IOMMU 0x0015 #define PCI_DEVICE_ID_REDHAT_QXL 0x0100 #define FMT_PCIBUS PRIx64
Re: [PATCH] pci-ids.rst: Add Red Hat pci-id for AMD IOMMU device
On 3/4/25 9:56 PM, Suthikulpanit, Suravee wrote: On 3/5/2025 6:02 AM, Donald Dutile wrote: Hi Suravee! Not your issue, but wondering if others know: Why isn't this an issue for Intel-vtd-iommu & ARM-SMMUV3 ? Are they instantiated as non-PCI-id (platform) devices, but AMD puts their IOMMU in PCI space? Adv. thanks for the info. Unlike AMD IOMMU, Intel VT-d IOMMU and ARM SMMUV3 are not enumerated as a PCI device in the system. Thanks, Suravee ok. thanks for confirmation. --dd
Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device
On 3/19/25 1:00 PM, Eric Auger wrote: Hi, On 3/19/25 1:23 AM, Jason Gunthorpe wrote: On Tue, Mar 18, 2025 at 05:22:51PM -0400, Donald Dutile wrote: I agree with Eric that 'accel' isn't needed -- this should be ascertained from the pSMMU that a physical device is attached to. I seem to remember the point was made that we don't actually know if accel is possible, or desired, especially in the case of hotplug. that's why I think it would be better if we could instantiate a single type of device that can do both accel and non accel mode. Maybe that would be at the price of always enforcing MSI resv regions on guest to assure MSI nesting is possible. The accelerated mode has a number of limitations that the software mode does not have. I think it does make sense that the user would deliberately choose to use a more restrictive operating mode and then would have to meet the requirements - eg by creating the required number and configuration of vSMMUs. To avoid any misunderstanding I am not pushing for having a single vSMMU instance. I advocate for having several instances, each somehow specialized for VFIO devices or emulated devices. Maybe we can opt-in with accel=on but the default could be auto (the property can be AUTO_ON_OFF) where the code detects if a VFIO device is translated. In case incompatible devices are translated by the same vSMMU instance I guess it could be detected and will fail. What I am pushing for is to have a single type of QEMU device which can do both accel and non accel. +1 ! In general I advocate for having several vSMMU instances, each of them Now... how does the vfio(?; why not qemu?) layer determine that? -- where are SMMUv3 'accel' features exposed either: a) in the device struct (for the smmuv3) or (b) somewhere under sysfs? ... I couldn't find anything under either on my g-h system, but would appreciate a ptr if there is. I think it is not discoverable yet other than through try-and-fail. Discoverability would probably be some bits in an iommufd GET_INFO ioctl or something like that. yeah but at least we can easily detect if a VFIO device is being translated by a vSMMU instance in which case there is no other choice but to turn accel on. Thanks Eric and like Eric, although 'accel' is better than the original 'nested', it's non-obvious what accel feature(s) are being turned on, or not. There is really only one accel feature - direct HW usage of the IO Page table in the guest (no shadowing). A secondary addon would be direct HW usage of an invalidation queue in the guest. A kernel boot-param will be needed; if in sysfs, a write of 0 to an enable flag (to disable it) may be an alternative as well. Bottom line: we need a way to (a) ascertain the accel feature (b) a way to disable it when it is broken, so qemu's smmuv3 spec will 'just work'. You'd turn it off by not asking qemu to use it, that is sort of the reasoning behind the command line opt in for accel or not. Jason
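For reference, Eric's accel=auto idea above could be as simple as the following sketch (property and field names are made up here, this is not code from the series): the property defaults to AUTO, and the code flips to the accelerated path only once a VFIO device turns out to be translated by this vSMMU instance:

    DEFINE_PROP_ON_OFF_AUTO("accel", SMMUv3State, accel_mode,
                            ON_OFF_AUTO_AUTO),

and then, roughly, at set_iommu_device time (i.e. when a VFIO device is attached behind this SMMU):

    if (s->accel_mode == ON_OFF_AUTO_OFF) {
        error_setg(errp, "vfio-pci behind this smmuv3 needs accel=on or auto");
        return false;
    }
    s->accel = true;    /* AUTO or ON: use the accelerated (no-shadow) path */

Incompatible mixes on the same instance would then be detected and fail at attach time, as suggested.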
Re: [PATCH v2 1/6] hw/arm/smmuv3: Add support to associate a PCIe RC
On 5/7/25 4:50 AM, Shameerali Kolothum Thodi wrote: -Original Message- From: Markus Armbruster Sent: Wednesday, May 7, 2025 8:17 AM To: Donald Dutile Cc: Shameer Kolothum via ; qemu- a...@nongnu.org; Shameerali Kolothum Thodi ; eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com; nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com; mo...@nvidia.com; smost...@google.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org Subject: Re: [PATCH v2 1/6] hw/arm/smmuv3: Add support to associate a PCIe RC Donald Dutile writes: [...] In this series, an iommu/smmu needs to be placed -BETWEEN- a sysbus and a PCIe-tree, or step-wise, plug an smmuv3 into a sysbus, and a pcie tree/domain/RC into an SMMUv3. RC = root complex? Yes. +1. So, an smmu needs to be associated with a bus (tree), i.e., pcie.0, pcie.1... One could model it as a PCIe device, attached at the pcie-RC ... but that's not how it's modelled in ARM hw. Physical ARM hardware? yes, physical hw. Assuming the virtual devices and buses we're discussing model physical devices and buses: * What are the physical devices of interest? * How are they wired together? Which of the wires are buses, in particular PCI buses? SMMUv3 is a platform device and its placement in a system is typically as below for PCI devices, +--+ | PCIe Devices | +--+ | v +-+ +---+ | PCIe RC A |<>| Interconnect | +-+ +---+ | | +--v---+ | SMMUv3.A | | (IOMMU) | ++-+ | v +---++ | System RAM | ++ This patch is attempting to establish that association between the PCIe RC and the SMMUv3 device so that Qemu can build the ACPI tables/DT iommu mappings for the SMMUv3 device. I would refer to the ARM SMMU spec, Figure 2.3 in the G.a version, where it's slightly different; more like: +--+ | PCIe Devices | (one device, unless a PCIe switch is btwn the RC & 'Devices'; +--+ or, see more typical expansion below) | +-+ | PCIe RC A | +-+ | +--v---++---+ | SMMUv3.A || Wide assortment of other platform | | (IOMMU) || devices not using SMMU | ++-++---+ | | | | +--+--+---+---+-+ | System Interconnect | +---+ | +---++ +-+-+ | System RAM |<--->| CPU (NUMA socket) | ++ +---+ In fact, the PCIe can be quite complex with PCIe bridges, and multiple Root Ports (RP's), and multiple SMMU's: +--+ +--+ +--+ | PCIe Device | | PCIe Device | | PCIe Device | +--+ +--+ +--+ | | |<--- PCIe bus +--+ +--+ +--+ | PCIe RP | | PCIe RP | | PCIe RP | <- may be PCI Bridge, may not +--+ +--+ +--+ | | | +--+ +--+ +--+ | SMMU| | SMMU| | SMMU| +--+ +--+ +--+ | | | <- may be a bus, may not(hidden from OS) +--+--+ | +--+ | PCI RC A| +--+ where PCIe RP's could be represented (even virtually) in -hw- as a PCIe bridge, each downstream being a different PCIe bus under a single PCIe RC (A, in above pic) -domain-. ... or the RPs don't have to have a PCIe bridge, and look like 'just an RP' that provides a PCIe (pt-to-pt, serial) bus, provided by a PCIe RC. ... the PCIe architecture allows both, and I've seen both implementations in hw (at least from an lspci perspective). You can see the above hw implementation by doing an lspci on most PCIe systems (definitely common on x86), where the RP's are represented by 'PCIe bridge' elements -- and lots of them. In real hw, these RP's effectively become (multiple) up
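(For reference, the RC <-> SMMU association being established here is what has to end up as the standard iommu-map property on the RC node in the DT case, and as the equivalent ID mapping array in the IORT RC node for ACPI. E.g. for a single RC fully behind one SMMU, the generated DT fragment is along the lines of:

    pcie@10000000 {
        ...
        /* RIDs 0x0000..0xffff of this RC map 1:1 to stream IDs of &smmu */
        iommu-map = <0x0 &smmu 0x0 0x10000>;
    };

and with multiple SMMUs it becomes one iommu-map entry / one IORT idmap per RC-to-SMMU range.)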
Re: [PATCH v2 1/6] hw/arm/smmuv3: Add support to associate a PCIe RC
On 5/6/25 7:47 AM, Markus Armbruster wrote: Shameer Kolothum via writes: Although this change does not affect functionality at present, it lays the groundwork for enabling user-created SMMUv3 devices in future patches Signed-off-by: Shameer Kolothum --- hw/arm/smmuv3.c | 26 ++ hw/arm/virt.c | 3 ++- 2 files changed, 28 insertions(+), 1 deletion(-) diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c index ab67972353..605de9b721 100644 --- a/hw/arm/smmuv3.c +++ b/hw/arm/smmuv3.c @@ -24,6 +24,7 @@ #include "hw/qdev-properties.h" #include "hw/qdev-core.h" #include "hw/pci/pci.h" +#include "hw/pci/pci_bridge.h" #include "cpu.h" #include "exec/target_page.h" #include "trace.h" @@ -1874,6 +1875,25 @@ static void smmu_reset_exit(Object *obj, ResetType type) smmuv3_init_regs(s); } +static int smmuv3_pcie_bus(Object *obj, void *opaque) +{ +DeviceState *d = opaque; +PCIBus *bus; + +if (!object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) { +return 0; +} + +bus = PCI_HOST_BRIDGE(obj)->bus; +if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus->name)) { +object_property_set_link(OBJECT(d), "primary-bus", OBJECT(bus), + &error_abort); +/* Return non-zero as we got the bus and don't need further iteration.*/ +return 1; +} +return 0; +} + static void smmu_realize(DeviceState *d, Error **errp) { SMMUState *sys = ARM_SMMU(d); @@ -1882,6 +1902,10 @@ static void smmu_realize(DeviceState *d, Error **errp) SysBusDevice *dev = SYS_BUS_DEVICE(d); Error *local_err = NULL; +if (!object_property_get_link(OBJECT(d), "primary-bus", &error_abort)) { +object_child_foreach_recursive(object_get_root(), smmuv3_pcie_bus, d); +} + c->parent_realize(d, &local_err); if (local_err) { error_propagate(errp, local_err); @@ -1996,6 +2020,8 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data) device_class_set_parent_realize(dc, smmu_realize, &c->parent_realize); device_class_set_props(dc, smmuv3_properties); +dc->hotpluggable = false; +dc->bus_type = TYPE_PCIE_BUS; This is very, very wrong. The function serves as .class_init() for QOM type "arm-smmuv3", defined as: static const TypeInfo smmuv3_type_info = { .name = TYPE_ARM_SMMUV3, .parent= TYPE_ARM_SMMU, Subtype of "arm-smmuv3". .instance_size = sizeof(SMMUv3State), .instance_init = smmuv3_instance_init, .class_size= sizeof(SMMUv3Class), .class_init= smmuv3_class_init, }; static const TypeInfo smmu_base_info = { .name = TYPE_ARM_SMMU, .parent= TYPE_SYS_BUS_DEVICE, Subtype of "sys-bus-device". .instance_size = sizeof(SMMUState), .class_data= NULL, .class_size= sizeof(SMMUBaseClass), .class_init= smmu_base_class_init, .abstract = true, }; Have a look at the instance struct: struct SMMUv3State { SMMUState smmu_state; Starts with the supertype's instance struct, as is proper. uint32_t features; [more ...] }; Here's the supertype's instance struct: struct SMMUState { /* */ SysBusDevice dev; Again, starts with the supertype's instance struct. const char *mrtypename; [more...] }; This is a sysbus device, not a PCI device. Monkey-patching dc->bus_type from TYPE_SYSTEM_BUS to TYPE_PCIE_BUS won't change that. All it accomplishes is making the qdev core believe it plugs into a PCIE bus. This can only end in tears. 
In fact, when I build with the entire series applied (so the device can actually be used with -device), the result dies within seconds of my ad hoc testing: $ qemu-system-aarch64 -nodefaults -S -display none -monitor stdio -M virt -device pxb-pcie,id=pcie.1,bus_nr=2 -device arm-smmuv3,bus=pcie.1 QEMU 10.0.50 monitor - type 'help' for more information qemu-system-aarch64: -device arm-smmuv3,bus=pcie.1: warning: SMMUv3 device only supported with pcie.0 for DT (qemu) info qtree bus: main-system-bus type System dev: pxb-host, id "" x-config-reg-migration-enabled = true bypass-iommu = false bus: pcie.1 type pxb-pcie-bus dev: arm-smmuv3, id "" gpio-out "sysbus-irq" 4 stage = "nested" bus_num = 0 (0x0) Segmentation fault (core dumped) Backtrace: #0 0x557d8521 in pcibus_dev_print (mon=0x5826d0e0, dev=0x590ad360, indent=8) at ../hw/pci/pci-hmp-cmds.c:140 #1 0x55eac0a0 in bus_print_dev (bus=, mon=, dev=0x590ad360, indent=8) at ../system/qdev-monitor.c:773 #2 qdev_print (mon=, dev=, indent=8) at ../system/q
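To make the alternative concrete: the association Markus objects to does not need dc->bus_type at all. The base class's existing "primary-bus" link property already expresses it while the device stays a sysbus device. A minimal sketch of that shape (names approximate the smmu-common code, not copied from the patch):

    static const Property smmu_dev_properties[] = {
        /* tie the sysbus SMMU to one PCI(e) root bus; no bus_type change needed */
        DEFINE_PROP_LINK("primary-bus", SMMUState, primary_bus,
                         TYPE_PCI_BUS, PCIBus *),
    };

    static void smmu_base_class_init(ObjectClass *klass, const void *data)
    {
        DeviceClass *dc = DEVICE_CLASS(klass);

        device_class_set_props(dc, smmu_dev_properties);
        /* dc->bus_type stays at its sysbus default; only the link above
         * associates the SMMU with a concrete PCIBus at realize time. */
    }

With that, -device arm-smmuv3,primary-bus=pcie.1 resolves the link when the device is realized, and qdev never believes the SMMU itself plugs into a PCIe bus.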
Re: [PATCH v2 1/6] hw/arm/smmuv3: Add support to associate a PCIe RC
On 5/8/25 9:57 AM, Peter Maydell wrote: On Thu, 8 May 2025 at 14:46, Donald Dutile wrote: I would refer to the ARM SMMU spec, Figure 2.3 in the G.a version, where it's slightly different; more like: +--+ | PCIe Devices | (one device, unless a PCIe switch is btwn the RC & 'Devices'; +--+ or, see more typical expansion below) | +-+ | PCIe RC A | +-+ | +--v---++---+ | SMMUv3.A || Wide assortment of other platform | | (IOMMU) || devices not using SMMU | ++-++---+ | | | | +--+--+---+---+-+ | System Interconnect | +---+ | +---++ +-+-+ | System RAM |<--->| CPU (NUMA socket) | ++ +---+ In fact, the PCIe can be quite complex with PCIe bridges, and multiple Root Ports (RP's), and multiple SMMU's: +--+ +--+ +--+ | PCIe Device | | PCIe Device | | PCIe Device | +--+ +--+ +--+ | | |<--- PCIe bus +--+ +--+ +--+ | PCIe RP | | PCIe RP | | PCIe RP | <- may be PCI Bridge, may not +--+ +--+ +--+ | | | +--+ +--+ +--+ | SMMU| | SMMU| | SMMU| +--+ +--+ +--+ | | | <- may be a bus, may not(hidden from OS) +--+--+ | +--+ | PCI RC A| +--+ The final take away: the (QEMU) SMMU/IOMMU must be associated with a PCIe bus OR, the format has to be something like: -device smmuv3, id=smmuv3.1 -device , smmu=smmuv3.1 where the device <-> SMMU (or if extended to x86, iommu) associativity is set w/o bus associativity. The problem here seems to me to be that in the hardware we're modelling the SMMU always exists, because it's in the SoC, but you're trying to arrange for it to be created on the command line, via -device. We don't have any of these problems with the current 'virt' board code, because we have the board code create the iommu (if the user asks for it via the iommu machine property), and it can wire it up to the PCI root complex as needed. thanks -- PMM Peter, Hey! Learning more about the use-of/focus/expectation of '-device', I see your point. The issue is that there isn't just one SMMU in a system (as the diagram above illustrates), they have associativity to (pcie) devices (possibly more accurately, a pcie-(sub) tree) (for the focus of this discussion; could be non-pcie devices), and we need an option format to associate smmu to pcie(domain/subtree(RPs), devices), and that typically references busses (pxbs?) &/or pcie devices (-device). ... or do we only need to add iommu associativity to devices, ie, the second option line above, -device , iommu=smmuv3.1 ... which should enable the right IORT(s) to be made for multiple smmus/iommus. - Don
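Spelling out the two shapes being weighed (illustrative syntax only; "primary-bus=" is what the series proposes, while "iommu=" is a hypothetical per-device form that does not exist today):

    # (a) bus-scoped: the SMMU claims an entire root bus / root complex
    -device pxb-pcie,id=pcie.1,bus_nr=2,bus=pcie.0
    -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.1
    -device pcie-root-port,id=port1,chassis=1,bus=pcie.1
    -device virtio-net-pci,bus=port1,netdev=net1

    # (b) device-scoped: each endpoint names its IOMMU, no bus association
    -device arm-smmuv3,id=smmuv3.1
    -device virtio-net-pci,netdev=net1,iommu=smmuv3.1

Option (b) matches Don's second form above and drops the bus association entirely; the IORT/iommu-map generation would then have to be derived from the per-device property rather than from the bus topology.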
Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
On 5/20/25 5:13 AM, Duan, Zhenzhong wrote: -Original Message- From: Donald Dutile Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Hey Zhenzhong, Thanks for feedback. replies below. - Don On 5/19/25 4:37 AM, Duan, Zhenzhong wrote: Hi Donald, -Original Message- From: Donald Dutile Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong, Hi! Eric asked me to review this series. Since it's rather late since you posted will summarize review feedback below/bottom. - Don On 2/19/25 3:22 AM, Zhenzhong Duan wrote: ... Did you ever put some tracing in to capture avg hits in cache? ... if so, add as a comment. Otherwise, looks good. Patch 11: Apologies, I don't know what 'flts' stands for, and why it is relative to 2- stage mapping, or SIOV. Could you add verbage to explain the use of it, as the rest of this patch doesn't make any sense to me without the background. The patch introduces hw-info-type (none or intel), and then goes on to add a large set of checks; seems like the caps & this checking should go together (split for each cap; put all caps together & the check...). OK, will do. There are some explanations in cover-letter. For history reason, old vtd spec define stage-1 as first level then switch to first stage. So 'flts' is 'first level then switch' . Sorry for confusion, it stands for 'first level translation support'. Thanks. Patch 12: Why isn't HostIOMMUDevice extended to have another iommu- specif element, opaque in HostIOMMUDevice, but set to specific IOMMU in use? e.g. void *hostiommustate; Yes, that's possible, but we want to make a generic interface between VFIO/VDPA and vIOMMU. ok. I don't understand how VFIO & VPDA complicate that add. IIUC, the hostiommustate provided by VFIO and VDPA may be different format. By using a general interface like .get_cap(), we hide the resolving under VFIO and VDPA backend. This is like the KVM extension checking between QEMU and KVM. FYI, there was some discuss on the interface before, see https://lists.gnu.org/archive/html/qemu-devel/2024-04/msg02658.html Good analogy, thanks. I'll reach out to Cedric on the above discussion as well. Patch 13: Isn't PASID just an extension/addition of BDF id? and doesn't each PASID have its own address space? Yes, it is. So, why isn't it handle via a uniqe AS cache like 'any other device'? Maybe I'm thinking too SMMU-StreamID, which can be varying length, depending on subsystem support. I see what appears to be sid+pasid calls to do the AS lookups; hmm, so maybe this is the generalized BDF+pasid AS lookup? if so, maybe a better description stating this transition to a wider stream-id would set the code context better. Not quite get.. I'm looking for a better description that states the AS cache lookup is broadened from bdf to bdf+pasid. Guess you mean vtd_as_from_iommu_pasid(), it's a variant of vtd_find_add_as(). We support AS cache lookup by bdf+pasid for a long time, see vtd_find_add_as(). Thanks for clarification. As for the rest of the (400 intel-iommu) code, I'm not that in-depth in intel- iommu to determine if its all correct or not. Patch 14: Define PGTT; the second paragraph seem self-contradicting -- it says it uses a 2-stage page table in each case, but it implies it should be different. At 580 lines of code changes, you win! ;-) The host side's using nested or only stage-2 page table depends on PGTT's setting in guest. Thanks for clarification. 
Patch 15: Read-only and Read/write areas have different IOMMUFDs? Is that an intel-iommu requirement? At least this intel-iommu-errata code is only in hw/i386/<> modules. No, with ERRATA_772415, read-only areas should not be mapped, so we allocate a new VTDIOASContainer to hold only the read/write area mappings. We can use the same IOMMUFD for different VTDIOASContainers. ah yes; I got hung-up on the different mappings, and didn't back up to the AS-container split & same IOMMUFD. Patch 16: Looks reasonable. What does the 'SI' mean after "CACHE_DEV", "CACHE_DOM" & "CACHE_PASID"? -- stream-invalidation? VTD_PASID_CACHE_DEVSI stands for 'pasid cache device selective invalidation', VTD_PASID_CACHE_DOMSI means 'pasid cache domain selective invalidation'. That explanation helps. :) Maybe put a short blurb in the commit log, or code, so one doesn't have to be a ninja VT-d spec consumer to comprehend those (important) diffs. Good idea, will do. Thanks Zhenzhong Again, thanks for the reply. Looking fwd to the rfcv3 (on list) or the move to v1-POST. Thanks for your comments! BRs, Zhenzhong Thanks for the added clarifications. - Don
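For readers following along, the "AS cache lookup broadened from bdf to bdf+pasid" point boils down to the key used for the address-space cache. A rough sketch of the idea (field and helper names only approximate the intel_iommu code, they are not exact copies):

    struct vtd_as_key {
        PCIBus *bus;
        uint8_t devfn;
        uint32_t pasid;    /* PCI_NO_PASID when the transaction carries none */
    };

    static guint vtd_as_hash(gconstpointer v)
    {
        const struct vtd_as_key *key = v;

        return (guint)(uintptr_t)key->bus ^ key->devfn ^ key->pasid;
    }

    static gboolean vtd_as_equal(gconstpointer v1, gconstpointer v2)
    {
        const struct vtd_as_key *a = v1, *b = v2;

        return a->bus == b->bus && a->devfn == b->devfn && a->pasid == b->pasid;
    }

Each distinct (bus, devfn, pasid) triple then maps to its own AddressSpace in the hash table that vtd_find_add_as() maintains, which is what lets a single BDF own several PASID-scoped address spaces.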
Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
Hey Zhenzhong, Thanks for feedback. replies below. - Don On 5/19/25 4:37 AM, Duan, Zhenzhong wrote: Hi Donald, -Original Message- From: Donald Dutile Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong, Hi! Eric asked me to review this series. Since it's rather late since you posted will summarize review feedback below/bottom. - Don On 2/19/25 3:22 AM, Zhenzhong Duan wrote: Hi, Per Jason Wang's suggestion, iommufd nesting series[1] is split into "Enable stage-1 translation for emulated device" series and "Enable stage-1 translation for passthrough device" series. This series is 2nd part focusing on passthrough device. We don't do shadowing of guest page table for passthrough device but pass stage-1 page table to host side to construct a nested domain. There was some effort to enable this feature in old days, see [2] for details. The key design is to utilize the dual-stage IOMMU translation (also known as IOMMU nested translation) capability in host IOMMU. As the below diagram shows, guest I/O page table pointer in GPA (guest physical address) is passed to host and be used to perform the stage-1 address translation. Along with it, modifications to present mappings in the guest I/O page table should be followed with an IOTLB invalidation. .-. .---. | vIOMMU| | Guest I/O page table | | | '---' ./ | PASID Entry |--- PASID cache flush --+ '-'| | |V | | I/O page table pointer in GPA '-' Guest --| Shadow |---| vv v Host .-. .. | pIOMMU| | FS for GIOVA->GPA | | | '' ./ | | PASID Entry | V (Nested xlate) '\.--. | | | SS for GPA->HPA, unmanaged domain| | | '--' '-' Where: - FS = First stage page tables - SS = Second stage page tables I'd prefer the use of 's1' for stage1/First stage, and 's2' for stage2/second stage. We don't need different terms for the same technology in the iommu/iommufd space(s). OK, then I'd like to use stage1 and stage2 everywhere which is more verbose. your choice; in other kernel & qemu code I've seen, 's1' and 's2' are used quite frequently, that's why I recommended it -- call it plagerizing! ;-) There are some interactions between VFIO and vIOMMU * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI subsystem. VFIO calls them to register/unregister HostIOMMUDevice instance to vIOMMU at vfio device realize stage. * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt to bind/unbind device to IOMMUFD backed domains, either nested domain or not. See below diagram: VFIO Device Intel IOMMU .-. .---. | | | | | .-|PCIIOMMUOps |.-.| | | IOMMUFD |(set_iommu_device) || Host IOMMU || | | Device |>|| Device list || | .-|(unset_iommu_device) |.-.| | | | | | | | | V | | .-| HostIOMMUDeviceIOMMUFD | .-. | | | IOMMUFD |(attach_hwpt)| | Host IOMMU | | | | link|<| | Device| | | .-|(detach_hwpt)| .-. | | | | | | | | | ... | .-. .---. Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt whenever possible and create new one on demand, also supports multiple iommufd objects and ERRATA_772415. E.g., Stage-2 page table could be shared by different devices if there is no conflict and devices link to same iommufd object, i.e. devices under same host IOMMU can share same stage-2 page table. If there is and
Re: [PATCH] hw/arm/virt: Check bypass iommu is not set for iommu-map DT property
On 6/2/25 7:46 AM, Shameer Kolothum wrote: default_bus_bypass_iommu tells us whether the bypass_iommu is set for the default PCIe root bus. Make sure we check that before adding the "iommu-map" DT property. Fixes: 6d7a85483a06 ("hw/arm/virt: Add default_bus_bypass_iommu machine option") Suggested-by: Eric Auger Signed-off-by: Shameer Kolothum --- hw/arm/virt.c | 15 ++- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/hw/arm/virt.c b/hw/arm/virt.c index 9a6cd085a3..99fde5836c 100644 --- a/hw/arm/virt.c +++ b/hw/arm/virt.c @@ -1487,9 +1487,12 @@ static void create_virtio_iommu_dt_bindings(VirtMachineState *vms) qemu_fdt_setprop_cell(ms->fdt, node, "phandle", vms->iommu_phandle); g_free(node); -qemu_fdt_setprop_cells(ms->fdt, vms->pciehb_nodename, "iommu-map", - 0x0, vms->iommu_phandle, 0x0, bdf, - bdf + 1, vms->iommu_phandle, bdf + 1, 0x - bdf); +if (!vms->default_bus_bypass_iommu) { +qemu_fdt_setprop_cells(ms->fdt, vms->pciehb_nodename, "iommu-map", + 0x0, vms->iommu_phandle, 0x0, bdf, + bdf + 1, vms->iommu_phandle, bdf + 1, + 0x - bdf); +} } static void create_pcie(VirtMachineState *vms) @@ -1612,8 +1615,10 @@ static void create_pcie(VirtMachineState *vms) switch (vms->iommu) { case VIRT_IOMMU_SMMUV3: create_smmu(vms, vms->bus); -qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map", - 0x0, vms->iommu_phandle, 0x0, 0x1); +if (!vms->default_bus_bypass_iommu) { +qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map", + 0x0, vms->iommu_phandle, 0x0, 0x1); +} break; default: g_assert_not_reached(); Reviewed-by: Donald Dutile
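For context on what the guarded property actually emits: "iommu-map" is the standard PCI/IOMMU devicetree binding, a list of (rid-base, iommu phandle, iommu-base, length) tuples. With bypass off, the virt code covers the whole 16-bit RID space, roughly like this (illustrative fragment, not generated output):

    pcie {    /* the PCIe host bridge node */
            ...
            iommu-map = <0x0 &smmuv3 0x0 0x10000>;   /* all RIDs -> SMMU */
    };

With default_bus_bypass_iommu=on the property is now simply omitted, so the root bus's devices stay outside the IOMMU, which is the behaviour the fix restores.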
Re: [PATCH v4 7/7] qemu-options.hx: Document the arm-smmuv3 device
On 6/16/25 7:12 AM, Jonathan Cameron wrote: On Fri, 13 Jun 2025 15:44:49 +0100 Shameer Kolothum wrote: Now that arm,virt can have user-creatable smmuv3 devices, document it. Signed-off-by: Shameer Kolothum --- qemu-options.hx | 6 ++ 1 file changed, 6 insertions(+) diff --git a/qemu-options.hx b/qemu-options.hx index 7eb8e02b4b..3edbde45bb 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -1226,6 +1226,12 @@ SRST ``aw-bits=val`` (val between 32 and 64, default depends on machine) This decides the address width of the IOVA address space. +``-device arm-smmuv3,primary-bus=id`` +This is only supported by ``-machine virt`` (ARM). + +``primary-bus=id`` +The PCIe Root Complex to be associated with. Hmm. Root complex or host bridge? I think an RC is allowed to have multiple hierarchies and hence multiple host bridges. Figure 1.2 in the PCI spec. So my gut feeling is this should be host bridge. +1. The key word-hint: 'complex' -- an RP (a Root *Port*) can only host a single PCI(e) (sub-)tree, but an RC can have multiple PCI domains, not to mention a bunch of platform-level, ACPI-defined PCI(e) devices. + ERST DEF("name", HAS_ARG, QEMU_OPTION_name,
Re: [PATCH v4 1/7] hw/arm/smmu-common: Check SMMU has PCIe Root Complex association
On 6/17/25 12:52 PM, Jonathan Cameron wrote: On Tue, 17 Jun 2025 09:49:54 +0200 Eric Auger wrote: On 6/16/25 12:20 PM, Jonathan Cameron wrote: On Fri, 13 Jun 2025 15:44:43 +0100 Shameer Kolothum wrote: Although this change does not affect functionality at present, it is Patch title says PCIe. This check is vs PCI host bridge. No idea which one you wanted, but if it is PCIe needs to be TYPC_PCIE_HOST_BRIDGE from pcie_host.h not the pci_host.h one I think. I think we need TYPE_PCI_HOST_BRIDGE as we want to check against pxb pci-bridge/pci_expander_bridge.c: .parent = TYPE_PCI_HOST_BRIDGE, Hmm. That's awkward and I'd forgotten that wrinkle. Need a stronger test but which one? The PXB root bus has a parent of TYPE_PCIE_BUS. Maybe we can check that? Yeah, rooted in the problem that we should have pci-switches or pci-rp's, not just 'pxb'. Besides the commit title/desc suggested by Jonathan you may also refine the error msg which can stem now from 2 different causes Thanks Eric required when we add support for user-creatable SMMUv3 devices in future patches. Tested-by: Nathan Chen Signed-off-by: Shameer Kolothum --- hw/arm/smmu-common.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c index f39b99e526..7890aa12c1 100644 --- a/hw/arm/smmu-common.c +++ b/hw/arm/smmu-common.c @@ -20,6 +20,7 @@ #include "trace.h" #include "exec/target_page.h" #include "hw/core/cpu.h" +#include "hw/pci/pci_bridge.h" #include "hw/qdev-properties.h" #include "qapi/error.h" #include "qemu/jhash.h" @@ -937,7 +938,8 @@ static void smmu_base_realize(DeviceState *dev, Error **errp) g_free, g_free); s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL); -if (s->primary_bus) { +if (s->primary_bus && object_dynamic_cast(OBJECT(s->primary_bus)->parent, + TYPE_PCI_HOST_BRIDGE)) { pci_setup_iommu(s->primary_bus, &smmu_ops, s); } else { error_setg(errp, "SMMU is not attached to any PCI bus!");
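One way to phrase the "stronger test" being asked for, given that both the gpex host bridge and a pxb-pcie expander ultimately hand out a PCIe root bus, is to check the linked bus itself rather than its parent device. Sketch only; where this lands in smmu_base_realize() and the exact error wording are assumptions:

    if (s->primary_bus && pci_bus_is_express(s->primary_bus) &&
        pci_bus_is_root(s->primary_bus)) {
        pci_setup_iommu(s->primary_bus, &smmu_ops, s);
    } else {
        error_setg(errp, "SMMU 'primary-bus' must be a PCIe root bus "
                   "(pcie.0 or a pxb-pcie root bus)");
    }

pci_bus_is_express() is a QOM check against TYPE_PCIE_BUS underneath, so it also distinguishes PCIe from plain-PCI host bridges, which was the other concern raised in this subthread.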
Re: [PATCH v3 0/6] hw/arm/virt: Add support for user creatable SMMUv3 device
On 6/4/25 10:02 PM, Nathan Chen wrote: On 6/2/2025 8:41 AM, Shameer Kolothum wrote: This patch series introduces support for a user-creatable SMMUv3 device (-device arm-smmuv3) in QEMU. Tested-by: Nathan Chen I am able to create 16 SMMUv3 devices in a qemu VM with emulated devices properly associated with the guest SMMUs in guest sysfs - verified with some guest SMMUs having two or three emulated NICs assigned to them while other guest SMMUs have a minimum of one assigned. Thanks, Nathan Nathan, Great test! Can you share the xml &/or qemu cmdline, to demonstrate this broad capability? Also would be a good reference when (we know it isn't 'if') a new/different config that wasn't tested shows a bug, and we'll have more info to work with. Thanks, Don
Re: [PATCH v3 0/6] hw/arm/virt: Add support for user creatable SMMUv3 device
On 6/5/25 1:58 PM, Nathan Chen wrote: On 6/4/2025 7:34 PM, Donald Dutile wrote: On 6/4/25 10:02 PM, Nathan Chen wrote: On 6/2/2025 8:41 AM, Shameer Kolothum wrote: This patch series introduces support for a user-creatable SMMUv3 device (-device arm-smmuv3) in QEMU. Tested-by: Nathan Chen I am able to create 16 SMMUv3 devices in a qemu VM with emulated devices properly associated with the guest SMMUs in guest sysfs - verified with some guest SMMUs having two or three emulated NICs assigned to them while other guest SMMUs have a minimum of one assigned. Thanks, Nathan Nathan, Great test! Can you share the xml &/or qemu cmdline, to demonstrate this broad capability? Also would be a good reference when (we know it isn't 'if') a new/ different config that wasn't tested shows a bug, and we'll have more info to work with. Yes, here is the qemu command line I used. smmuv3.15 is associated with two devices, smmuv3.0 is associated with two devices, and smmuv3.1 is associated with three devices: qemu-system-aarch64 \ -machine hmat=on -machine virt,accel=kvm,gic-version=3,ras=on \ -cpu host -smp cpus=4 -m size=16G,slots=4,maxmem=32G -nographic \ -bios /usr/share/AAVMF/AAVMF_CODE.fd \ -device nvme,drive=nvme0,serial=deadbeaf1,bus=pcie.0 \ -drive file=/home/nvidia/nathanc/noble-server-cloudimg-arm64.qcow2,index=0,media=disk,format=qcow2,if=none,id=nvme0 \ -device e1000,romfile=/root/nathanc/efi-e1000.rom,netdev=net0,bus=pcie.0 \ -netdev user,id=net0,hostfwd=tcp::5558-:22,hostfwd=tcp::5586-:5586 \ -netdev user,id=net1 \ -netdev user,id=net2 \ -netdev user,id=net3 \ -netdev user,id=net4 \ -netdev user,id=net5 \ -netdev user,id=net6 \ -netdev user,id=net7 \ -netdev user,id=net8 \ -netdev user,id=net9 \ -netdev user,id=net10 \ -netdev user,id=net11 \ -netdev user,id=net12 \ -netdev user,id=net13 \ -netdev user,id=net14 \ -netdev user,id=net15 \ -netdev user,id=net16 \ -netdev user,id=net17 \ -netdev user,id=net18 \ -netdev user,id=net19 \ -netdev user,id=net20 \ -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.0 \ -device virtio-net-pci,bus=pcie.0,netdev=net1 \ -device virtio-net-pci,bus=pcie.0,netdev=net20 \ -device pxb-pcie,id=pcie.1,bus_nr=200,bus=pcie.0 \ -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.1 \ -device pcie-root-port,id=pcie.port1,bus=pcie.1,slot=1,chassis=1 \ -device virtio-net-pci,bus=pcie.port1,netdev=net2 \ -device pcie-root-port,id=pcie.port17,bus=pcie.1,slot=17,chassis=17 \ -device virtio-net-pci,bus=pcie.port17,netdev=net18 \ -device pcie-root-port,id=pcie.port18,bus=pcie.1,slot=18,chassis=18 \ -device virtio-net-pci,bus=pcie.port18,netdev=net19 \ -device pxb-pcie,id=pcie.2,bus_nr=196,bus=pcie.0 \ -device arm-smmuv3,primary-bus=pcie.2,id=smmuv3.2 \ -device pcie-root-port,id=pcie.port2,bus=pcie.2,slot=2,chassis=2 \ -device virtio-net-pci,bus=pcie.port2,netdev=net3 \ -device pxb-pcie,id=pcie.3,bus_nr=192,bus=pcie.0 \ -device arm-smmuv3,primary-bus=pcie.3,id=smmuv3.3 \ -device pcie-root-port,id=pcie.port3,bus=pcie.3,slot=3,chassis=3 \ -device virtio-net-pci,bus=pcie.port3,netdev=net4 \ -device pxb-pcie,id=pcie.4,bus_nr=188,bus=pcie.0 \ -device arm-smmuv3,primary-bus=pcie.4,id=smmuv3.4 \ -device pcie-root-port,id=pcie.port4,bus=pcie.4,slot=4,chassis=4 \ -device virtio-net-pci,bus=pcie.port4,netdev=net5 \ -device pxb-pcie,id=pcie.5,bus_nr=184,bus=pcie.0 \ -device arm-smmuv3,primary-bus=pcie.5,id=smmuv3.5 \ -device pcie-root-port,id=pcie.port5,bus=pcie.5,slot=5,chassis=5 \ -device virtio-net-pci,bus=pcie.port5,netdev=net6 \ -device pxb-pcie,id=pcie.6,bus_nr=180,bus=pcie.0 \ -device 
arm-smmuv3,primary-bus=pcie.6,id=smmuv3.6 \ -device pcie-root-port,id=pcie.port6,bus=pcie.6,slot=6,chassis=6 \ -device virtio-net-pci,bus=pcie.port6,netdev=net7 \ -device pxb-pcie,id=pcie.7,bus_nr=176,bus=pcie.0 \ -device arm-smmuv3,primary-bus=pcie.7,id=smmuv3.7 \ -device pcie-root-port,id=pcie.port7,bus=pcie.7,slot=7,chassis=7 \ -device virtio-net-pci,bus=pcie.port7,netdev=net8 \ -device pxb-pcie,id=pcie.8,bus_nr=172,bus=pcie.0 \ -device arm-smmuv3,primary-bus=pcie.8,id=smmuv3.8 \ -device pcie-root-port,id=pcie.port8,bus=pcie.8,slot=8,chassis=8 \ -device virtio-net-pci,bus=pcie.port8,netdev=net9 \ -device pxb-pcie,id=pcie.9,bus_nr=168,bus=pcie.0 \ -device arm-smmuv3,primary-bus=pcie.9,id=smmuv3.9 \ -device pcie-root-port,id=pcie.port9,bus=pcie.9,slot=9,chassi
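For anyone reproducing this, the guest-side association Nathan describes can be spot-checked from sysfs once the VM is up (illustrative commands; exact entry names depend on the guest kernel):

    ls /sys/class/iommu/                       # one smmu3.0x... entry per guest SMMUv3
    ls /sys/class/iommu/<smmu>/devices/        # endpoints that ended up behind that SMMU
    readlink /sys/bus/pci/devices/<b:d.f>/iommu   # reverse lookup for a single device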
Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
On 7/9/25 3:20 PM, Nicolin Chen wrote: On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote: +enum { +VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */ +}; Thanks for this work. I am happy to see that we can share the common code that allocates a NESTING_PARENT in the core using this flag. Yet on ARM, a STAGE1 page table isn't always a nested S1, the hardware-accelerated one. More often, it can be just a regular 1-stage translation table via emulated translation code and an emulated IOTLB. Because the user-created smmuv3 started as 'accelerated smmuv3', and had been 'de-accelerated' to simply 'user-created smmuv3', I'm looking for some clarification in the above statement/request. Is the above supposed to reflect that a nested IOMMU has some hw-acceleration in its Stage1 implementation? If so, then call it that: STAGE1_ACCEL. If it's supposed to represent that an IOMMU has nested/2-stage support, then the above is a valid cap; -but-, having a nested/2-stage-capable IOMMU doesn't necessarily mean it's accelerated. Well, there are an emulated "nested" mode and a hw-accelerated "nested" mode in the smmuv3 code, so we had to choose something like "accel" over "nested". Here, on the other hand, I think the core using this CAP would be unlikely to care about an emulated "nested" mode in the individual vIOMMU.. So I suggested: /* hardware-accelerated nested stage-1 page table support */ VIOMMU_CAP_NESTED_S1 = BIT_ULL(0), which should be clear IMHO. If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"? Thanks Nicolin If the distinction is hw-based s1 vs emulated-based s1, then I'd prefer the use of VIOMMU_CAP_HW_NESTED_S1, and avoid the use of 'accel'/'ACCEL' unless it is an explicitly stated 'acceleration' feature/option in the SMMU spec. Thanks, - Don
Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
On 7/8/25 8:39 PM, Nicolin Chen wrote: On Tue, Jul 08, 2025 at 07:05:43AM -0400, Zhenzhong Duan wrote: diff --git a/include/hw/iommu.h b/include/hw/iommu.h new file mode 100644 index 00..e80aaf4431 --- /dev/null +++ b/include/hw/iommu.h @@ -0,0 +1,16 @@ +/* + * General vIOMMU capabilities, flags, etc + * + * Copyright (C) 2025 Intel Corporation. + * + * SPDX-License-Identifier: GPL-2.0-or-later + */ + +#ifndef HW_IOMMU_H +#define HW_IOMMU_H + +enum { +VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */ +}; Thanks for this work. I am happy to see that we can share the common code that allocates a NESTING_PARENT in the core using this flag. Yet on ARM, a STAGE1 page table isn't always a nested S1, the hardware accelerated one. More often, it can be just a regular 1-stage translation table via emulated translation code and an emulated iotlb. Because the user-created smmuv3 started as 'accelerated smmuv3', and had been 'de-accelerated' to simply 'user created smmuv3', I'm looking for some clarification in the above statement/request. Is the above suppose to reflect that a nested IOMMU has some hw-acceleration in its Stage1 implementation? If so, then call it that: STAGE1_ACCEL. If it's suppose to represent that an IOMMU has nested/2-stage support, then the above is a valid cap; -but-, having a nested/2-stage support IOMMU doesn't necessarily mean its accelerated. So, let's ensure terminology, esp. bit-macro names(pace) reflects the (exact) meaning, and not any indirect meaning. A recent kvm-arm email thread had such indirect naming cleaned up when referring to pfn-map/device-map/struct-page-mapped pages, which I'd like not to start down a similar mis-naming/indirect-naming path here. Look forward to the clarification. - Don I think this flag should indicate that the vIOMMU supports a HW-accelerated nested S1 HWPT allocation/invalidation. So, perhaps: /* hardware-accelerated nested stage-1 page table support */ VIOMMU_CAP_NESTED_S1 = BIT_ULL(0), ? Nicolin
Re: [PATCH v7 07/12] hw/pci: Introduce pci_setup_iommu_per_bus() for per-bus IOMMU ops retrieval
On 7/10/25 12:21 PM, Shameerali Kolothum Thodi wrote: -Original Message- From: Donald Dutile Sent: Thursday, July 10, 2025 5:00 PM To: Shameerali Kolothum Thodi ; Nicolin Chen Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com; berra...@redhat.com; imamm...@redhat.com; nath...@nvidia.com; mo...@nvidia.com; smost...@google.com; gustavo.rom...@linaro.org; m...@redhat.com; marcel.apfelb...@gmail.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org Subject: Re: [PATCH v7 07/12] hw/pci: Introduce pci_setup_iommu_per_bus() for per-bus IOMMU ops retrieval On 7/10/25 3:37 AM, Shameerali Kolothum Thodi wrote: -Original Message- From: Nicolin Chen Sent: Thursday, July 10, 2025 1:07 AM To: Shameerali Kolothum Thodi Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com; ddut...@redhat.com; berra...@redhat.com; imamm...@redhat.com; nath...@nvidia.com; mo...@nvidia.com; smost...@google.com; gustavo.rom...@linaro.org; m...@redhat.com; marcel.apfelb...@gmail.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org Subject: Re: [PATCH v7 07/12] hw/pci: Introduce pci_setup_iommu_per_bus() for per-bus IOMMU ops retrieval On Wed, Jul 09, 2025 at 08:20:35AM +, Shameerali Kolothum Thodi wrote: On Tue, Jul 08, 2025 at 04:40:50PM +0100, Shameer Kolothum wrote: @@ -2909,6 +2909,19 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev, } } +/* + * When multiple PCI Express Root Buses are defined using + pxb- pcie, + * the IOMMU configuration may be specific to each root bus. However, + * pxb-pcie acts as a special root complex whose parent + is effectively + * the default root complex(pcie.0). Ensure that we retrieve the + * correct IOMMU ops(if any) in such cases. + */ +if (pci_bus_is_express(iommu_bus) && pci_bus_is_root(iommu_bus)) { +if (!iommu_bus->iommu_per_bus && parent_bus- iommu_per_bus) { +break; Mind elaborating why the current bus must unset iommu_per_bus while its parent sets iommu_per_bus? My understanding is that for a pxb-pcie we should set iommu_per_bus but for its parent (the default root complex) we should unset its iommu_per_bus? Well, for new arm-smmuv3 dev you need an associated pcie root complex. Either the default pcie.0 or a pxb-pcie one. And as I mentioned in my reply to the other thread(patch #2) and commit log here, the pxb-pcie is special extra root complex in Qemu which has pcie.0 has parent bus. The above pci_device_get_iommu_bus_devfn() at present, iterate over the parent_dev if it is set and returns the parent_bus IOMMU ops even if the associated pxb-pcie bus doesn't have any IOMMU. This creates problem for a case that is described here in the cover letter here, https://lore.kernel.org/qemu-devel/20250708154055.101012-1- shameerali.kolothum.th...@huawei.com/ (Please see "Major changes from v4:" section) To address that issue, this patch introduces an new helper function to specify that the IOMMU ops are specific to the associated root complex(iommu_per_bus) and use that to return the correct IOMMU ops. Hope with that context it is clear now. Hmm, I was not questioning the context, I get what the patch is supposed to do. I was asking the logic that is unclear to me why it breaks when: !pxb-pcie->iommu_per_bus && pcie.0->iommu_per_bus Or in which case pcie.0 would be set to iommu_per_bus=true? Yes. 
Consider the example I gave in cover letter, -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1 \ -device virtio-net-pci,bus=pcie.0,netdev=net0,id=virtionet.0 \ -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.2 \ -device pcie-root-port,id=pcie.port1,chassis=2,bus=pcie.1 \ -device virtio-net-pci,bus=pcie.port1,netdev=net1,id=virtionet.1 pcie.0 is behind new SMMUv3 dev(smmuv3.1) and has iommu_per_bus set. pcie.1 has no SMMv3U and iommu_per_bus is not set for it. pcie.1 doesn't? then what is this line saying/meaning?: -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.2 \ I read that as an smmuv3 attached to pcie.1, with an id of smmuv3.2; just as I read this config: -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1 \ as an smmuv3 attached to pcie.0 iwth id smmuv3.1 Oops..I forgot to delete that from the config: This is what I meant, -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1 \ -device virtio-net-pci,bus=pcie.0,netdev=net0,id=virtionet.0 \ -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ -device pcie-root-port,id=pcie.port1,chassis=2,bus=pcie.1 \ -device virtio-net-pci,bus=pcie.port1,netdev=net1,id=virtionet.1 \ Thanks, Shameer the dirt is in the details... thank
Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
On 7/10/25 4:11 AM, Shameerali Kolothum Thodi wrote: -Original Message- From: Nicolin Chen Sent: Wednesday, July 9, 2025 8:20 PM To: Donald Dutile Cc: Zhenzhong Duan ; qemu- de...@nongnu.org; alex.william...@redhat.com; c...@redhat.com; eric.au...@redhat.com; m...@redhat.com; jasow...@redhat.com; pet...@redhat.com; j...@nvidia.com; Shameerali Kolothum Thodi ; joao.m.mart...@oracle.com; clement.mathieu--d...@eviden.com; kevin.t...@intel.com; yi.l@intel.com; chao.p.p...@intel.com Subject: Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap() On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote: +enum { +VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */ +}; Thanks for this work. I am happy to see that we can share the common code that allocates a NESTING_PARENT in the core using this flag. Yet on ARM, a STAGE1 page table isn't always a nested S1, the hardware accelerated one. More often, it can be just a regular 1-stage translation table via emulated translation code and an emulated iotlb. Because the user-created smmuv3 started as 'accelerated smmuv3', and had been 'de-accelerated' to simply 'user created smmuv3', I'm looking for some clarification in the above statement/request. Is the above suppose to reflect that a nested IOMMU has some hw- acceleration in its Stage1 implementation? If so, then call it that: STAGE1_ACCEL. If it's suppose to represent that an IOMMU has nested/2-stage support, then the above is a valid cap; -but-, having a nested/2-stage support IOMMU doesn't necessarily mean its accelerated. Well, there are an emulated "nested" mode and an hw-accelerated "nested" mode in the smmuv3 code, so we had to choose something like "accel" over "nested". Here, on the other hand, I think the core using this CAP would unlikely care about an emulated "nested" mode in the individual vIOMMU.. So I suggested: /* hardware-accelerated nested stage-1 page table support */ VIOMMU_CAP_NESTED_S1 = BIT_ULL(0), which it should be clear IMHO. If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"? I am not sure the _S1 part makes much sense in ARM case. It doesn't matter whether the Guest SMMUv3 is configured in s1/s2 or nested for this CAP. With the new SMMUv3 dev support, the user can pretty much specify, -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1,accel=on,stage={stage1|stage2|nested} And I think it will work with a host SMMUv3 nested configuration in all the above cases. Unless I am missing something and we need to restrict its use with stage=stage1 only. Thanks, Shameer
Re: [PATCH v7 07/12] hw/pci: Introduce pci_setup_iommu_per_bus() for per-bus IOMMU ops retrieval
On 7/10/25 3:37 AM, Shameerali Kolothum Thodi wrote: -Original Message- From: Nicolin Chen Sent: Thursday, July 10, 2025 1:07 AM To: Shameerali Kolothum Thodi Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org; eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com; ddut...@redhat.com; berra...@redhat.com; imamm...@redhat.com; nath...@nvidia.com; mo...@nvidia.com; smost...@google.com; gustavo.rom...@linaro.org; m...@redhat.com; marcel.apfelb...@gmail.com; Linuxarm ; Wangzhou (B) ; jiangkunkun ; Jonathan Cameron ; zhangfei@linaro.org Subject: Re: [PATCH v7 07/12] hw/pci: Introduce pci_setup_iommu_per_bus() for per-bus IOMMU ops retrieval On Wed, Jul 09, 2025 at 08:20:35AM +, Shameerali Kolothum Thodi wrote: On Tue, Jul 08, 2025 at 04:40:50PM +0100, Shameer Kolothum wrote: @@ -2909,6 +2909,19 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev, } } +/* + * When multiple PCI Express Root Buses are defined using pxb- pcie, + * the IOMMU configuration may be specific to each root bus. However, + * pxb-pcie acts as a special root complex whose parent is effectively + * the default root complex(pcie.0). Ensure that we retrieve the + * correct IOMMU ops(if any) in such cases. + */ +if (pci_bus_is_express(iommu_bus) && pci_bus_is_root(iommu_bus)) { +if (!iommu_bus->iommu_per_bus && parent_bus- iommu_per_bus) { +break; Mind elaborating why the current bus must unset iommu_per_bus while its parent sets iommu_per_bus? My understanding is that for a pxb-pcie we should set iommu_per_bus but for its parent (the default root complex) we should unset its iommu_per_bus? Well, for new arm-smmuv3 dev you need an associated pcie root complex. Either the default pcie.0 or a pxb-pcie one. And as I mentioned in my reply to the other thread(patch #2) and commit log here, the pxb-pcie is special extra root complex in Qemu which has pcie.0 has parent bus. The above pci_device_get_iommu_bus_devfn() at present, iterate over the parent_dev if it is set and returns the parent_bus IOMMU ops even if the associated pxb-pcie bus doesn't have any IOMMU. This creates problem for a case that is described here in the cover letter here, https://lore.kernel.org/qemu-devel/20250708154055.101012-1- shameerali.kolothum.th...@huawei.com/ (Please see "Major changes from v4:" section) To address that issue, this patch introduces an new helper function to specify that the IOMMU ops are specific to the associated root complex(iommu_per_bus) and use that to return the correct IOMMU ops. Hope with that context it is clear now. Hmm, I was not questioning the context, I get what the patch is supposed to do. I was asking the logic that is unclear to me why it breaks when: !pxb-pcie->iommu_per_bus && pcie.0->iommu_per_bus Or in which case pcie.0 would be set to iommu_per_bus=true? Yes. Consider the example I gave in cover letter, -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1 \ -device virtio-net-pci,bus=pcie.0,netdev=net0,id=virtionet.0 \ -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.2 \ -device pcie-root-port,id=pcie.port1,chassis=2,bus=pcie.1 \ -device virtio-net-pci,bus=pcie.port1,netdev=net1,id=virtionet.1 pcie.0 is behind new SMMUv3 dev(smmuv3.1) and has iommu_per_bus set. pcie.1 has no SMMv3U and iommu_per_bus is not set for it. pcie.1 doesn't? 
then what is this line saying/meaning?: -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.2 \ I read that as an smmuv3 attached to pcie.1, with an id of smmuv3.2; just as I read this config: -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1 \ as an smmuv3 attached to pcie.0 with id smmuv3.1 And we don't want pci_device_get_iommu_bus_devfn() to return pcie.0's IOMMU ops for virtionet.1. Hence the break. Thanks, Shameer
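Summarizing the mechanism for anyone skimming the thread: the helper named in the subject marks the IOMMU ops as scoped to that root bus, so the walk in pci_device_get_iommu_bus_devfn() can stop at a pxb root bus that has no SMMU of its own instead of falling through to pcie.0's ops. A rough sketch of the registration side (the body is an assumption based on the discussion, not copied from the patch):

    void pci_setup_iommu_per_bus(PCIBus *bus, const PCIIOMMUOps *ops, void *opaque)
    {
        pci_setup_iommu(bus, ops, opaque);
        /* "these ops belong to this root complex only" - see the break above */
        bus->iommu_per_bus = true;
    }

In the corrected config, smmuv3.1 registers against pcie.0 through this helper, pcie.1 registers nothing, and virtionet.1 behind pcie.1 correctly resolves to no IOMMU.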
Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
oops, inadvertently hit send before I could add reply... apologies... reply below... On 7/10/25 1:01 PM, Donald Dutile wrote: On 7/10/25 4:11 AM, Shameerali Kolothum Thodi wrote: -Original Message- From: Nicolin Chen Sent: Wednesday, July 9, 2025 8:20 PM To: Donald Dutile Cc: Zhenzhong Duan ; qemu- de...@nongnu.org; alex.william...@redhat.com; c...@redhat.com; eric.au...@redhat.com; m...@redhat.com; jasow...@redhat.com; pet...@redhat.com; j...@nvidia.com; Shameerali Kolothum Thodi ; joao.m.mart...@oracle.com; clement.mathieu--d...@eviden.com; kevin.t...@intel.com; yi.l@intel.com; chao.p.p...@intel.com Subject: Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap() On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote: +enum { + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */ +}; Thanks for this work. I am happy to see that we can share the common code that allocates a NESTING_PARENT in the core using this flag. Yet on ARM, a STAGE1 page table isn't always a nested S1, the hardware accelerated one. More often, it can be just a regular 1-stage translation table via emulated translation code and an emulated iotlb. Because the user-created smmuv3 started as 'accelerated smmuv3', and had been 'de-accelerated' to simply 'user created smmuv3', I'm looking for some clarification in the above statement/request. Is the above suppose to reflect that a nested IOMMU has some hw- acceleration in its Stage1 implementation? If so, then call it that: STAGE1_ACCEL. If it's suppose to represent that an IOMMU has nested/2-stage support, then the above is a valid cap; -but-, having a nested/2-stage support IOMMU doesn't necessarily mean its accelerated. Well, there are an emulated "nested" mode and an hw-accelerated "nested" mode in the smmuv3 code, so we had to choose something like "accel" over "nested". Here, on the other hand, I think the core using this CAP would unlikely care about an emulated "nested" mode in the individual vIOMMU.. So I suggested: /* hardware-accelerated nested stage-1 page table support */ VIOMMU_CAP_NESTED_S1 = BIT_ULL(0), which it should be clear IMHO. If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"? I am not sure the _S1 part makes much sense in ARM case. It doesn't matter whether the Guest SMMUv3 is configured in s1/s2 or nested for this CAP. With the new SMMUv3 dev support, the user can pretty much specify, -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1,accel=on,stage={stage1|stage2|nested} And I think it will work with a host SMMUv3 nested configuration in all the above cases. Unless I am missing something and we need to restrict its use with stage=stage1 only. Thanks, Shameer Shameer, I didn't think of these config options this way... so is this CAP always 0 for smmuv3's associated VIOMMU? or ... should intel-iommu follow this same/smmuv3 config option(s) and avoid this VIOMMU CAP req. ? - Don
Re: [PATCH v6 01/12] hw/arm/virt-acpi-build: Don't create ITS id mappings by default
On 7/3/25 4:46 AM, Shameer Kolothum wrote: Commit d6afe18b7242 ("hw/arm/virt-acpi-build: Fix ACPI IORT and MADT tables when its=off") moved ITS group node generation under the its=on condition. However, it still creates rc_its_idmaps unconditionally, which results in duplicate ID mappings in the IORT table. Fixes: d6afe18b7242 ("hw/arm/virt-acpi-build: Fix ACPI IORT and MADT tables when its=off") Signed-off-by: Shameer Kolothum --- hw/arm/virt-acpi-build.c | 6 -- 1 file changed, 6 deletions(-) diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index cd90c47976..724fad5ffa 100644 --- a/hw/arm/virt-acpi-build.c +++ b/hw/arm/virt-acpi-build.c @@ -329,12 +329,6 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms) /* Sort the smmu idmap by input_base */ g_array_sort(rc_smmu_idmaps, iort_idmap_compare); -/* - * Knowing the ID ranges from the RC to the SMMU, it's possible to - * determine the ID ranges from RC that are directed to the ITS. - */ -create_rc_its_idmaps(rc_its_idmaps, rc_smmu_idmaps); - nb_nodes = 2; /* RC and SMMUv3 */ rc_mapping_count = rc_smmu_idmaps->len; Reviewed-by: Donald Dutile
Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
On 7/15/25 11:28 AM, Eric Auger wrote: Hi, On 7/10/25 3:22 AM, Donald Dutile wrote: On 7/9/25 3:20 PM, Nicolin Chen wrote: On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote: +enum { + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */ +}; Thanks for this work. I am happy to see that we can share the common code that allocates a NESTING_PARENT in the core using this flag. Yet on ARM, a STAGE1 page table isn't always a nested S1, the hardware accelerated one. More often, it can be just a regular 1-stage translation table via emulated translation code and an emulated iotlb. Because the user-created smmuv3 started as 'accelerated smmuv3', and had been 'de-accelerated' to simply 'user created smmuv3', I'm looking for some clarification in the above statement/request. Is the above suppose to reflect that a nested IOMMU has some hw-acceleration in its Stage1 implementation? If so, then call it that: STAGE1_ACCEL. If it's suppose to represent that an IOMMU has nested/2-stage support, then the above is a valid cap; -but-, having a nested/2-stage support IOMMU doesn't necessarily mean its accelerated. Well, there are an emulated "nested" mode and an hw-accelerated "nested" mode in the smmuv3 code, so we had to choose something like "accel" over "nested". Here, on the other hand, I think the core using this CAP would unlikely care about an emulated "nested" mode in the individual vIOMMU.. So I suggested: /* hardware-accelerated nested stage-1 page table support */ VIOMMU_CAP_NESTED_S1 = BIT_ULL(0), which it should be clear IMHO. If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"? Thanks Nicolin If the distinction is hw-based s1 vs emulated-based s1, than I'd prefer the use of VIOMMU_CAP_HW_NESTED_S1, and avoid the use VIOMMU_CAP_HW_NESTED_S1 or even VIOMMU_CAP_HW_NESTED look good to me too Thanks Eric +1 to VIOMMU_CAP_HW_NESTED ; the S1 is unnecessary if nested IOMMU is the info being conveyed. - Don of 'accel'/'ACCEL' unless it is an explicitly stated 'acceleration' feature/option in the SMMU spec. Thanks, - Don
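If the consensus above sticks, the capability bit and its one call site would read roughly as below. This is a sketch for the next respin; the exact return type of pci_device_get_viommu_cap() is assumed to be a 64-bit mask:

    enum {
        /* vIOMMU can back guest stage-1 with HW-accelerated nested translation */
        VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
    };

    if (pci_device_get_viommu_cap(pci_dev) & VIOMMU_CAP_HW_NESTED) {
        /* allocate the NESTING_PARENT hwpt in common code */
    }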