[Qemu-devel] [PATCH] pci-devfn: check that device/slot number is within range

2011-09-21 Thread Donald Dutile
Need to check that the guest slot/device number is not > 31; otherwise we walk off
the end of the devfn table when checking whether a devfn is available in the guest.

Before this fix, passing in addr=abc or addr=34 can crash qemu, or sometimes
fail gracefully if the data past the end of the devfn table happens to fail the
availability test.

With this fix, we get a clean error:
Property 'pci-assign.addr' doesn't take value '34'

Also tested when no addr= parameter is passed for the guest (pcicfg) address,
and that worked as well.
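
For illustration, a self-contained sketch of the devfn arithmetic involved (this
is not the qemu parser itself, just the encoding it performs):

    /* Sketch showing why slot > 31 must be rejected: a PCI devfn is 8 bits,
     * slot in bits 7:3 (0..31) and function in bits 2:0 (0..7), so a bus has
     * at most 32 * 8 = 256 devfns. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int slot = 34, fn = 0;
        unsigned int devfn = slot << 3 | fn;  /* 272, past the 0..255 devfn range */

        printf("devfn = %u (valid devfns are 0..255)\n", devfn);
        return 0;
    }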

Signed-off-by: Don Dutile 

---
 hw/qdev-properties.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/hw/qdev-properties.c b/hw/qdev-properties.c
index 7ce95b6..e0e54aa 100644
--- a/hw/qdev-properties.c
+++ b/hw/qdev-properties.c
@@ -524,6 +524,8 @@ static int parse_pci_devfn(DeviceState *dev, Property *prop, const char *str)
 return -EINVAL;
 if (fn > 7)
 return -EINVAL;
+if (slot > 31)
+return -EINVAL;
 *ptr = slot << 3 | fn;
 return 0;
 }
-- 
1.7.1




Re: [PATCH v3] hw/arm/virt: Avoid unexpected warning from Linux guest on host with Fujitsu CPUs

2024-06-11 Thread Donald Dutile




On 6/11/24 10:05 PM, Zhenyu Zhang wrote:

Multiple warning messages and corresponding backtraces are observed when a Linux
guest is booted on a host with Fujitsu CPUs. One of them is shown below.

[0.032443] [ cut here ]
[0.032446] uart-pl011 900.pl011: ARCH_DMA_MINALIGN smaller than
CTR_EL0.CWG (128 < 256)
[0.032454] WARNING: CPU: 0 PID: 1 at arch/arm64/mm/dma-mapping.c:54
arch_setup_dma_ops+0xbc/0xcc
[0.032470] Modules linked in:
[0.032475] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-452.el9.aarch64
[0.032481] Hardware name: linux,dummy-virt (DT)
[0.032484] pstate: 6045 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[0.032490] pc : arch_setup_dma_ops+0xbc/0xcc
[0.032496] lr : arch_setup_dma_ops+0xbc/0xcc
[0.032501] sp : 80008003b860
[0.032503] x29: 80008003b860 x28:  x27: aae4b949049c
[0.032510] x26:  x25:  x24: 
[0.032517] x23: 0100 x22:  x21: 
[0.032523] x20: 0001 x19: 2f06c02ea400 x18: 
[0.032529] x17: 208a5f76 x16: 6589dbcb x15: aae4ba071c89
[0.032535] x14:  x13: aae4ba071c84 x12: 455f525443206e61
[0.032541] x11: 68742072656c6c61 x10: 0029 x9 : aae4b7d21da4
[0.032547] x8 : 0029 x7 : 4c414e494d5f414d x6 : 0029
[0.032553] x5 : 000f x4 : aae4b9617a00 x3 : 0001
[0.032558] x2 :  x1 :  x0 : 2f06c029be40
[0.032564] Call trace:
[0.032566]  arch_setup_dma_ops+0xbc/0xcc
[0.032572]  of_dma_configure_id+0x138/0x300
[0.032591]  amba_dma_configure+0x34/0xc0
[0.032600]  really_probe+0x78/0x3dc
[0.032614]  __driver_probe_device+0x108/0x160
[0.032619]  driver_probe_device+0x44/0x114
[0.032624]  __device_attach_driver+0xb8/0x14c
[0.032629]  bus_for_each_drv+0x88/0xe4
[0.032634]  __device_attach+0xb0/0x1e0
[0.032638]  device_initial_probe+0x18/0x20
[0.032643]  bus_probe_device+0xa8/0xb0
[0.032648]  device_add+0x4b4/0x6c0
[0.032652]  amba_device_try_add.part.0+0x48/0x360
[0.032657]  amba_device_add+0x104/0x144
[0.032662]  of_amba_device_create.isra.0+0x100/0x1c4
[0.032666]  of_platform_bus_create+0x294/0x35c
[0.032669]  of_platform_populate+0x5c/0x150
[0.032672]  of_platform_default_populate_init+0xd0/0xec
[0.032697]  do_one_initcall+0x4c/0x2e0
[0.032701]  do_initcalls+0x100/0x13c
[0.032707]  kernel_init_freeable+0x1c8/0x21c
[0.032712]  kernel_init+0x28/0x140
[0.032731]  ret_from_fork+0x10/0x20
[0.032735] ---[ end trace  ]---

In Linux, a check is applied to every device exposed through a device-tree
node. The warning is raised when the device isn't DMA coherent and the cache
line size is larger than ARCH_DMA_MINALIGN (128 bytes). The cache line size is
derived from CTR_EL0[CWG], which corresponds to 256 bytes on the guest CPUs.
The DMA coherent capability is claimed through a 'dma-coherent' property in
the device-tree node or its parent nodes.

Fix the issue by adding a 'dma-coherent' property to the device-tree root
node, meaning all devices are treated as DMA coherent by default.
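
(For illustration, a self-contained sketch approximating the check described
above; the device name and the helper are stand-ins, not the kernel code itself:)

    #include <stdbool.h>
    #include <stdio.h>

    #define ARCH_DMA_MINALIGN 128            /* bytes, per the commit message */

    /* Rough model of the warning in arch/arm64/mm/dma-mapping.c: a device that
     * is not DMA coherent must not have a cache writeback granule (CTR_EL0.CWG)
     * larger than ARCH_DMA_MINALIGN. */
    static void dma_align_check(const char *dev, bool dma_coherent, int cwg)
    {
        if (!dma_coherent && cwg > ARCH_DMA_MINALIGN) {
            printf("%s: ARCH_DMA_MINALIGN smaller than CTR_EL0.CWG (%d < %d)\n",
                   dev, ARCH_DMA_MINALIGN, cwg);
        }
    }

    int main(void)
    {
        dma_align_check("uart-pl011", false, 256);  /* warns, as in the log above */
        dma_align_check("uart-pl011", true, 256);   /* silent once dma-coherent is set */
        return 0;
    }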

Signed-off-by: Zhenyu Zhang 
---
v3: Add comments explaining why we add 'dma-coherent' property (Peter)
---
  hw/arm/virt.c | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 3c93c0c0a6..3cefac6d43 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -271,6 +271,17 @@ static void create_fdt(VirtMachineState *vms)
  qemu_fdt_setprop_cell(fdt, "/", "#size-cells", 0x2);
  qemu_fdt_setprop_string(fdt, "/", "model", "linux,dummy-virt");
  
+/*
+ * For QEMU, all DMA is coherent. Advertising this in the root node
+ * has two benefits:
+ *
+ * - It avoids potential bugs where we forget to mark a DMA
+ *   capable device as being dma-coherent
+ * - It avoids spurious warnings from the Linux kernel about
+ *   devices which can't do DMA at all
+ */
+qemu_fdt_setprop(fdt, "/", "dma-coherent", NULL, 0);
+
   /* /chosen must exist for load_dtb to fill in necessary properties later */
  qemu_fdt_add_subnode(fdt, "/chosen");
  if (vms->dtb_randomness) {


+1 to Peter's suggested comment; otherwise, unless privy to this thread,
one would wonder how/why.

Reviewed-by: Donald Dutile 

Re: nested-smmuv3 topic for QEMU/libvirt, Nov 2024

2024-11-07 Thread Donald Dutile




On 11/7/24 3:31 PM, Nicolin Chen wrote:

Hi Eric,

On Thu, Nov 07, 2024 at 12:11:05PM +0100, Eric Auger wrote:

On 11/1/24 05:09, Nicolin Chen wrote:

Hi,

This is a continued discussion following previous month's:
https://lore.kernel.org/qemu-devel/Zvr%2Fbf7KgLN1cjOl@Asurada-Nvidia/

Kernel changes are getting closer to merge, as Jason's planning to
take vIOMMU series and smmuv3_nesting series into the iommufd tree:
https://lore.kernel.org/all/cover.1730313494.git.nicol...@nvidia.com/
https://lore.kernel.org/all/0-v4-9e99b76f3518+3a8-smmuv3_nesting_...@nvidia.com/

So, I think it's probably a good time to align with each other and
talk about kicking off some QEMU upstream work in the month ahead.

@Shameer,
Do you have some update on the pluggable smmuv3 module?

Updates on my side:
1) I have kept uAPI updated to the latest version and verified too.
There should be some polishing changes depending on how the basic
nesting infrastructure would look like from Intel/Duan's work.
2) I got some help from NVIDIA folks for the libvirt task. And they
have done some drafting and are now verifying the PCI topology
with "iommu=none".

Once the pluggable smmuv3 module is ready to test, we will make some
changes to libvirt for that and drop the auto-assigning patches from
the VIRT code, so as to converge on a libvirt+QEMU test.

FWIW, Robin requested a different solution for MSI mapping [1], vs. the RMR
one that we have been using since Eric's work. I drafted a few VFIO/IOMMUFD
patches for that, but paused to get the vIOMMU series merged in this kernel
cycle. I plan to continue in Nov/Dec. So, for the near term we will continue
with the RMR solution, until we have something solid later.

[1] https://lore.kernel.org/linux-iommu/ZrVN05VylFq8lK4q@Asurada-Nvidia/


At Red Hat we may find some cycles to resume working on the QEMU
integration. Please can you sketch some tasks we could carry out in
coordination with you and Shameer? Adding Don in the loop.


That is great!

So, given that Shameer is working on the pluggable module part and we
have folks working on libvirt, I think the only big thing here is
the SMMUv3 series itself. Please refer to the changes in the link:
  - cover-letter: Add HW accelerated nesting support for arm SMMUv3
  - https://github.com/nicolinc/qemu/commits/wip/for_smmuv3_nesting-v4/

I expect the IOMMU_HWPT_ALLOC (backend APIs) will go with Intel's
series once their current "emulated devices" one gets merged. And
I think I can prepare IOMMU_VIOMMU_ALLOC patches for backend APIs
aligning with HWPT's.

That being said, one thing I am not sure about is how we should bridge
between these backend APIs and the virtual IOMMUs (vSMMU/intel). I
think it'd be better for you and Red Hat to provide some insight,
if calling the backend APIs directly from a viommu module isn't a
preferable way.

We also need your comments on the vSMMU module patches, which are still
roughly a draft requiring rework on some small details, I think.
So, if your (and Don's) bandwidth allows, perhaps you could take
over the vSMMU patches? Otherwise, we can also share the reworking
task.

Thanks!
Nicolin


Nicolin,
Hi!
Sounds like a plan.  Glad to join the effort.
- Don




Re: [RFC PATCH 2/5] hw/arm/smmuv3: Add initial support for SMMUv3 Nested device

2024-11-27 Thread Donald Dutile




On 11/27/24 11:00 AM, Jason Gunthorpe wrote:

On Wed, Nov 27, 2024 at 10:21:24AM +, Shameerali Kolothum Thodi wrote:

For SMMUv3, NVIDIA-specific vCMDQ, it needs a parameter to state that specifically,
since I'm concluding from reading the SMMUv3 version G.a spec that ECMDQ was added
to be able to assign an ECMDQ to a VM,


Not sure the intention of ECMDQ as per that specification is to assign
it to a VM. I think the main idea behind it is to have one Command Queue
per host CPU to eliminate lock contention while submitting commands
to SMMU.


Right


AFAIK it is not safe to assign one of the ECMDQs to a guest yet. I think there is
no way you can associate a VMID with an ECMDQ. So there is no plan to support
ARM ECMDQ now.


Yep


'Yet' ...
The association would be done via the VMM -- no different than what associates
an assigned device to a VM today -- no hw-level (VM-)ID needed; it's a matter of
exposing it to the VM, or not, or mapping the (virtual) CMDQ to the mapped/associated ECMDQ.
They are purposely mapped 64K apart from each other, enabling page-level protection,
which I doubt is a per-CPU requirement for lock-contention avoidance (large-cache-block
spacing would be sufficient, even 4K; it's 64K spacing between ECMDQ regs -- the largest
ARM page size).

Summary: let's not assume this can't happen, and let's not have the chosen cmdline prevent it.
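
To make the 64K-spacing point concrete, a small sketch (only the 64K stride comes
from the above; the base address and queue count are made up):

    #include <stdint.h>
    #include <stdio.h>

    #define ECMDQ_FRAME_STRIDE 0x10000ULL    /* 64K between ECMDQ register frames */

    int main(void)
    {
        uint64_t smmu_base = 0x2b400000ULL;  /* made-up base, for illustration */

        /* each queue's registers sit in their own 64K-aligned frame, so stage-2
         * can map exactly one frame into a guest without exposing neighbours */
        for (unsigned int n = 0; n < 4; n++) {
            printf("ECMDQ %u frame at 0x%llx\n", n,
                   (unsigned long long)(smmu_base + n * ECMDQ_FRAME_STRIDE));
        }
        return 0;
    }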


NVIDIA VCMDQ is a completely vendor specific one. Perhaps ARM may come
up with an assignable CMDQ in future though.


Yes, it is easy to imagine an ECMDQ extension that provides the same HW
features that VCMDQ has in future. I hope ARM will develop one.


Right, so how to know what op is being "accel"'d wrt smmuv3.


... and all needs to be per-instance 
... libvirt  (or any other VMM orchestrator) will need to determine
compatibility for
  live migration. e.g., can one live migrate an accel=nv-vcmdq-based VM to
a host with
  accel=ecmdq support?  only nv-vcmdq?  what if there are version diffs of
nv-vcmdq over time?
  -- apologies, but I don't know the minute details of nv-vcmdq to
determine if that's unlikely or not.


Yes. This requires more thought. But our first aim is to get the basic smmuv3-accel
support in.


Yeah, there is no live migration support yet in the SMMU qemu driver,
AFAIK?

When it gets done, the supported options will have to be considered.

Jason






Re: [RFC PATCH 2/5] hw/arm/smmuv3: Add initial support for SMMUv3 Nested device

2024-11-27 Thread Donald Dutile




On 11/27/24 5:21 AM, Shameerali Kolothum Thodi wrote:




-Original Message-
From: Donald Dutile 
Sent: Tuesday, November 26, 2024 6:29 PM
To: Nicolin Chen ; Eric Auger

Cc: Shameerali Kolothum Thodi
; qemu-...@nongnu.org;
qemu-devel@nongnu.org; peter.mayd...@linaro.org; j...@nvidia.com;
Linuxarm ; Wangzhou (B)
; jiangkunkun ;
Jonathan Cameron ;
zhangfei@linaro.org
Subject: Re: [RFC PATCH 2/5] hw/arm/smmuv3: Add initial support for
SMMUv3 Nested device



On 11/13/24 1:05 PM, Nicolin Chen wrote:

Hi Eric,

On Wed, Nov 13, 2024 at 06:12:15PM +0100, Eric Auger wrote:

On 11/8/24 13:52, Shameer Kolothum wrote:

@@ -181,6 +181,7 @@ static const MemMapEntry base_memmap[] = {
   [VIRT_PVTIME] = { 0x090a, 0x0001 },
   [VIRT_SECURE_GPIO] ={ 0x090b, 0x1000 },
   [VIRT_MMIO] =   { 0x0a00, 0x0200 },
+[VIRT_SMMU_NESTED] ={ 0x0b00, 0x0100 },



I agree with Mostafa that the _NESTED terminology may not be the best
choice.
The motivation behind that multi-instance attempt, as introduced in
https://lore.kernel.org/all/ZEcT%2F7erkhHDaNvD@Asurada-Nvidia/
was:
- SMMUs with different feature bits
- support of VCMDQ HW extension for SMMU CMDQ
- need for separate S1 invalidation paths

If I understand correctly this is mostly wanted for VCMDQ handling? if
this is correct we may indicate that somehow in the terminology.

If I understand correctly VCMDQ terminology is NVidia specific while
ECMDQ is the baseline (?).


VCMDQ makes a multi-vSMMU-instance design a hard requirement, yet
the point (3) for separate invalidation paths also matters. Jason
suggested the VMM, in the base case, create multiple vSMMU instances, as the
kernel doc mentions here:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/userspace-api/iommufd.rst#n84


W.r.t naming, maybe something related to "hardware-accelerated"?


Given that 'accel' has been used for hw-acceleration elsewhere, that seems
like a reasonable 'mode'.
But it needs a parameter to state what is being accelerated,
i.e., the more global 'accel=kvm' has 'kvm'.


I was thinking more like calling this hw accelerated nested SMMUv3 emulation
as 'smmuv3-accel'.  This avoids confusion with the already existing
'iommu=smmuv3' that also has a nested emulation support.

ie,
-device arm-smmuv3-accel,id=smmuv1,bus=pcie.1 \


I -think- you are saying below, that we have to think a bit more about this
device tagging.  I'm thinking more like
 - device arm-smmuv3,accel=,id=smmu1,bus=pcie.1 \



For SMMUv3, NVIDIA-specific vCMDQ, it needs a parameter to state that specifically,
since I'm concluding from reading the SMMUv3 version G.a spec that ECMDQ was added
to be able to assign an ECMDQ to a VM,


Not sure the intention of ECMDQ as per that specification is to assign
it to a VM. I think the main idea behind it is to have one Command Queue
per host CPU to eliminate lock contention while submitting commands
to SMMU.

AFAIK it is not safe to assign one of the ECMDQs to a guest yet. I think there is
no way you can associate a VMID with an ECMDQ. So there is no plan to support
ARM ECMDQ now.

NVIDIA VCMDQ is a completely vendor specific one. Perhaps ARM may come
up with an assignable CMDQ in future though.

  and let the VM do CMDQ-driven invalidations via
a similar mechanism as assigned PCI-device mmio space in a VM.
So, how should the QEMU invocation select what parts to 'accel' in the vSMMUv3 given
to the VM?  ... and given the history of hw-based virt-acceleration, I can only guess
more SMMUv3 accel tweaks will be found/desired/implemented.

So, given there is an NVIDIA-specific/like ECMDQ, but different, the accel parameter
chosen has to consider 'name-space collision', i.e., accel=nv-vcmdq and accel=ecmdq,
unless sw can be made to smartly probe and determine the underlying diffs, and have
equivalent functionality, in which case a simpler 'accel=vcmdq' could be used.



Yep. Probably we could abstract that from the user and handle it within
Qemu when the kernel reports the capability based on physical SMMUv3.


Finally, wrt libvirt, how does it know/tell what can and should be used?
For ECMDQ, something under sysfs for an SMMUv3 could expose its presence/capability/availability
(tag for use/alloc'd for a VM), or an ioctl/cdev i/f to the SMMUv3.
But how does one know today that there's NVIDIA-vCMDQ support on its SMMUv3? -- is it
exposed in sysfs, ioctl, cdev?


I think the capability will be reported through an IOCTL.  Nicolin?


... and all needs to be per-instance 
... libvirt (or any other VMM orchestrator) will need to determine compatibility for
  live migration. e.g., can one live migrate an accel=nv-vcmdq-based VM to a host with
  accel=ecmdq support?  only nv-vcmdq?  what if there are version diffs of nv-vcmdq over time?
  -- apo

Re: [RFC PATCH 2/5] hw/arm/smmuv3: Add initial support for SMMUv3 Nested device

2024-11-26 Thread Donald Dutile




On 11/13/24 1:05 PM, Nicolin Chen wrote:

Hi Eric,

On Wed, Nov 13, 2024 at 06:12:15PM +0100, Eric Auger wrote:

On 11/8/24 13:52, Shameer Kolothum wrote:

@@ -181,6 +181,7 @@ static const MemMapEntry base_memmap[] = {
  [VIRT_PVTIME] = { 0x090a, 0x0001 },
  [VIRT_SECURE_GPIO] ={ 0x090b, 0x1000 },
  [VIRT_MMIO] =   { 0x0a00, 0x0200 },
+[VIRT_SMMU_NESTED] ={ 0x0b00, 0x0100 },



I agree with Mostafa that the _NESTED terminology may not be the best
choice.
The motivation behind that multi-instance attempt, as introduced in
https://lore.kernel.org/all/ZEcT%2F7erkhHDaNvD@Asurada-Nvidia/
was:
- SMMUs with different feature bits
- support of VCMDQ HW extension for SMMU CMDQ
- need for separate S1 invalidation paths

If I understand correctly this is mostly wanted for VCMDQ handling? if
this is correct we may indicate that somehow in the terminology.

If I understand correctly VCMDQ terminology is NVidia specific while
ECMDQ is the baseline (?).


VCMDQ makes a multi-vSMMU-instance design a hard requirement, yet
the point (3) for separate invalidation paths also matters. Jason
suggested VMM in base case to create multi vSMMU instances as the
kernel doc mentioned here:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/Documentation/userspace-api/iommufd.rst#n84

W.r.t naming, maybe something related to "hardware-accelerated"?


Given that 'accel' has been used for hw-acceleration elsewhere, that seems like
a reasonable 'mode'.
But it needs a parameter to state what is being accelerated,
i.e., the more global 'accel=kvm' has 'kvm'.

For SMMUv3, NVIDIA-specific vCMDQ, it needs a parameter to state that specifically,
since I'm concluding from reading the SMMUv3 version G.a spec that ECMDQ was added
to be able to assign an ECMDQ to a VM, and let the VM do CMDQ-driven invalidations via
a similar mechanism as assigned PCI-device mmio space in a VM.
So, how should the QEMU invocation select what parts to 'accel' in the vSMMUv3 given
to the VM?  ... and given the history of hw-based virt-acceleration, I can only guess
more SMMUv3 accel tweaks will be found/desired/implemented.

So, given there is an NVIDIA-specific/like ECMDQ, but different, the accel parameter
chosen has to consider 'name-space collision', i.e., accel=nv-vcmdq and accel=ecmdq,
unless sw can be made to smartly probe and determine the underlying diffs, and have
equivalent functionality, in which case a simpler 'accel=vcmdq' could be used.

Finally, wrt libvirt, how does it know/tell what can and should be used?
For ECMDQ, something under sysfs for an SMMUv3 could expose its presence/capability/availability
(tag for use/alloc'd for a VM), or an ioctl/cdev i/f to the SMMUv3.

But how does one know today that there's NVIDIA-vCMDQ support on its SMMUv3? -- is it
exposed in sysfs, ioctl, cdev?
... and all needs to be per-instance 
... libvirt (or any other VMM orchestrator) will need to determine compatibility for
live migration. e.g., can one live migrate an accel=nv-vcmdq-based VM to a host with
accel=ecmdq support?  only nv-vcmdq?  what if there are version diffs of nv-vcmdq over time?
-- apologies, but I don't know the minute details of nv-vcmdq to determine if that's unlikely or not.

Once the qemu-smmuv3-api is defined, with the recognition of what libvirt (or 
any other VMM) needs to probe/check/use for hw-accelerated features,
I think it'll be more straight-fwd to implement, and (clearly) understand from 
a qemu command line. :)

Thanks,
- Don


Thanks
Nicolin






Re: [RFC PATCH 2/5] hw/arm/smmuv3: Add initial support for SMMUv3 Nested device

2024-12-02 Thread Donald Dutile




On 11/28/24 7:54 AM, Jason Gunthorpe wrote:

On Wed, Nov 27, 2024 at 08:44:47PM -0800, Nicolin Chen wrote:

On Wed, Nov 27, 2024 at 11:29:06PM -0500, Donald Dutile wrote:

On 11/27/24 5:21 AM, Shameerali Kolothum Thodi wrote:

W.r.t naming, maybe something related to "hardware-accelerated"?


Given that 'accel' has been used for hw-acceleration elsewhere, that seems
like a reasonable 'mode'.
But, it needs a paramater to state was is being accelerated.
i.e., the more global 'accel=kvm' has 'kvm'.


I was thinking more like calling this hw accelerated nested SMMUv3 emulation
as 'smmuv3-accel'.  This avoids confusion with the already existing
'iommu=smmuv3' that also has a nested emulation support.

ie,
-device arm-smmuv3-accel,id=smmuv1,bus=pcie.1 \


..

I -think- you are saying below, that we have to think a bit more about this
device tagging.  I'm thinking more like
  - device arm-smmuv3,accel=,id=smmu1,bus=pcie.1 \


I wonder if we really need a "vcmdq" enabling/disabling option?

Jason's suggested approach for a vIOMMU allocation is to retry-
on-failure, so my draft patches allocate a TEGRA241_CMDQV type
of vIOMMU first, and then fall back to a regular SMMUV3 type if
it fails. So, a host that doesn't have a VCMDQ capability could
still work with the fallback/default pathway.


It needs to be configurable so the VM can be configured in a
consistent way across nodes


Yes.
To expound further, one wants to be able to define 'acceptable' VM criteria,
so libvirt (or OpenStack?) can find and generate the list of 'acceptable
nodes', typically a priori, that can be a match for the acceptance criteria.
Conversely, if one specifies a set of systems that one wants to be able
to migrate across, then libvirt needs to find and select/set the
features|attributes that enable the VM to migrate in a compatible way.


Autodetection of host features is nice for experimenting, but scale
deployments should precisely specify every detail about the VM and not
rely on host detection. Otherwise the VM instance type will be
ill-defined.

Jason






Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable nested SMMUv3

2024-12-17 Thread Donald Dutile




On 12/13/24 8:19 AM, Daniel P. Berrangé wrote:

On Fri, Dec 13, 2024 at 08:46:42AM -0400, Jason Gunthorpe wrote:

On Fri, Dec 13, 2024 at 12:00:43PM +, Daniel P. Berrangé wrote:

On Fri, Nov 08, 2024 at 12:52:37PM +, Shameer Kolothum via wrote:

Hi,

This series adds initial support for a user-creatable "arm-smmuv3-nested"
device to Qemu. At present the Qemu ARM SMMUv3 emulation is per machine
and cannot support multiple SMMUv3s.

In order to support vfio-pci dev assignment with vSMMUv3, the physical
SMMUv3 has to be configured in nested mode. Having a pluggable
"arm-smmuv3-nested" device enables us to have multiple vSMMUv3 for Guests
running on a host with multiple physical SMMUv3s. A few benefits of doing
this are,


I'm not very familiar with arm, but from this description I'm not
really seeing how "nesting" is involved here. You're only talking
about the host and 1 L1 guest, no L2 guest.


nesting is the term the iommu side is using to refer to the 2
dimensional paging, ie a guest page table on top of a hypervisor page
table.

Nothing to do with vm nesting.


Ok, that naming is destined to cause confusion for many, given the
commonly understood use of 'nesting' in the context of VMs...

  

Also what is the relation between the physical SMMUv3 and the guest
SMMUv3 that's referenced ? Is this in fact some form of host device
passthrough rather than nesting ?


It is an acceleration feature; the iommu HW does more work instead of
the software emulating things. Similar to how the 2d paging option in
KVM is an acceleration feature.

All of the iommu series on vfio are creating paravirtualized iommu
models inside the VM. They access various levels of HW acceleration to
speed up the paravirtualization.


... describing it as a HW accelerated iommu makes it significantly clearer
to me what this proposal is about. Perhaps the device is better named as
"arm-smmuv3-accel" ?


I'm having deja-vu! ;-)
Thanks for echoing my earlier statements in this patch series about the use of
'nested', and the better use of 'accel' in these circumstances.
Even 'accel' on an 'arm-smmuv3' is a bit of a hammer, as there can be multiple
accel features &/or implementations... I would like to see 'accel' as a
parameter to 'arm-smmuv3', and not a complete name-space unto itself, so we can
do things like 'accel=cmdvq', accel='2-level', ...

and for libvirt's sanity, a way to get those hw features from sysfs for
(possible) migration-compatibility testing.
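
Something along these lines (purely illustrative; neither the option name nor
these values exist today):

    -device arm-smmuv3,id=smmu1,bus=pcie.1,accel=cmdvq \
    -device arm-smmuv3,id=smmu2,bus=pcie.2,accel=off \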



With regards,
Daniel





Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable nested SMMUv3

2025-01-10 Thread Donald Dutile

Nicolin,
Hi!


On 1/8/25 11:45 PM, Nicolin Chen wrote:

On Mon, Dec 16, 2024 at 10:01:29AM +, Shameerali Kolothum Thodi wrote:

And patches prior to this commit adds that support:
4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
SMMUv3")

Nicolin is soon going to send out those for review. Or I can include
those in this series so that it gives a complete picture. Nicolin?


Just found that I forgot to reply this one...sorry

I asked Don/Eric to take over that vSMMU series:
https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/
(The majority of my effort has been still on the kernel side:
  previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..)

Don/Eric, is there any update from your side?


Apologies for the delayed response; I've been at a customer site and haven't been
keeping up w/biz email.
Eric is probably waiting for me to get back and chat as well.
Will look to reply early next week.
- Don


I think it's also a good time to align with each other so we
can take our next step in the new year :)

Thanks
Nicolin






Re: [RFC PATCH v2 13/20] hw/arm/smmuv3-accel: Introduce helpers to batch and issue cache invalidations

2025-03-19 Thread Donald Dutile




On 3/19/25 5:48 AM, Shameerali Kolothum Thodi wrote:




-Original Message-
From: Donald Dutile 
Sent: Wednesday, March 19, 2025 1:31 AM
To: Shameerali Kolothum Thodi
; qemu-...@nongnu.org;
qemu-devel@nongnu.org
Cc: eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com;
nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com;
mo...@nvidia.com; smost...@google.com; Linuxarm
; Wangzhou (B) ;
jiangkunkun ; Jonathan Cameron
; zhangfei@linaro.org
Subject: Re: [RFC PATCH v2 13/20] hw/arm/smmuv3-accel: Introduce helpers
to batch and issue cache invalidations

Shameer,

Hi,


On 3/11/25 10:10 AM, Shameer Kolothum wrote:

From: Nicolin Chen 

Inroduce an SMMUCommandBatch and some helpers to batch and issue the
   Introduce
commands.  Currently separate out TLBI commands and device cache
commands to avoid some errata on certain versions of SMMUs. Later it
should check IIDR register to detect if underlying SMMU hw has such an
erratum.

Where is all this info about 'certain versions of SMMUs', and what does 'check IIDR
register' have to do with 'underlying SMMU hw has such an erratum'?
-- which IIDR (& bits)? or are we talking about rsvd SMMU_IDR<> registers?


I guess the batching has constraints on some platforms; IIRC, this was discussed
somewhere in a kernel thread.

Nicolin, could you please provide some background on this.


A lore link if it's discussed upstream, thanks.



And can't these helpers be used for emulated smmuv3 as well as
accelerated?


Could be, I guess. But there's no benefit in terms of performance. Maybe it will make
the code look nicer. I will take a look if it doesn't mean too many changes in the emulated path.


Thanks for looking into it.  The push is to use common code path(s) to invoke 
(or not)
an accel path.
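
For illustration, a minimal self-contained sketch of the batching idea (stand-in
types and sizes, not the RFC's code) -- the point being that a single
accumulate/flush pair could back both the emulated and the accel paths:

    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct Cmd { uint32_t word[4]; } Cmd;   /* stand-in SMMU command */

    typedef struct SMMUCommandBatch {
        Cmd cmds[32];
        uint32_t cons[32];    /* consumer index to report back per command */
        int ncmds;
    } SMMUCommandBatch;

    static int smmu_batch_add(SMMUCommandBatch *b, const Cmd *cmd, uint32_t cons)
    {
        if (b->ncmds == 32) {
            return -ENOSPC;   /* caller flushes and retries */
        }
        b->cmds[b->ncmds] = *cmd;
        b->cons[b->ncmds] = cons;
        b->ncmds++;
        return 0;
    }

    static void smmu_batch_flush(SMMUCommandBatch *b)
    {
        /* the real code would hand b->cmds to the host (accel path) or walk
         * them through the emulated invalidation path */
        printf("issuing %d command(s)\n", b->ncmds);
        b->ncmds = 0;
    }

    int main(void)
    {
        SMMUCommandBatch batch = { .ncmds = 0 };
        Cmd c = { .word = { 0 } };

        smmu_batch_add(&batch, &c, 0);
        smmu_batch_add(&batch, &c, 1);
        smmu_batch_flush(&batch);
        return 0;
    }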


Thanks,
Shameer







Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device

2025-03-18 Thread Donald Dutile




On 3/18/25 3:13 PM, Nicolin Chen wrote:

On Tue, Mar 18, 2025 at 07:31:36PM +0100, Eric Auger wrote:

On 3/17/25 9:19 PM, Nicolin Chen wrote:

On Mon, Mar 17, 2025 at 04:24:53PM -0300, Jason Gunthorpe wrote:

On Mon, Mar 17, 2025 at 12:10:19PM -0700, Nicolin Chen wrote:

Another question: how does an emulated device work with a vSMMUv3?
I could imagine that all the accel steps would be bypassed since
!sdev->idev. Yet, the emulated iotlb should cache its translation
so we will need to flush the iotlb, which will increase complexity
as the TLBI command dispatching function will need to be aware what
ASID is for emulated device and what is for vfio device..

I think you should block it. We already expect different vSMMU's
depending on the physical SMMU under the PCI device, it makes sense
that a SW VFIO device would have it's own, non-accelerated, vSMMU
model in the guest.

Yea, I agree and it'd be cleaner for an implementation separating
them.

In my mind, the general idea of "accel=on" is also to keep things
in a more efficient way: passthrough devices go to HW-accelerated
vSMMUs (separated PCIE buses), while emulated ones go to a vSMMU-
bypassed (PCIE0).



Originally a specific SMMU device was needed to opt in for MSI reserved
region ACPI IORT description which are not needed if you don't rely on
S1+S2. However if we don't rely on this trick this was not even needed
with legacy integration
(https://patchwork.kernel.org/project/qemu-devel/cover/20180921081819.9203-1-eric.au...@redhat.com/).

Nevertheless I don't think anything prevents the acceleration granted
device from also working with virtio/vhost devices for instance unless
you unplug the existing infra. The translation and invalidation just
should use different control paths (explicit translation requests,
invalidation notifications towards vhost, ...).


smmuv3_translate() is per sdev, so it's easy.

Invalidation is done via commands, which could be tricky:
a) Broadcast command
b) ASID validation -- we'll need to keep track of a list of ASIDs
for vfio device to compare the ASID in each per-ASID command,
potentially by trapping all CFGI_CD(_ALL) commands? Note that
each vfio device may have multiple ASIDs (for multiple CDs).
Either a or b above will have some validation efficiency impact.


Again, what justifies having different qemu devices for the same
IP? I understand that it simplifies the implementation but I am not sure
this is a good reason. Nevertheless it is worth challenging. What is the
plan for intel iommu? Will we have 2 devices, the legacy device and one
for nested?


Hmm, it seems that there are two different topics:
1. Use one SMMU device model (source code file; "iommu=" string)
for both an emulated vSMMU and an HW-accelerated vSMMU.
2. Allow one vSMMU instance to work with both an emulated device
and a passthrough device.
And I get that you want both 1 and 2.

I'm totally okay with 1, yet see no compelling benefit from 2 for
the increased complexity in the invalidation routine.

And another question about the mixed device attachment. Let's say
we have in the host:
   VFIO passthrough dev0 -> pSMMU0
   VFIO passthrough dev1 -> pSMMU1
Should we allow emulated devices to be flexibly plugged?
   dev0 -> vSMMU0 /* Hard requirement */
   dev1 -> vSMMU1 /* Hard requirement */
   emu0 -> vSMMU0 /* Soft requirement; can be vSMMU1 also */
   emu1 -> vSMMU1 /* Soft requirement; can be vSMMU0 also */

Thanks
Nicolin


I agree w/Jason & Nicolin: different vSMMUs for pass-through devices than emulated, 
& vice-versa.
Not mixing... because... of the next agreement:

I agree with Eric that 'accel' isn't needed -- this should be ascertained from 
the pSMMU that a physical device is attached to.
Now... how does vfio(?; why not qemu?) layer determine that? -- where are 
SMMUv3 'accel' features exposed either: a) in the device struct (for the 
smmuv3) or (b) somewhere under sysfs? ... I couldn't find anything under either 
on my g-h system, but would appreciate a ptr if there is.
and like Eric, although 'accel' is better than the original 'nested', it's 
non-obvious what accel feature(s) are being turned on, or not.
In fact, if broken accel hw occurs ('if' -> 'when'), how should it be turned
off? ... if the info is in the kernel, a kernel boot-param will be needed;
if in sysfs, a write of 0 to an enable (to disable it) may be an alternative as well.
Bottom line: we need (a) a way to ascertain the accel feature and (b) a way to
disable it when it is broken, so qemu's smmuv3 spec will 'just work'.
[This may also help when migrating from a machine that has accel working to one
that does not.]

... and when an emulated device is assigned a vSMMU, there are no accel 
features ... unless we have tunables like batch iotlb invalidation for perf 
reasons, which can be viewed as an 'accel' option.
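
To make the mixed-attachment concern concrete, a self-contained sketch of the
dispatch question (hypothetical names and stand-in types, not qemu code):
commands carrying a StreamID are easy to route, while per-ASID invalidations
need extra bookkeeping to know whether they hit the emulated IOTLB, the host
(vfio) side, or both.

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { CMD_CFGI_CD, CMD_TLBI_NH_ASID } CmdType;

    typedef struct {
        CmdType type;
        unsigned int sid;    /* valid for CFGI_* */
        unsigned int asid;   /* valid for TLBI_NH_ASID */
    } Cmd;

    /* stand-ins for "is this StreamID / ASID backed by a passthrough device?" */
    static bool sid_is_passthrough(unsigned int sid)   { return sid == 0x8; }
    static bool asid_is_passthrough(unsigned int asid) { return asid == 5; }

    static void dispatch(const Cmd *cmd)
    {
        switch (cmd->type) {
        case CMD_CFGI_CD:
            if (sid_is_passthrough(cmd->sid)) {
                printf("CFGI_CD sid=%#x -> propagate to host SMMU\n", cmd->sid);
            } else {
                printf("CFGI_CD sid=%#x -> update emulated config cache\n", cmd->sid);
            }
            break;
        case CMD_TLBI_NH_ASID:
            if (asid_is_passthrough(cmd->asid)) {
                printf("TLBI asid=%u -> propagate to host SMMU\n", cmd->asid);
            } else {
                printf("TLBI asid=%u -> flush emulated IOTLB\n", cmd->asid);
            }
            break;
        }
    }

    int main(void)
    {
        Cmd a = { .type = CMD_CFGI_CD, .sid = 0x8 };
        Cmd b = { .type = CMD_TLBI_NH_ASID, .asid = 1 };

        dispatch(&a);
        dispatch(&b);
        return 0;
    }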




Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device

2025-03-18 Thread Donald Dutile




On 3/17/25 3:24 PM, Jason Gunthorpe wrote:

On Mon, Mar 17, 2025 at 12:10:19PM -0700, Nicolin Chen wrote:

Another question: how does an emulated device work with a vSMMUv3?
I could imagine that all the accel steps would be bypassed since
!sdev->idev. Yet, the emulated iotlb should cache its translation
so we will need to flush the iotlb, which will increase complexity
as the TLBI command dispatching function will need to be aware what
ASID is for emulated device and what is for vfio device..


I think you should block it. We already expect different vSMMU's

... and when you say 'block', you mean qemu prints out a helpful message
like "Mixing emulate/virtual devices and physical devices on a single SMMUv3 is 
not allowed.
  Specify separate smmuv3 objects for each type of device; multiple smmuv3 
objects may
  be required for each physical device if they are attached to different 
smmuv3's in the host system."

Or would that be an allowed qemu machine definition, but the 'block' would be a 
warning like:
 "Mixing emulated/virtual devices and physical devices on a single SMMUv3 is 
not recommended for
  performance reasons.  To yield optimal performance, place physical devices on 
separate SMMUv3 objects
  than emulated/virtual device SMMUv3 objects."
... and in this case, the physical devices would not use the accel features of 
an smmuv3, but still be 'functional'.
This may be desired for a machine definition that wants to be used on different 
hosts that may not have the
(same) accel feature(s).




depending on the physical SMMU under the PCI device, it makes sense
that a SW VFIO device would have it's own, non-accelerated, vSMMU
model in the guest.

Jason






Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb-pcie bus

2025-03-18 Thread Donald Dutile

Shameer,

Hi!

On 3/11/25 10:10 AM, Shameer Kolothum wrote:

User must associate a pxb-pcie root bus to smmuv3-accel
and that is set as the primary-bus for the smmu dev.

Signed-off-by: Shameer Kolothum 
---
  hw/arm/smmuv3-accel.c | 19 +++
  1 file changed, 19 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index c327661636..1471b65374 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -9,6 +9,21 @@
  #include "qemu/osdep.h"
  
  #include "hw/arm/smmuv3-accel.h"

+#include "hw/pci/pci_bridge.h"
+
+static int smmuv3_accel_pxb_pcie_bus(Object *obj, void *opaque)
+{
+DeviceState *d = opaque;
+
+if (object_dynamic_cast(obj, "pxb-pcie-bus")) {
+PCIBus *bus = PCI_HOST_BRIDGE(obj->parent)->bus;
+if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus->name)) {
+object_property_set_link(OBJECT(d), "primary-bus", OBJECT(bus),
+ &error_abort);
+}
+}
+return 0;
+}
  
  static void smmu_accel_realize(DeviceState *d, Error **errp)

  {
@@ -17,6 +32,9 @@ static void smmu_accel_realize(DeviceState *d, Error **errp)
  SysBusDevice *dev = SYS_BUS_DEVICE(d);
  Error *local_err = NULL;
  
+object_child_foreach_recursive(object_get_root(),

+   smmuv3_accel_pxb_pcie_bus, d);
+
  object_property_set_bool(OBJECT(dev), "accel", true, &error_abort);
  c->parent_realize(d, &local_err);
  if (local_err) {
@@ -33,6 +51,7 @@ static void smmuv3_accel_class_init(ObjectClass *klass, void 
*data)
  device_class_set_parent_realize(dc, smmu_accel_realize,
  &c->parent_realize);
  dc->hotpluggable = false;
+dc->bus_type = TYPE_PCIE_BUS;
  }
  
  static const TypeInfo smmuv3_accel_type_info = {


I am not seeing the need for a pxb-pcie bus (switch) introduced for each 'accel'.
Isn't the IORT able to define different SMMUs for different RIDs? If so, isn't that
sufficient to associate (define) an SMMU<->RID association without introducing a pxb-pcie?
And again, I'm not sure how that improves/enables the device<->SMMU associativity?
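
For context, my reading of the intended usage is roughly the following (device
and option names taken from this RFC; the other flags are illustrative):

    -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \
    -device arm-smmuv3-accel,id=smmu1,bus=pcie.1 \
    -device pcie-root-port,id=rp1,bus=pcie.1,chassis=1 \
    -device vfio-pci,host=0000:75:00.1,bus=rp1 \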

Feel free to enlighten me where I may have mis-read/interpreted the IORT & 
SMMUv3 specs.

Thanks,
- Don




Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device

2025-03-18 Thread Donald Dutile

Jason,
Hey!

On 3/18/25 8:23 PM, Jason Gunthorpe wrote:

On Tue, Mar 18, 2025 at 05:22:51PM -0400, Donald Dutile wrote:


I agree with Eric that 'accel' isn't needed -- this should be
ascertained from the pSMMU that a physical device is attached to.


I seem to remember the point was made that we don't actually know if
accel is possible, or desired, especially in the case of hotplug.


In the case of hw-passthrough hot-plug, what isn't known?:
a) domain:b:d.f is known
b) thus its hierarchy and SMMUv3 association in the host is known
c) thus, if the (accel) features of the SMMUv3 were exposed (known),
   then the proper setup (separate SMMUv3 vs system-wide-emulated SMMUv3;
   association of (allocated/configured) vSMMUv3 to pSMMUv3) would be known/made

What else is missing?


The accelerated mode has a number of limitations that the software
mode does not have. I think it does make sense that the user would
deliberately choose to use a more restrictive operating mode and then
would have to meet the requirements - eg by creating the required
number and configuration of vSMMUs.


At a qemu-cmd level, the right number & config of smmuv3's, but
libvirt, if it had the above info, could auto-generate the right number
of smmuv3's (stages, accel-features, etc.) ... just as it does today in
creating the right number of pcie bus's, RPs, etc. from simple(r)
device specs into more complete, qemu configs.


Now... how does vfio(?; why not qemu?) layer determine that? --
where are SMMUv3 'accel' features exposed either: a) in the device
struct (for the smmuv3) or (b) somewhere under sysfs? ... I couldn't
find anything under either on my g-h system, but would appreciate a
ptr if there is.


I think it is not discoverable yet other than through
try-and-fail. Discoverability would probably be some bits in an
iommufd GET_INFO ioctl or something like that.


I don't see how iommufd would 'get-info' the needed info any better
than any other interface/subsystem.  ...


and like Eric, although 'accel' is better than the
original 'nested', it's non-obvious what accel feature(s) are being
turned on, or not.


There are really only one accel feature - direct HW usage of the IO
Page table in the guest (no shadowing).

A secondary addon would be direct HW usage of an invalidation queue in
the guest.


and, if architected correctly, even in (device-specific) sw-provided tables,
it could be 'formatted' in a way that it was discoverable by the appropriate 
layers
(libvirt, qemu).
Once discoverable, this whole separate accel device -- which is really an
attribute of an SMMUv3 -- can be generalized, and reduced, to a much
smaller, simpler, sw footprint, with the concept of callbacks (as the series
uses) to enable hw accelerators to perform the shadow-ops that fully-emulated
smmuv3 would have to do.
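
To sketch the callback idea (hypothetical names and types, not the series'
code): the core emulated SMMUv3 calls through an ops table, and an accel
backend only overrides the hooks it can offload.

    #include <stdio.h>

    typedef struct SMMUAccelOps {
        /* called when the guest installs/updates a Context Descriptor */
        int (*install_cd)(unsigned int sid, unsigned long long cd_gpa);
        /* called for guest CMD_TLBI_* commands */
        int (*invalidate)(unsigned int asid);
    } SMMUAccelOps;

    static int sw_install_cd(unsigned int sid, unsigned long long cd_gpa)
    {
        printf("emulated: shadow CD for sid=%#x from 0x%llx\n", sid, cd_gpa);
        return 0;
    }

    static int sw_invalidate(unsigned int asid)
    {
        printf("emulated: flush IOTLB for asid=%u\n", asid);
        return 0;
    }

    static const SMMUAccelOps emulated_ops = {
        .install_cd = sw_install_cd,
        .invalidate = sw_invalidate,
    };

    int main(void)
    {
        const SMMUAccelOps *ops = &emulated_ops;  /* an accel backend would swap this */

        ops->install_cd(0x8, 0x80001000ULL);
        ops->invalidate(1);
        return 0;
    }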
 

kernel boot-param will be needed; if in sysfs, a write of 0 to an
enable (to disable it) may be an alternative as well.  Bottom line: we
need (a) a way to ascertain the accel feature and (b) a way to disable
it when it is broken, so qemu's smmuv3 spec will 'just work'.


You'd turn it off by not asking qemu to use it; that is sort of the
reasoning behind the command-line opt-in for accel or not.

It would make machine-level definitions far more portable if the
working/non-working, and the one-accel, or two-accel, or three-accel, or ...
features were dynamically determined  vs a static (qemu) machine config, that 
would have
to be manipulated each time it ran on a different machine.

e.g., cluster sw scans servers for machines with device-X.
  create VMs, assigning some/all of device-X to a VM via its own smmuv3. 
done.
  Now, if the smmuv3 features were exposed all the way up to userspace,
  then one could argue the cluster sw could scan for those features and add 
it
  to the accel=x,y,z option of the smmuv3 associated with an assigned 
device.
  potato/po-tah-toe cluster sw or libvirt or qemu or  
scans/reads
   ... discoverability of the features has to be done by
  (a) a computer, or (b) an error-prone human.
 ... all that AI gone to waste ... ;-)

- Don



Jason






Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device

2025-03-20 Thread Donald Dutile




On 3/19/25 2:09 PM, Eric Auger wrote:

Hi Nicolin,


On 3/19/25 6:14 PM, Nicolin Chen wrote:

On Wed, Mar 19, 2025 at 05:45:51PM +0100, Eric Auger wrote:



On 3/17/25 8:10 PM, Nicolin Chen wrote:

On Mon, Mar 17, 2025 at 07:07:52PM +0100, Eric Auger wrote:

On 3/17/25 6:54 PM, Nicolin Chen wrote:

On Wed, Mar 12, 2025 at 04:15:10PM +0100, Eric Auger wrote:

On 3/11/25 3:10 PM, Shameer Kolothum wrote:

Based on SMMUv3 as a parent device, add a user-creatable smmuv3-accel
device. In order to support vfio-pci dev assignment with a Guest

guest

SMMUv3, the physical SMMUv3 has to be configured in nested(S1+s2)

nested (s1+s2)

mode, with Guest owning the S1 page tables. Subsequent patches will

the guest

add support for smmuv3-accel to provide this.

Can't this -accel smmu also works with emulated devices? Do we want an
exclusive usage?

Is there any benefit from emulated devices working in the HW-
accelerated nested translation mode?

Not really but do we have any justification for using different device
name in accel mode? I am not even sure that accel option is really
needed. Ideally the qemu device should be able to detect it is
protecting a VFIO device, in which case it shall check whether nested is
supported by host SMMU and then automatically turn accel mode?

I gave the example of the vfio device, which has a different class
implementation depending on the iommufd option being set or not.

Do you mean that we should just create a regular smmuv3 device and
let a VFIO device to turn on this smmuv3's accel mode depending on
its LEGACY/IOMMUFD class?

no this is not what I meant. I gave an example where, depending on an
option passed to the VFIO device, you choose one class implementation or the
other.

Option means something like this:
-device smmuv3,accel=on
instead of
-device "smmuv3-accel"
?

Yea, I think that's good.

Yeah actually that's a big debate for not much. From an implementation
pov that shall not change much. The only doubt I have is, if we need to
conditionally expose the MSI RESV regions, it is easier to do if we
detect we have a smmuv3-accel. What the option allows is the auto mode.



Another question: how does an emulated device work with a vSMMUv3?

I don't get your question. vSMMUv3 currently only works with emulated
devices. Did you mean accelerated SMMUv3?

Yea. If "accel=on", how does an emulated device work with that?


I could imagine that all the accel steps would be bypassed since
!sdev->idev. Yet, the emulated iotlb should cache its translation
so we will need to flush the iotlb, which will increase complexity
as the TLBI command dispatching function will need to be aware what
ASID is for emulated device and what is for vfio device..

I don't get the issue. For emulated device you go through the usual
translate path which indeed caches configs and translations. In case the
guest invalidates something, you know the SID and you find the entries
in the cache that are tagged by this SID.

In case you have an accelerated device (indeed if sdev->idev) you don't
exercise that path. On invalidation you detect the SID matches a VFIO
devoce, propagate the invalidations to the host instead. on the
invalidation you should be able to detect pretty easily if you need to
flush the emulated caches or propagate the invalidations. Do I miss some
extra problematic?

I do not say we should support emulated devices and VFIO devices in the
same guest iommu group. But I don't see why we couldn't easily plug the
accelerated logic in the current logical for emulation/vhost and do not
require a different qemu device.

Hmm, feels like I fundamentally misunderstood your point.
  a) We implement the device model with the same piece of code but
 only provide an option "accel=on/off" to switch mode. And both
 passthrough devices and emulated devices can attach to the same
 "accel=on" device.

I think we all agree we don't want that use case in general. However,
effectively I was questioning why it couldn't work, maybe at the expense
of some perf degradation.

  b) We implement the device model with the same piece of code but
 only provide an option "accel=on/off" to switch mode. Then, an
 passthrough device can attach to an "accel=on" device, but an
 emulated device can only attach to an "accel=off" SMMU device.

I was thinking that you want case (a). But actually you were just
talking about case (b)? I think (b) is totally fine.

We certainly can't do case (a): not all TLBI commands give an "SID"
field (so we would have to broadcast, i.e. the underlying SMMU HW would run
commands that were meant for emulated devices only); in case of
vCMDQ, commands for emulated devices would be issued to real HW and

I am still confused about that. For instance, if the guest sends an
NH_ASID or NH_VA invalidation and it happens that both the emulated device and
the VFIO device share the same cd.asid (same guest iommu domain, which
practically should not happen), why shouldn't we propagate the

it can't ...

Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device

2025-03-20 Thread Donald Dutile




On 3/19/25 1:04 PM, Eric Auger wrote:




On 3/18/25 10:22 PM, Donald Dutile wrote:



On 3/18/25 3:13 PM, Nicolin Chen wrote:

On Tue, Mar 18, 2025 at 07:31:36PM +0100, Eric Auger wrote:

On 3/17/25 9:19 PM, Nicolin Chen wrote:

On Mon, Mar 17, 2025 at 04:24:53PM -0300, Jason Gunthorpe wrote:

On Mon, Mar 17, 2025 at 12:10:19PM -0700, Nicolin Chen wrote:

Another question: how does an emulated device work with a vSMMUv3?
I could imagine that all the accel steps would be bypassed since
!sdev->idev. Yet, the emulated iotlb should cache its translation
so we will need to flush the iotlb, which will increase complexity
as the TLBI command dispatching function will need to be aware what
ASID is for emulated device and what is for vfio device..

I think you should block it. We already expect different vSMMU's
depending on the physical SMMU under the PCI device, it makes sense
that a SW VFIO device would have it's own, non-accelerated, vSMMU
model in the guest.

Yea, I agree and it'd be cleaner for an implementation separating
them.

In my mind, the general idea of "accel=on" is also to keep things
in a more efficient way: passthrough devices go to HW-accelerated
vSMMUs (separated PCIE buses), while emulated ones go to a vSMMU-
bypassed (PCIE0).



Originally a specific SMMU device was needed to opt in for MSI reserved
region ACPI IORT description which are not needed if you don't rely on
S1+S2. However if we don't rely on this trick this was not even needed
with legacy integration
(https://patchwork.kernel.org/project/qemu-devel/cover/20180921081819.9203-1-eric.au...@redhat.com/).


Nevertheless I don't think anything prevents the acceleration granted
device from also working with virtio/vhost devices for instance unless
you unplug the existing infra. The translation and invalidation just
should use different control paths (explicit translation requests,
invalidation notifications towards vhost, ...).


smmuv3_translate() is per sdev, so it's easy.

Invalidation is done via commands, which could be tricky:
a) Broadcast command
b) ASID validation -- we'll need to keep track of a list of ASIDs
     for vfio device to compare the ASID in each per-ASID command,
     potentially by trapping all CFGI_CD(_ALL) commands? Note that
     each vfio device may have multiple ASIDs (for multiple CDs).
Either a or b above will have some validation efficiency impact.


Again, what does legitimate to have different qemu devices for the same
IP? I understand that it simplifies the implementation but I am not
sure
this is a good reason. Nevertheless it worth challenging. What is the
plan for intel iommu? Will we have 2 devices, the legacy device and one
for nested?


Hmm, it seems that there are two different topics:
1. Use one SMMU device model (source code file; "iommu=" string)
     for both an emulated vSMMU and an HW-accelerated vSMMU.
2. Allow one vSMMU instance to work with both an emulated device
     and a passthrough device.
And I get that you want both 1 and 2.

I'm totally okay with 1, yet see no compelling benefit from 2 for
the increased complexity in the invalidation routine.

And another question about the mixed device attachment. Let's say
we have in the host:
    VFIO passthrough dev0 -> pSMMU0
    VFIO passthrough dev1 -> pSMMU1
Should we allow emulated devices to be flexibly plugged?
    dev0 -> vSMMU0 /* Hard requirement */
    dev1 -> vSMMU1 /* Hard requirement */
    emu0 -> vSMMU0 /* Soft requirement; can be vSMMU1 also */
    emu1 -> vSMMU1 /* Soft requirement; can be vSMMU0 also */

Thanks
Nicolin


I agree w/Jason & Nicolin: different vSMMUs for pass-through devices
than emulated, & vice-versa.
Not mixing... because... of the next agreement:

you need to clarify what you mean by different vSMMUs: are you taking
about different instances or different qemu device types?

Both. A device needing to use a hw-accel feature has to use an smmu that has that
feature; an emulated device can use such an smmu, but as mentioned in other threads,
if you start with all emulated devices on one smmu and then hot-plug an (assigned) device,
it needs another smmu that has hw-accel features.
Keeping them split makes it easier at config time, and it may enable the code
to be simpler... the other half of my brain wants common code paths with accel/emulate
branches, but a different smmu instance will likely simplify the smmu-(accel-)specific lookups.



I agree with Eric that 'accel' isn't needed -- this should be
ascertained from the pSMMU that a physical device is attached to.

we can simply use an AUTO_ON_OFF property and by default choose AUTO
value. That would close the debate ;-)


Preaching to the choir... yes.


Eric

Now... how does vfio(?; why not qemu?) layer determine that? -- where
are SMMUv3 'accel' features exposed either: a) in the device struct
(for the smmuv3) or (b) somewhere under sysfs?

Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb-pcie bus

2025-03-20 Thread Donald Dutile




On 3/19/25 2:21 PM, Eric Auger wrote:

Hi Don,


On 3/19/25 5:21 PM, Donald Dutile wrote:



On 3/19/25 5:26 AM, Shameerali Kolothum Thodi wrote:

Hi Don,


Hey!


-Original Message-
From: Donald Dutile 
Sent: Tuesday, March 18, 2025 10:12 PM
To: Shameerali Kolothum Thodi
; qemu-...@nongnu.org;
qemu-devel@nongnu.org
Cc: eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com;
nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com;
mo...@nvidia.com; smost...@google.com; Linuxarm
; Wangzhou (B) ;
jiangkunkun ; Jonathan Cameron
; zhangfei@linaro.org
Subject: Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb-
pcie bus

Shameer,

Hi!

On 3/11/25 10:10 AM, Shameer Kolothum wrote:

User must associate a pxb-pcie root bus to smmuv3-accel
and that is set as the primary-bus for the smmu dev.

Signed-off-by: Shameer Kolothum



---
    hw/arm/smmuv3-accel.c | 19 +++
    1 file changed, 19 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index c327661636..1471b65374 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -9,6 +9,21 @@
    #include "qemu/osdep.h"

    #include "hw/arm/smmuv3-accel.h"
+#include "hw/pci/pci_bridge.h"
+
+static int smmuv3_accel_pxb_pcie_bus(Object *obj, void *opaque)
+{
+    DeviceState *d = opaque;
+
+    if (object_dynamic_cast(obj, "pxb-pcie-bus")) {
+    PCIBus *bus = PCI_HOST_BRIDGE(obj->parent)->bus;
+    if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus-
name)) {
+    object_property_set_link(OBJECT(d), "primary-bus",
OBJECT(bus),
+ &error_abort);
+    }
+    }
+    return 0;
+}

    static void smmu_accel_realize(DeviceState *d, Error **errp)
    {
@@ -17,6 +32,9 @@ static void smmu_accel_realize(DeviceState *d, Error

**errp)

    SysBusDevice *dev = SYS_BUS_DEVICE(d);
    Error *local_err = NULL;

+    object_child_foreach_recursive(object_get_root(),
+   smmuv3_accel_pxb_pcie_bus, d);
+
    object_property_set_bool(OBJECT(dev), "accel", true,
&error_abort);
    c->parent_realize(d, &local_err);
    if (local_err) {
@@ -33,6 +51,7 @@ static void smmuv3_accel_class_init(ObjectClass

*klass, void *data)

    device_class_set_parent_realize(dc, smmu_accel_realize,
    &c->parent_realize);
    dc->hotpluggable = false;
+    dc->bus_type = TYPE_PCIE_BUS;
    }

    static const TypeInfo smmuv3_accel_type_info = {


I am not seeing the need for a pxb-pcie bus(switch) introduced for each
'accel'.
Isn't the IORT able to define different SMMUs for different RIDs?
if so,
itsn't that sufficient
to associate (define) an SMMU<->RID association without introducing a
pxb-pcie?
and again, I'm not sure how that improves/enables the device<->SMMU
associativity?


Thanks for taking a look at the series. As discussed elsewhere in this thread (with
Eric), normally in the physical world (or at least in the most common cases) SMMUv3
is attached to a PCIe Root Complex, and if you take a look at the IORT spec, it
describes the association of ID mappings between an RC node and an SMMUV3 node.

And if my understanding is correct, in Qemu only pxb-pcie allows you to add
extra root complexes, even though it is still plugged into the parent (pcie.0); i.e.,
for all devices downstream it acts as a root complex but is still plugged into a
parent pcie.0. This allows us to add/describe multiple "smmuv3-accel", each
associated with an RC.


I find the qemu statements a bit unclear here as well.
I looked at the hot-plug statement(s) in docs/pcie.txt, as I figured that's where
dynamic IORT changes would be needed as well.  There, it says you can hot-add
PCIe devices to RPs, but one has to define/add RPs to the machine model for that plug-in.

Using libvirt, it could auto-add the needed RPs to do dynamic smmuv3
additions,

I am not sure I understand your statement here. we don't want "dynamic"
SMMUv3 instantiation. SMMUv3 is a platform device which is supposed to
be coldplugged on a pre-existing PCIe hierarchy. The SMMUv3 device is
not something that is meant to be hotplugged or hotunplugged.
To me we hijack the bus= property to provide information about the IORT
IDMAP


Dynamic in the sense that if one adds smmuv3 for multiple devices,
libvirt will dynamically figure out how to instantiate one, two, three... smmu's
in the machine at cold boot.
If you want a machine to be able to hot-plug a device that would require another smmu,
then the config, and smmu, would have to be explicitly stated; as is done today for
hot-plug PCIe if the simple machine that libvirt would make is not sufficient to
hot-add a PCIe device.


Thanks

Eric

if I understand how libvirt does that today for pcie devices now (/me
looks at d

Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb-pcie bus

2025-03-24 Thread Donald Dutile




On 3/24/25 10:56 AM, Eric Auger wrote:



On 3/21/25 1:59 AM, Donald Dutile wrote:



On 3/19/25 2:21 PM, Eric Auger wrote:

Hi Don,


On 3/19/25 5:21 PM, Donald Dutile wrote:



On 3/19/25 5:26 AM, Shameerali Kolothum Thodi wrote:

Hi Don,


Hey!


-Original Message-
From: Donald Dutile 
Sent: Tuesday, March 18, 2025 10:12 PM
To: Shameerali Kolothum Thodi
; qemu-...@nongnu.org;
qemu-devel@nongnu.org
Cc: eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com;
nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com;
mo...@nvidia.com; smost...@google.com; Linuxarm
; Wangzhou (B) ;
jiangkunkun ; Jonathan Cameron
; zhangfei@linaro.org
Subject: Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a
pxb-
pcie bus

Shameer,

Hi!

On 3/11/25 10:10 AM, Shameer Kolothum wrote:

User must associate a pxb-pcie root bus to smmuv3-accel
and that is set as the primary-bus for the smmu dev.

Signed-off-by: Shameer Kolothum



---
     hw/arm/smmuv3-accel.c | 19 +++
     1 file changed, 19 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index c327661636..1471b65374 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -9,6 +9,21 @@
     #include "qemu/osdep.h"

     #include "hw/arm/smmuv3-accel.h"
+#include "hw/pci/pci_bridge.h"
+
+static int smmuv3_accel_pxb_pcie_bus(Object *obj, void *opaque)
+{
+    DeviceState *d = opaque;
+
+    if (object_dynamic_cast(obj, "pxb-pcie-bus")) {
+    PCIBus *bus = PCI_HOST_BRIDGE(obj->parent)->bus;
+    if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus-
name)) {
+    object_property_set_link(OBJECT(d), "primary-bus",
OBJECT(bus),
+ &error_abort);
+    }
+    }
+    return 0;
+}

     static void smmu_accel_realize(DeviceState *d, Error **errp)
     {
@@ -17,6 +32,9 @@ static void smmu_accel_realize(DeviceState *d,
Error

**errp)

     SysBusDevice *dev = SYS_BUS_DEVICE(d);
     Error *local_err = NULL;

+    object_child_foreach_recursive(object_get_root(),
+   smmuv3_accel_pxb_pcie_bus, d);
+
     object_property_set_bool(OBJECT(dev), "accel", true,
&error_abort);
     c->parent_realize(d, &local_err);
     if (local_err) {
@@ -33,6 +51,7 @@ static void smmuv3_accel_class_init(ObjectClass

*klass, void *data)

     device_class_set_parent_realize(dc, smmu_accel_realize,
     &c->parent_realize);
     dc->hotpluggable = false;
+    dc->bus_type = TYPE_PCIE_BUS;
     }

     static const TypeInfo smmuv3_accel_type_info = {


I am not seeing the need for a pxb-pcie bus(switch) introduced for
each
'accel'.
Isn't the IORT able to define different SMMUs for different RIDs?
if so,
isn't that sufficient
to associate (define) an SMMU<->RID association without introducing a
pxb-pcie?
and again, I'm not sure how that improves/enables the device<->SMMU
associativity?


Thanks for taking a look at the series. As discussed elsewhere in
this thread(with
Eric), normally in physical world (or at least in the most common
cases) SMMUv3
is attached to PCIe Root Complex and if you take a look at the IORT
spec, it describes
association of ID mappings between a RC node and SMMUV3 node.

And if my understanding is correct, in Qemu, only pxb-pcie allows you
to add
extra root complexes even though it is still plugged to
parent(pcie.0). ie, for all
devices downstream it acts as a root complex but still plugged into a
parent pcie.0.
This allows us to add/describe multiple "smmuv3-accel" each
associated with a RC.


I find the qemu statements a bit unclear here as well.
I looked at the hot plug statement(s) in docs/pcie.txt, as I figured
that's where dynamic
IORT changes would be needed as well.  There, it says you can hot-add
PCIe devices to RPs,
one has to define/add RP's to the machine model for that plug-in.

Using libvirt, it could auto-add the needed RPs to do dynamic smmuv3
additions,

I am not sure I understand your statement here. we don't want "dynamic"
SMMUv3 instantiation. SMMUv3 is a platform device which is supposed to
be coldplugged on a pre-existing PCIe hierarchy. The SMMUv3 device is
not something that is meant to be hotplugged or hotunplugged.
To me we hijack the bus= property to provide information about the IORT
IDMAP


Dynamic in the sense that if one adds smmuv3 for multiple devices,
libvirt will dynamically figure out how to instantiate one, two,
three... smmu's
in the machine at cold boot.
If you want a machine to be able to hot-plug a device that would
require another smmu,
then the config, and smmu, would have to be explicitly stated; as is
done today for
hot-plug PCIe if the simple machine that libvirt would make is not
sufficient to
hot-add a PCIe devic

Re: [RFC PATCH v2 16/20] hw/arm/smmuv3-accel: Read host SMMUv3 device info

2025-03-18 Thread Donald Dutile

Shameer,

Hey,


On 3/11/25 10:10 AM, Shameer Kolothum wrote:

From: Nicolin Chen 

Read the underlying SMMUv3 device info and set corresponding IDR
bits. We need at least one cold-plugged vfio-pci dev associated
with the smmuv3-accel instance to do this now.  Hence fail if it
is not available.

ToDo: The above requirement will be relaxed in future when we add
support in the kernel.

Signed-off-by: Nicolin Chen 
Signed-off-by: Shameer Kolothum 
---
  hw/arm/smmuv3-accel.c | 104 ++
  hw/arm/trace-events   |   1 +
  include/hw/arm/smmuv3-accel.h |   2 +
  3 files changed, 107 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 09be838d22..fb08e1d66b 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -15,6 +15,96 @@
  
  #include "smmuv3-internal.h"
  
+static int

+smmuv3_accel_dev_get_info(SMMUv3AccelDevice *accel_dev, uint32_t *data_type,
+  uint32_t data_len, void *data)
+{
+uint64_t caps;
+
+if (!accel_dev || !accel_dev->idev) {
+return -ENOENT;
+}
+
+return !iommufd_backend_get_device_info(accel_dev->idev->iommufd,
+accel_dev->idev->devid,
+data_type, data,
+data_len, &caps, NULL);
+}
+
+static void smmuv3_accel_init_regs(SMMUv3AccelState *s_accel)
+{
+SMMUv3State *s = ARM_SMMUV3(s_accel);
+SMMUv3AccelDevice *accel_dev;
+uint32_t data_type;
+uint32_t val;
+int ret;
+
+if (!s_accel->viommu || QLIST_EMPTY(&s_accel->viommu->device_list)) {
+error_report("At least one cold-plugged vfio-pci is required for 
smmuv3-accel!");
+exit(1);
+}
+
+accel_dev = QLIST_FIRST(&s_accel->viommu->device_list);
+if (accel_dev->info.idr[0]) {
+info_report("reusing the previous hw_info");
+goto out;
+}
+
+ret = smmuv3_accel_dev_get_info(accel_dev, &data_type,
+sizeof(accel_dev->info), &accel_dev->info);
+if (ret) {
+error_report("failed to get SMMU device info");
+return;
+}
+
+if (data_type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) {
+error_report("Wrong data type (%d)!", data_type);
+return;
+}
+
+out:
+trace_smmuv3_accel_get_device_info(accel_dev->info.idr[0],
+   accel_dev->info.idr[1],
+   accel_dev->info.idr[3],
+   accel_dev->info.idr[5]);
+
+val = FIELD_EX32(accel_dev->info.idr[0], IDR0, BTM);
+s->idr[0] = FIELD_DP32(s->idr[0], IDR0, BTM, val);
+val = FIELD_EX32(accel_dev->info.idr[0], IDR0, ATS);
+s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ATS, val);
+val = FIELD_EX32(accel_dev->info.idr[0], IDR0, ASID16);
+s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ASID16, val);
+val = FIELD_EX32(accel_dev->info.idr[0], IDR0, TERM_MODEL);
+s->idr[0] = FIELD_DP32(s->idr[0], IDR0, TERM_MODEL, val);
+val = FIELD_EX32(accel_dev->info.idr[0], IDR0, STALL_MODEL);
+s->idr[0] = FIELD_DP32(s->idr[0], IDR0, STALL_MODEL, val);
+val = FIELD_EX32(accel_dev->info.idr[0], IDR0, STLEVEL);
+s->idr[0] = FIELD_DP32(s->idr[0], IDR0, STLEVEL, val);
+
+val = FIELD_EX32(accel_dev->info.idr[1], IDR1, SIDSIZE);
+s->idr[1] = FIELD_DP32(s->idr[1], IDR1, SIDSIZE, val);
+val = FIELD_EX32(accel_dev->info.idr[1], IDR1, SSIDSIZE);
+s->idr[1] = FIELD_DP32(s->idr[1], IDR1, SSIDSIZE, val);
+
+val = FIELD_EX32(accel_dev->info.idr[3], IDR3, HAD);
+s->idr[3] = FIELD_DP32(s->idr[3], IDR3, HAD, val);
+val = FIELD_EX32(accel_dev->info.idr[3], IDR3, RIL);
+s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, val);
+val = FIELD_EX32(accel_dev->info.idr[3], IDR3, BBML);
+s->idr[3] = FIELD_DP32(s->idr[3], IDR3, BBML, val);
+
+val = FIELD_EX32(accel_dev->info.idr[5], IDR5, GRAN4K);
+s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, val);
+val = FIELD_EX32(accel_dev->info.idr[5], IDR5, GRAN16K);
+s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, val);
+val = FIELD_EX32(accel_dev->info.idr[5], IDR5, GRAN64K);
+s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, val);
+val = FIELD_EX32(accel_dev->info.idr[5], IDR5, OAS);
+s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, val);
+
+/* FIXME check iidr and aidr registrs too */
+}
+
  static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *s, SMMUPciBus *sbus,
  PCIBus *bus, int devfn)
  {
@@ -484,11 +574,25 @@ static void smmu_accel_realize(DeviceState *d, Error 
**errp)
  bs->unset_iommu_device = smmuv3_accel_unset_iommu_device;
  }
  
+static void smmuv3_accel_reset_hold(Object *obj, ResetType type)

+{
+SMMUv3AccelState *s = ARM_SMMUV3_ACCEL(obj);
+SMMUv3AccelClass *c = ARM_SMMUV3_ACCEL_GET_CLASS(s);
+
+if (c->parent_phases.hold) {
+c->

Re: [RFC PATCH v2 18/20] hw/arm/smmu-common: Bypass emulated IOTLB for a accel SMMUv3

2025-03-18 Thread Donald Dutile




On 3/11/25 10:10 AM, Shameer Kolothum wrote:

From: Nicolin Chen 

If a vSMMU is configured as a accelerated one, HW IOTLB will be used
and all cache invalidation should be done to the HW IOTLB too, v.s.
the emulated iotlb. In this case, an iommu notifier isn't registered,
as the devices behind a SMMUv3-accel would stay in the system address
space for stage-2 mappings.

However, the KVM code still requests an iommu address space to translate
an MSI doorbell gIOVA via get_msi_address_space() and translate().

Since a SMMUv3-accel doesn't register an iommu notifier to flush emulated
iotlb, bypass the emulated IOTLB and always walk through the guest-level
IO page table.

Signed-off-by: Nicolin Chen 
Signed-off-by: Shameer Kolothum 
---
  hw/arm/smmu-common.c | 21 +
  1 file changed, 21 insertions(+)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 9fd455baa0..fd10df8866 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -77,6 +77,17 @@ static SMMUTLBEntry *smmu_iotlb_lookup_all_levels(SMMUState 
*bs,
  uint8_t level = 4 - (inputsize - 4) / stride;
  SMMUTLBEntry *entry = NULL;
  
+/*

+ * Stage-1 translation with a accel SMMU in general uses HW IOTLB. However,
+ * KVM still requests for an iommu address space for an MSI fixup by 
looking
+ * up stage-1 page table. Make sure we don't go through the emulated 
pathway
+ * so that the emulated iotlb will not need any invalidation.
+ */
+
+if (bs->accel) {
+return NULL;
+}
+
  while (level <= 3) {
  uint64_t subpage_size = 1ULL << level_shift(level, tt->granule_sz);
  uint64_t mask = subpage_size - 1;
@@ -142,6 +153,16 @@ void smmu_iotlb_insert(SMMUState *bs, SMMUTransCfg *cfg, 
SMMUTLBEntry *new)
  SMMUIOTLBKey *key = g_new0(SMMUIOTLBKey, 1);
  uint8_t tg = (new->granule - 10) / 2;
  
+/*

+ * Stage-1 translation with a accel SMMU in general uses HW IOTLB. However,
+ * KVM still requests for an iommu address space for an MSI fixup by 
looking
+ * up stage-1 page table. Make sure we don't go through the emulated 
pathway
+ * so that the emulated iotlb will not need any invalidation.
+ */
+if (bs->accel) {
+return;
+}
+
  if (g_hash_table_size(bs->iotlb) >= SMMU_IOTLB_MAX_SIZE) {
  smmu_iotlb_inv_all(bs);
  }


Ah! ... if 'accel', skip emulated code since hw handling it... in common smmu 
code... I like it! :)
- Don




Re: [RFC PATCH v2 13/20] hw/arm/smmuv3-accel: Introduce helpers to batch and issue cache invalidations

2025-03-18 Thread Donald Dutile

Shameer,

Hi,


On 3/11/25 10:10 AM, Shameer Kolothum wrote:

From: Nicolin Chen 

Inroduce an SMMUCommandBatch and some helpers to batch and issue the

   Introduce

commands.  Currently separate out TLBI commands and device cache commands
to avoid some errata on certain versions of SMMUs. Later it should check
IIDR register to detect if underlying SMMU hw has such an erratum.

Where is all this info about 'certain versions of SMMUs', and what does
'check IIDR register' have to do with 'underlying SMMU hw having such an
erratum'?
-- which IIDR (& bits)? or are we talking about rsvd SMMU_IDR<> registers?


And can't these helpers be used for emulated smmuv3 as well as accelerated?

Thanks,
- Don


Signed-off-by: Nicolin Chen 
Signed-off-by: Shameer Kolothum 
---
  hw/arm/smmuv3-accel.c| 69 
  hw/arm/smmuv3-internal.h | 29 +
  2 files changed, 98 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 76134d106a..09be838d22 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -160,6 +160,75 @@ void smmuv3_accel_install_nested_ste(SMMUDevice *sdev, int 
sid)
nested_data.ste[0]);
  }
  
+/* Update batch->ncmds to the number of execute cmds */

+int smmuv3_accel_issue_cmd_batch(SMMUState *bs, SMMUCommandBatch *batch)
+{
+SMMUv3AccelState *s_accel = ARM_SMMUV3_ACCEL(bs);
+uint32_t total = batch->ncmds;
+IOMMUFDViommu *viommu_core;
+int ret;
+
+if (!bs->accel) {
+return 0;
+}
+
+if (!s_accel->viommu) {
+return 0;
+}
+viommu_core = &s_accel->viommu->core;
+ret = iommufd_backend_invalidate_cache(viommu_core->iommufd,
+   viommu_core->viommu_id,
+   
IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3,
+   sizeof(Cmd), &batch->ncmds,
+   batch->cmds);
+if (total != batch->ncmds) {
+error_report("%s failed: ret=%d, total=%d, done=%d",
+  __func__, ret, total, batch->ncmds);
+return ret;
+}
+
+batch->ncmds = 0;
+batch->dev_cache = false;
+return ret;
+}
+
+int smmuv3_accel_batch_cmds(SMMUState *bs, SMMUDevice *sdev,
+SMMUCommandBatch *batch, Cmd *cmd,
+uint32_t *cons, bool dev_cache)
+{
+int ret;
+
+if (!bs->accel) {
+return 0;
+}
+
+if (sdev) {
+SMMUv3AccelDevice *accel_dev;
+accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
+if (!accel_dev->s1_hwpt) {
+return 0;
+}
+}
+
+/*
+ * Currently separate out dev_cache and hwpt for safety, which might
+ * not be necessary if underlying HW SMMU does not have the errata.
+ *
+ * TODO check IIDR register values read from hw_info.
+ */
+if (batch->ncmds && (dev_cache != batch->dev_cache)) {
+ret = smmuv3_accel_issue_cmd_batch(bs, batch);
+if (ret) {
+*cons = batch->cons[batch->ncmds];
+return ret;
+}
+}
+batch->dev_cache = dev_cache;
+batch->cmds[batch->ncmds] = *cmd;
+batch->cons[batch->ncmds++] = *cons;
+return 0;
+}
+
  static bool
  smmuv3_accel_dev_attach_viommu(SMMUv3AccelDevice *accel_dev,
 HostIOMMUDeviceIOMMUFD *idev, Error **errp)
diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
index 46c8bcae14..4602ae6728 100644
--- a/hw/arm/smmuv3-internal.h
+++ b/hw/arm/smmuv3-internal.h
@@ -549,13 +549,42 @@ typedef struct CD {
  uint32_t word[16];
  } CD;
  
+/**

+ * SMMUCommandBatch - batch of invalidation commands for smmuv3-accel
+ * @cmds: Pointer to list of commands
+ * @cons: Pointer to list of CONS corresponding to the commands
+ * @ncmds: Total ncmds in the batch
+ * @dev_cache: Issue to a device cache
+ */
+typedef struct SMMUCommandBatch {
+Cmd *cmds;
+uint32_t *cons;
+uint32_t ncmds;
+bool dev_cache;
+} SMMUCommandBatch;
+
  int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste,
SMMUEventInfo *event);
  void smmuv3_flush_config(SMMUDevice *sdev);
  
  #if defined(CONFIG_ARM_SMMUV3_ACCEL) && defined(CONFIG_IOMMUFD)

+int smmuv3_accel_issue_cmd_batch(SMMUState *bs, SMMUCommandBatch *batch);
+int smmuv3_accel_batch_cmds(SMMUState *bs, SMMUDevice *sdev,
+SMMUCommandBatch *batch, Cmd *cmd,
+uint32_t *cons, bool dev_cache);
  void smmuv3_accel_install_nested_ste(SMMUDevice *sdev, int sid);
  #else
+static inline int smmuv3_accel_issue_cmd_batch(SMMUState *bs,
+   SMMUCommandBatch *batch)
+{
+return 0;
+}
+static inline int smmuv3_accel_batch_cmds(SMMUState *bs, SMMUDevice *sdev,
+  SMMUCommandBatch *batch, Cmd *cmd,
+

Re: [RFC PATCH v2 11/20] hw/arm/smmuv3-accel: Allocate a vDEVICE object for device

2025-03-18 Thread Donald Dutile

Shameer,

Hi,

On 3/11/25 10:10 AM, Shameer Kolothum wrote:

From: Nicolin Chen 

Allocate and associate a vDEVICE object for the Guest device
with the vIOMMU. This will help the kernel to do the
vSID --> sid translation whenever required (eg: device specific
invalidations).

Signed-off-by: Nicolin Chen 
Signed-off-by: Shameer Kolothum 
---
  hw/arm/smmuv3-accel.c | 22 ++
  include/hw/arm/smmuv3-accel.h |  6 ++
  2 files changed, 28 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index d3a5cf9551..056bd23b2e 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -109,6 +109,20 @@ void smmuv3_accel_install_nested_ste(SMMUDevice *sdev, int 
sid)
  return;
  }
  
+if (!accel_dev->vdev && accel_dev->idev) {

+SMMUVdev *vdev;
+uint32_t vdev_id;
+SMMUViommu *viommu = accel_dev->viommu;
+
+iommufd_backend_alloc_vdev(viommu->core.iommufd, 
accel_dev->idev->devid,
+   viommu->core.viommu_id, sid, &vdev_id,
+   &error_abort);
+vdev = g_new0(SMMUVdev, 1);
+vdev->vdev_id = vdev_id;
+vdev->sid = sid;
+accel_dev->vdev = vdev;
+}
+
  ret = smmu_find_ste(sdev->smmu, sid, &ste, &event);
  if (ret) {
  /*
@@ -283,6 +297,7 @@ static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void 
*opaque, int devfn,
  static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
  int devfn)
  {
+SMMUVdev *vdev;
  SMMUDevice *sdev;
  SMMUv3AccelDevice *accel_dev;
  SMMUViommu *viommu;
@@ -312,6 +327,13 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, 
void *opaque,
  trace_smmuv3_accel_unset_iommu_device(devfn, smmu_get_sid(sdev));
  
  viommu = s_accel->viommu;

+vdev = accel_dev->vdev;
+if (vdev) {
+iommufd_backend_free_id(viommu->iommufd, vdev->vdev_id);
+g_free(vdev);
+accel_dev->vdev = NULL;
+}
+
  if (QLIST_EMPTY(&viommu->device_list)) {
  iommufd_backend_free_id(viommu->iommufd, viommu->bypass_hwpt_id);
  iommufd_backend_free_id(viommu->iommufd, viommu->abort_hwpt_id);
diff --git a/include/hw/arm/smmuv3-accel.h b/include/hw/arm/smmuv3-accel.h
index d6b0b1ca30..54b217ab4f 100644
--- a/include/hw/arm/smmuv3-accel.h
+++ b/include/hw/arm/smmuv3-accel.h
@@ -35,6 +35,11 @@ typedef struct SMMUViommu {
  QLIST_ENTRY(SMMUViommu) next;
  } SMMUViommu;
  
+typedef struct SMMUVdev {

+uint32_t vdev_id;
+uint32_t sid;
+} SMMUVdev;
+

Shouldn't this be 'IOMMUFDVdev' ... it's not an SMMU (v)dev , it's an 
IOMMUFD/vIOMMU vDEVICE for this SMMU



  typedef struct SMMUS1Hwpt {
  IOMMUFDBackend *iommufd;
  uint32_t hwpt_id;
@@ -45,6 +50,7 @@ typedef struct SMMUv3AccelDevice {
  HostIOMMUDeviceIOMMUFD *idev;
  SMMUS1Hwpt  *s1_hwpt;
  SMMUViommu *viommu;
+SMMUVdev   *vdev;
  QLIST_ENTRY(SMMUv3AccelDevice) next;
  } SMMUv3AccelDevice;
  





Re: [RFC PATCH v2 04/20] hw/arm/virt: Add support for smmuv3-accel

2025-03-18 Thread Donald Dutile



Doesn't this commit become moot, if accel becomes an smmuv3 option vs separate 
device object altogether,
dynamically added if a pdev is attached to a host SMMUv3 that has accel 
feature(s)?

Blocking w/virtio-iommu falls under the same situation mentioned in 03/20 wrt 
mixing emulated & physical devices on the same smmuv3.

- Don

On 3/11/25 10:10 AM, Shameer Kolothum wrote:

Allow cold-plug smmuv3-accel to virt If the machine wide smmuv3
is not specified.

No FDT support is added for now.

Signed-off-by: Shameer Kolothum 
---
  hw/arm/virt.c | 12 
  hw/core/sysbus-fdt.c  |  1 +
  include/hw/arm/virt.h |  1 +
  3 files changed, 14 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 4a5a9666e9..84a323da55 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -73,6 +73,7 @@
  #include "qobject/qlist.h"
  #include "standard-headers/linux/input.h"
  #include "hw/arm/smmuv3.h"
+#include "hw/arm/smmuv3-accel.h"
  #include "hw/acpi/acpi.h"
  #include "target/arm/cpu-qom.h"
  #include "target/arm/internals.h"
@@ -2911,6 +2912,16 @@ static void virt_machine_device_plug_cb(HotplugHandler 
*hotplug_dev,
  
platform_bus_link_device(PLATFORM_BUS_DEVICE(vms->platform_bus_dev),
   SYS_BUS_DEVICE(dev));
  }
+if (object_dynamic_cast(OBJECT(dev), TYPE_ARM_SMMUV3_ACCEL)) {
+if (vms->iommu == VIRT_IOMMU_SMMUV3) {
+error_setg(errp,
+   "iommu=smmuv3 is already specified. can't create 
smmuv3-accel dev");
+return;
+}
+if (vms->iommu != VIRT_IOMMU_SMMUV3_ACCEL) {
+vms->iommu = VIRT_IOMMU_SMMUV3_ACCEL;
+}
+}
  }
  
  if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {

@@ -3120,6 +3131,7 @@ static void virt_machine_class_init(ObjectClass *oc, void 
*data)
  machine_class_allow_dynamic_sysbus_dev(mc, TYPE_VFIO_AMD_XGBE);
  machine_class_allow_dynamic_sysbus_dev(mc, TYPE_RAMFB_DEVICE);
  machine_class_allow_dynamic_sysbus_dev(mc, TYPE_VFIO_PLATFORM);
+machine_class_allow_dynamic_sysbus_dev(mc, TYPE_ARM_SMMUV3_ACCEL);
  #ifdef CONFIG_TPM
  machine_class_allow_dynamic_sysbus_dev(mc, TYPE_TPM_TIS_SYSBUS);
  #endif
diff --git a/hw/core/sysbus-fdt.c b/hw/core/sysbus-fdt.c
index 774c0aed41..c8502ad830 100644
--- a/hw/core/sysbus-fdt.c
+++ b/hw/core/sysbus-fdt.c
@@ -489,6 +489,7 @@ static const BindingEntry bindings[] = {
  #ifdef CONFIG_LINUX
  TYPE_BINDING(TYPE_VFIO_CALXEDA_XGMAC, add_calxeda_midway_xgmac_fdt_node),
  TYPE_BINDING(TYPE_VFIO_AMD_XGBE, add_amd_xgbe_fdt_node),
+TYPE_BINDING("arm-smmuv3-accel", no_fdt_node),
  VFIO_PLATFORM_BINDING("amd,xgbe-seattle-v1a", add_amd_xgbe_fdt_node),
  #endif
  #ifdef CONFIG_TPM
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index c8e94e6aed..849d1cd5b5 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -92,6 +92,7 @@ enum {
  typedef enum VirtIOMMUType {
  VIRT_IOMMU_NONE,
  VIRT_IOMMU_SMMUV3,
+VIRT_IOMMU_SMMUV3_ACCEL,
  VIRT_IOMMU_VIRTIO,
  } VirtIOMMUType;
  





Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb-pcie bus

2025-03-19 Thread Donald Dutile




On 3/19/25 5:26 AM, Shameerali Kolothum Thodi wrote:

Hi Don,


Hey!


-Original Message-
From: Donald Dutile 
Sent: Tuesday, March 18, 2025 10:12 PM
To: Shameerali Kolothum Thodi
; qemu-...@nongnu.org;
qemu-devel@nongnu.org
Cc: eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com;
nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com;
mo...@nvidia.com; smost...@google.com; Linuxarm
; Wangzhou (B) ;
jiangkunkun ; Jonathan Cameron
; zhangfei@linaro.org
Subject: Re: [RFC PATCH v2 05/20] hw/arm/smmuv3-accel: Associate a pxb-
pcie bus

Shameer,

Hi!

On 3/11/25 10:10 AM, Shameer Kolothum wrote:

User must associate a pxb-pcie root bus to smmuv3-accel
and that is set as the primary-bus for the smmu dev.

Signed-off-by: Shameer Kolothum



---
   hw/arm/smmuv3-accel.c | 19 +++
   1 file changed, 19 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index c327661636..1471b65374 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -9,6 +9,21 @@
   #include "qemu/osdep.h"

   #include "hw/arm/smmuv3-accel.h"
+#include "hw/pci/pci_bridge.h"
+
+static int smmuv3_accel_pxb_pcie_bus(Object *obj, void *opaque)
+{
+DeviceState *d = opaque;
+
+if (object_dynamic_cast(obj, "pxb-pcie-bus")) {
+PCIBus *bus = PCI_HOST_BRIDGE(obj->parent)->bus;
+if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus-
name)) {
+object_property_set_link(OBJECT(d), "primary-bus", OBJECT(bus),
+ &error_abort);
+}
+}
+return 0;
+}

   static void smmu_accel_realize(DeviceState *d, Error **errp)
   {
@@ -17,6 +32,9 @@ static void smmu_accel_realize(DeviceState *d, Error

**errp)

   SysBusDevice *dev = SYS_BUS_DEVICE(d);
   Error *local_err = NULL;

+object_child_foreach_recursive(object_get_root(),
+   smmuv3_accel_pxb_pcie_bus, d);
+
   object_property_set_bool(OBJECT(dev), "accel", true, &error_abort);
   c->parent_realize(d, &local_err);
   if (local_err) {
@@ -33,6 +51,7 @@ static void smmuv3_accel_class_init(ObjectClass

*klass, void *data)

   device_class_set_parent_realize(dc, smmu_accel_realize,
   &c->parent_realize);
   dc->hotpluggable = false;
+dc->bus_type = TYPE_PCIE_BUS;
   }

   static const TypeInfo smmuv3_accel_type_info = {


I am not seeing the need for a pxb-pcie bus(switch) introduced for each
'accel'.
Isn't the IORT able to define different SMMUs for different RIDs?   if so,
isn't that sufficient
to associate (define) an SMMU<->RID association without introducing a
pxb-pcie?
and again, I'm not sure how that improves/enables the device<->SMMU
associativity?


Thanks for taking a look at the series. As discussed elsewhere in this 
thread(with
Eric), normally in physical world (or at least in the most common cases) SMMUv3
is attached to PCIe Root Complex and if you take a look at the IORT spec, it 
describes
association of ID mappings between a RC node and SMMUV3 node.

And if my understanding is correct, in Qemu, only pxb-pcie allows you to add
extra root complexes even though it is still plugged to parent(pcie.0). ie, for 
all
devices downstream it acts as a root complex but still plugged into a parent 
pcie.0.
This allows us to add/describe multiple "smmuv3-accel" each associated with a 
RC.


I find the qemu statements a bit unclear here as well.
I looked at the hot plug statement(s) in docs/pcie.txt, as I figured that's 
where dynamic
IORT changes would be needed as well.  There, it says you can hot-add PCIe 
devices to RPs,
one has to define/add RP's to the machine model for that plug-in.

Using libvirt, it could auto-add the needed RPs to do dynamic smmuv3 additions,
if I understand how libvirt does that today for pcie devices now (/me looks at 
danpb for feedback).


Having said that,  current code only allows pxb-pcie root complexes avoiding
the pcie.0. The idea behind this was, user can use pcie.0 with a non accel 
SMMUv3
for any emulated devices avoiding the performance bottlenecks we are
discussing for emulated dev+smmuv3-accel cases. But based on the feedback from
Eric and Daniel I will relax that restriction and will allow association with 
pcie.0.


So, I think this isn't a restriction that this smmuv3 feature should enforce;
lack of a proper RP or pxb-pcie will yield an invalid config issue/error, and
the machine definition will be modified to meet the needs for IORT.


Thanks,
Shameer








  


to root complexes.

Feel free to enlighten me where I may have mis-read/interpreted the IORT
& SMMUv3 specs.

Thanks,
- Don








Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device

2025-04-04 Thread Donald Dutile

Zhenzhong,

Hi!
Eric asked me to review this series.
Since it's rather late since you posted, I will summarize review feedback
below/at the bottom.

- Don

On 2/19/25 3:22 AM, Zhenzhong Duan wrote:

Hi,

Per Jason Wang's suggestion, iommufd nesting series[1] is split into
"Enable stage-1 translation for emulated device" series and
"Enable stage-1 translation for passthrough device" series.

This series is 2nd part focusing on passthrough device. We don't do
shadowing of guest page table for passthrough device but pass stage-1
page table to host side to construct a nested domain. There was some
effort to enable this feature in old days, see [2] for details.

The key design is to utilize the dual-stage IOMMU translation
(also known as IOMMU nested translation) capability in host IOMMU.
As the below diagram shows, guest I/O page table pointer in GPA
(guest physical address) is passed to host and be used to perform
the stage-1 address translation. Along with it, modifications to
present mappings in the guest I/O page table should be followed
with an IOTLB invalidation.

 .-.  .---.
 |   vIOMMU|  | Guest I/O page table  |
 | |  '---'
 ./
 | PASID Entry |--- PASID cache flush --+
 '-'|
 | |V
 | |   I/O page table pointer in GPA
 '-'
 Guest
 --| Shadow |---|
   vv   v
 Host
 .-.  ..
 |   pIOMMU|  |  FS for GIOVA->GPA |
 | |  ''
 ./  |
 | PASID Entry | V (Nested xlate)
 '\.--.
 | |   | SS for GPA->HPA, unmanaged domain|
 | |   '--'
 '-'
Where:
  - FS = First stage page tables
  - SS = Second stage page tables



I'd prefer the use of 's1' for stage1/First stage, and 's2' for stage2/second 
stage.
We don't need different terms for the same technology in the iommu/iommufd 
space(s).


There are some interactions between VFIO and vIOMMU
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
   subsystem. VFIO calls them to register/unregister HostIOMMUDevice
   instance to vIOMMU at vfio device realize stage.
* vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
   to bind/unbind device to IOMMUFD backed domains, either nested
   domain or not.

See below diagram:

 VFIO Device Intel IOMMU
 .-. .---.
 | | |   |
 |   .-|PCIIOMMUOps  |.-.|
 |   | IOMMUFD |(set_iommu_device)   || Host IOMMU  ||
 |   | Device  |>|| Device list ||
 |   .-|(unset_iommu_device) |.-.|
 | | |   |   |
 | | |   V   |
 |   .-|  HostIOMMUDeviceIOMMUFD |  .-.  |
 |   | IOMMUFD |(attach_hwpt)|  | Host IOMMU  |  |
 |   | link|<|  |   Device|  |
 |   .-|(detach_hwpt)|  .-.  |
 | | |   |   |
 | | |   ... |
 .-. .---.

Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
whenever possible and create new one on demand, also supports multiple
iommufd objects and ERRATA_772415.

E.g., Stage-2 page table could be shared by different devices if there
is no conflict and devices link to same iommufd object, i.e. devices
under same host IOMMU can share same stage-2 page table. If there is

and 'devices under the same guest'.
Different guests can't share the same stage-2 page tables.


conflict, i.e. there is one device under non cache coherency mode
which is different from others, it requires a separate stage-2 page
table in non-CC mode.

SPR platform has ERRATA_772415 which requires no readonly mappings
in stage-2 page table. This series supports creating VTDIOASContainer
with no readonly mappings. If there is a rare case that some IOMMUs
on a multiple IOMMU host have ERRATA_772415 and others not, this
design can still survive.

See below example diagram for a full view:

   IntelIOMMUState
  |
  V
 .--..--..---.
  

Re: [PATCH 2/5] vfio: Move realize() after attach_device()

2025-04-18 Thread Donald Dutile




On 4/11/25 6:17 AM, Zhenzhong Duan wrote:

Previously device attaching depends on realize() getting host iommu capabilities
to check dirty tracking support. Now we save a caps copy in VFIODevice and check
that copy for dirty tracking support, there is no dependency any more, move
realize() call after attach_device() call in vfio_device_attach(). Drop
vfio_device_hiod_realize() which looks redundant now.

Suggested-by: Cédric Le Goater 
Suggested-by: Donald Dutile 
Signed-off-by: Zhenzhong Duan 


Reviewed-by: Donald Dutile 




Re: [PATCH 0/5] Add support for user creatable SMMUv3 device

2025-04-18 Thread Donald Dutile

Shameer,
Hi!

First off, like the partitioning of these pieces.

On 4/15/25 4:10 AM, Shameer Kolothum wrote:

Hi All,

This patch series introduces support for a user-creatable SMMUv3 device
(-device arm-smmuv3-dev) in QEMU.


Can we drop the '-dev', as 'arm-smmu' is sufficient, as is '-device'?

I know, internal to QEMU, there's already an ARM_SMMU, but as suggested later,
if it uses the same struct, the qemu cmdline syntax is a bit less redundant.


The implementation is based on feedback received from the RFCv2[0]:
"hw/arm/virt: Add support for user-creatable accelerated SMMUv3"

Currently, QEMU's SMMUv3 emulation (iommu=smmuv3) is tied to the machine
and does not support instantiating multiple SMMUv3 devices—each associated
with a separate PCIe root complex. In contrast, real-world ARM systems
often include multiple SMMUv3 instances, each bound to a different PCIe
root complex.

This also lays the groundwork for supporting accelerated SMMUv3, as
proposed in the aforementioned RFC. Please note, the accelerated SMMUv3
support is not part of this series and will be sent out as a separate
series later on top of this one.

Summary of changes:

  -Introduces support for multiple -device arm-smmuv3-dev,bus=pcie.x
   instances.

   Example usage:

   -device arm-smmuv3-dev,bus=pcie.0
   -device virtio-net-pci,bus=pcie.0
   -device pxb-pcie,id=pcie.1,bus_nr=2
   -device arm-smmuv3-dev,bus=pcie.1
   -device pcie-root-port,id=pcie.port1,bus=pcie.1
   -device virtio-net-pci,bus=pcie.port1

   -Supports either the legacy iommu=smmuv3 option or the new
   "-device arm-smmuv3-dev" model.


nice! :)
Can it support both? i.e., some devices using a system-wide, legacy smmuv3,
and some pcie devices using the -device arm-smmuv3 ?

If not, then add a check that at minimum warns about, or at most fails on, that config.
I can see old machines being converted/upgraded to new machines,
and they will want to switch from legacy->device smmuv3, and catching
an edit/update to a machine change (or enabling automated conversion) would be 
helpful.


   -Adds device tree bindings for the new SMMUv3 device on the arm/virt
    machine only, and only for the default pcie.0 root complex.
    (Note: pxb-pcie root complexes are currently not supported with the
     device tree due to known limitations[1].)
   -Restricts usage of the new SMMUv3 device to virt machine versions > 9.2.
    This is to avoid creating new smmuv3 device for old virt machine versions
     and accidentally breaking the migration support.

Please take a look and let me know your feedback.

Thanks,
Shameer
[0]:https://lore.kernel.org/qemu-devel/20250311141045.66620-1-shameerali.kolothum.th...@huawei.com/
[1]:https://lore.kernel.org/qemu-devel/20230421165037.2506-1-jonathan.came...@huawei.com/

Shameer Kolothum (5):
   hw/arm/smmuv3: Introduce SMMUv3 device
   hw/arm/virt-acpi-build: Update IORT for multiple smmuv3 devices
   hw/arm/virt: Factor out common SMMUV3 dt bindings code
   hw/arm/virt: Add support for smmuv3 device
   hw/arm/smmuv3: Enable smmuv3 device creation

  hw/arm/smmuv3.c  |  55 ++
  hw/arm/virt-acpi-build.c | 119 +++
  hw/arm/virt.c| 109 ++-
  hw/core/sysbus-fdt.c |   3 +
  include/hw/arm/smmuv3.h  |  16 ++
  include/hw/arm/virt.h|   2 +
  6 files changed, 266 insertions(+), 38 deletions(-)






Re: [PATCH 2/5] hw/arm/virt-acpi-build: Update IORT for multiple smmuv3 devices

2025-04-18 Thread Donald Dutile




On 4/16/25 1:38 AM, Shameerali Kolothum Thodi wrote:




-Original Message-
From: Nicolin Chen 
Sent: Wednesday, April 16, 2025 5:19 AM
To: Shameerali Kolothum Thodi 
Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org;
eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com;
ddut...@redhat.com; berra...@redhat.com; nath...@nvidia.com;
mo...@nvidia.com; smost...@google.com; Linuxarm
; Wangzhou (B) ;
jiangkunkun ; Jonathan Cameron
; zhangfei@linaro.org
Subject: Re: [PATCH 2/5] hw/arm/virt-acpi-build: Update IORT for multiple
smmuv3 devices

On Tue, Apr 15, 2025 at 09:11:01AM +0100, Shameer Kolothum wrote:

+static int get_smmuv3_device(Object *obj, void *opaque)
+{
+GArray *sdev_blob = opaque;
+int min_bus, max_bus;
+VirtMachineState *vms;
+PlatformBusDevice *pbus;
+SysBusDevice *sbdev;
+SMMUv3Device sdev;
+hwaddr base;
+int irq;
+PCIBus *bus;


In a reverse christmas tree order? Or we could at least group
those similar types together.


Yeah. Reverse will look better I guess.



+vms = VIRT_MACHINE(qdev_get_machine());
+pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
+sbdev = SYS_BUS_DEVICE(obj);
+base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
+irq = platform_bus_get_irqn(pbus, sbdev, 0);
+
+base += vms->memmap[VIRT_PLATFORM_BUS].base;
+irq += vms->irqmap[VIRT_PLATFORM_BUS];
+
+pci_bus_range(bus, &min_bus, &max_bus);
+sdev.smmu_idmap.input_base = min_bus << 8;
+sdev.smmu_idmap.id_count = (max_bus - min_bus + 1) << 8;
+sdev.base = base;
+sdev.irq = irq + ARM_SPI_BASE;


Hmm, these base/irq local variables don't look necessary.


+g_array_append_val(sdev_blob, sdev);
+return 0;
+}
+
  /*
   * Input Output Remapping Table (IORT)
   * Conforms to "IO Remapping Table System Software on ARM

Platforms",

@@ -275,25 +330,42 @@ static void
  build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState

*vms)

  {
  int i, nb_nodes, rc_mapping_count;
-size_t node_size, smmu_offset = 0;
+size_t node_size, *smmu_offset = NULL;
  AcpiIortIdMapping *idmap;
+int num_smmus = 0;
  uint32_t id = 0;
  GArray *smmu_idmaps = g_array_new(false, true,

sizeof(AcpiIortIdMapping));

  GArray *its_idmaps = g_array_new(false, true,

sizeof(AcpiIortIdMapping));

+GArray *smmuv3_devices = g_array_new(false, true,

sizeof(SMMUv3Device));


  AcpiTable table = { .sig = "IORT", .rev = 3, .oem_id = vms->oem_id,
  .oem_table_id = vms->oem_table_id };
  /* Table 2 The IORT */
  acpi_table_begin(&table, table_data);

-if (vms->iommu == VIRT_IOMMU_SMMUV3) {
-AcpiIortIdMapping next_range = {0};
-
+nb_nodes = 2; /* RC, ITS */
+if (vms->iommu == VIRT_IOMMU_SMMUV3_DEV) {
+object_child_foreach_recursive(object_get_root(),
+   get_smmuv3_device, smmuv3_devices);
+ /* Sort the smmuv3-devices by smmu idmap input_base */
+g_array_sort(smmuv3_devices, smmuv3_dev_idmap_compare);
+/*  Fill smmu idmap from sorted smmuv3 devices array */
+for (i = 0; i < smmuv3_devices->len; i++) {
+SMMUv3Device *s = &g_array_index(smmuv3_devices,

SMMUv3Device, i);

+g_array_append_val(smmu_idmaps, s->smmu_idmap);
+}
+num_smmus = smmuv3_devices->len;
+} else if (vms->iommu == VIRT_IOMMU_SMMUV3) {
  object_child_foreach_recursive(object_get_root(),
 iort_host_bridges, smmu_idmaps);

  /* Sort the smmu idmap by input_base */
  g_array_sort(smmu_idmaps, iort_idmap_compare);


VIRT_IOMMU_SMMUV3 seems to fit the struct SMMUv3Device very well,
given that it has base, irq, and smmu_idmaps too?


One difference though is the legacy VIRT_IOMMU_SMMUV3 one is a global
Machine wide one and can be associated with multiple PCIe root complexes
which will result in smmu_idmaps array. See the iort_host_bridges() fn.


Didn't quite follow this logic; shouldn't the iort be built for devices
tagged for a virt-iommu or dependent on a pcie rp (bus num/bus-num range of an 
RP)?
Thus, it's just an IORT with one or more IORT nodes?
I could see how the current iort_host_bridge() may need a revision to work with 
-device arm-smmu configs.

Which would bring me back to my earlier question, and it's likely tied to this 
IORT construction:
 -- can you have both types defined in a machine model?




Maybe we could generalize the struct SMMUv3Device to store both
cases. Perhaps just SMMUv3AcpiInfo? And then ..



I didn't follow 'SMMUv3AcpiInfo' since I don't see that in current qemu tree;
or is the statement trying to say it's a 'mere matter of the proper acpi info
generation'?


Could do. But then you have to have  a smmu_idmaps array and then check
the length of it to cover legacy SMMUv3 case I guess.


Aren't they going to be different idmaps for different IORT nodes?


  > > @@ -341,10 +413,20 @@ build_

Re: [PATCH v2 0/6] Add support for user creatable SMMUv3 device

2025-05-02 Thread Donald Dutile




On 5/2/25 6:27 AM, Shameer Kolothum wrote:

Hi All,

Changes from v1:
https://lore.kernel.org/qemu-devel/20250415081104.71708-1-shameerali.kolothum.th...@huawei.com/

Addressed feedback on v1. Thanks to all.
1. Retained the same name as the legacy SMMUv3(arm-smmuv3) for new
device type as well (instead of arm-smmuv3-dev type usage in v1).
2. Changes to ACPI IORT to use the same struct SMMUv3Device for both
legacy SMMUv3 and the new SMMUV3 device
3. Removed the restriction of creating SMMUv3 dev if virt ver > 9.2.

Cover letter from v1:

This patch series introduces support for a user-creatable SMMUv3 device
(-device arm-smmuv3) in QEMU.

The implementation is based on feedback received from the RFCv2[0]:
"hw/arm/virt: Add support for user-creatable accelerated SMMUv3"

Currently, QEMU's SMMUv3 emulation (iommu=smmuv3) is tied to the machine

should it be clarified as 'to the machine's sysbus' ?

and does not support instantiating multiple SMMUv3 devices—each associated
with a separate PCIe root complex. In contrast, real-world ARM systems
often include multiple SMMUv3 instances, each bound to a different PCIe
root complex.

This also lays the groundwork for supporting accelerated SMMUv3, as
proposed in the aforementioned RFC. Please note, the accelerated SMMUv3
support is not part of this series and will be sent out as a separate
series later on top of this one.

Summary of changes:

  -Introduces support for multiple -device arm-smmuv3-dev,bus=pcie.x

Drop, as you've done elsewhere.


   instances.

   Example usage:

   -device arm-smmuv3,bus=pcie.0
   -device virtio-net-pci,bus=pcie.0
   -device pxb-pcie,id=pcie.1,bus_nr=2
   -device arm-smmuv3,bus=pcie.1
   -device pcie-root-port,id=pcie.port1,bus=pcie.1
   -device virtio-net-pci,bus=pcie.port1

   -Supports either the legacy iommu=smmuv3 option or the new
   "-device arm-smmuv3" model.> 
   -Adds device tree bindings for the new SMMUv3 device on the arm/virt

machine only, and only for the default pcie.0 root complex.
(Note: pxb-pcie root complexes are currently not supported with the
 device tree due to known limitations[1].)

Please take a look and let me know your feedback.

Thanks,
Shameer
[0]:https://lore.kernel.org/qemu-devel/20250311141045.66620-1-shameerali.kolothum.th...@huawei.com/
[1]:https://lore.kernel.org/qemu-devel/20230421165037.2506-1-jonathan.came...@huawei.com/

Nicolin Chen (1):
   hw/arm/virt: Add an SMMU_IO_LEN macro

Shameer Kolothum (5):
   hw/arm/smmuv3: Add support to associate a PCIe RC
   hw/arm/virt-acpi-build: Update IORT for multiple smmuv3 devices
   hw/arm/virt: Factor out common SMMUV3 dt bindings code
   hw/arm/virt: Add support for smmuv3 device
   hw/arm/smmuv3: Enable smmuv3 device creation

  hw/arm/smmuv3.c  |  27 +++
  hw/arm/virt-acpi-build.c | 162 +++
  hw/arm/virt.c| 112 ---
  hw/core/sysbus-fdt.c |   3 +
  include/hw/arm/virt.h|   1 +
  5 files changed, 244 insertions(+), 61 deletions(-)






Re: [PATCH v2 4/6] hw/arm/virt: Add an SMMU_IO_LEN macro

2025-05-02 Thread Donald Dutile




On 5/2/25 6:27 AM, Shameer Kolothum wrote:

From: Nicolin Chen 

This is useful as the subsequent support for new SMMUv3 dev will also
use the same.

Signed-off-by: Nicolin Chen 
Signed-off-by: Shameer Kolothum 
---
  hw/arm/virt.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 464e84ae67..e178282d71 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -147,6 +147,9 @@ static void arm_virt_compat_set(MachineClass *mc)
  #define LEGACY_RAMLIMIT_GB 255
  #define LEGACY_RAMLIMIT_BYTES (LEGACY_RAMLIMIT_GB * GiB)
  
+/* MMIO region size for SMMUv3 */

+#define SMMU_IO_LEN 0x2
+
  /* Addresses and sizes of our components.
   * 0..128MB is space for a flash device so we can run bootrom code such as 
UEFI.
   * 128MB..256MB is used for miscellaneous device I/O.
@@ -178,7 +181,7 @@ static const MemMapEntry base_memmap[] = {
  [VIRT_FW_CFG] = { 0x0902, 0x0018 },
  [VIRT_GPIO] =   { 0x0903, 0x1000 },
  [VIRT_UART1] =  { 0x0904, 0x1000 },
-[VIRT_SMMU] =   { 0x0905, 0x0002 },
+[VIRT_SMMU] =   { 0x0905, SMMU_IO_LEN },
  [VIRT_PCDIMM_ACPI] ={ 0x0907, MEMORY_HOTPLUG_IO_LEN },
  [VIRT_ACPI_GED] =   { 0x0908, ACPI_GED_EVT_SEL_LEN },
  [VIRT_NVDIMM_ACPI] ={ 0x0909, NVDIMM_ACPI_IO_LEN},
@@ -1453,7 +1456,6 @@ static void create_smmu(const VirtMachineState *vms,
  int irq =  vms->irqmap[VIRT_SMMU];
  int i;
  hwaddr base = vms->memmap[VIRT_SMMU].base;
-hwaddr size = vms->memmap[VIRT_SMMU].size;
  DeviceState *dev;
  
  if (vms->iommu != VIRT_IOMMU_SMMUV3 || !vms->iommu_phandle) {

@@ -1473,7 +1475,7 @@ static void create_smmu(const VirtMachineState *vms,
  sysbus_connect_irq(SYS_BUS_DEVICE(dev), i,
 qdev_get_gpio_in(vms->gic, irq + i));
  }
-create_smmuv3_dt_bindings(vms, base, size, irq);
+create_smmuv3_dt_bindings(vms, base, SMMU_IO_LEN, irq);
  }
  
  static void create_virtio_iommu_dt_bindings(VirtMachineState *vms)


Reviewed-by: Donald Dutile 




Re: [PATCH v2 1/6] hw/arm/smmuv3: Add support to associate a PCIe RC

2025-05-02 Thread Donald Dutile




On 5/2/25 6:27 AM, Shameer Kolothum wrote:

Although this change does not affect functionality at present, it lays
the groundwork for enabling user-created SMMUv3 devices in
future patches

Signed-off-by: Shameer Kolothum 
---
  hw/arm/smmuv3.c | 26 ++
  hw/arm/virt.c   |  3 ++-
  2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index ab67972353..605de9b721 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -24,6 +24,7 @@
  #include "hw/qdev-properties.h"
  #include "hw/qdev-core.h"
  #include "hw/pci/pci.h"
+#include "hw/pci/pci_bridge.h"
  #include "cpu.h"
  #include "exec/target_page.h"
  #include "trace.h"
@@ -1874,6 +1875,25 @@ static void smmu_reset_exit(Object *obj, ResetType type)
  smmuv3_init_regs(s);
  }
  
+static int smmuv3_pcie_bus(Object *obj, void *opaque)

+{
+DeviceState *d = opaque;
+PCIBus *bus;
+
+if (!object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) {
+return 0;
+}
+
+bus = PCI_HOST_BRIDGE(obj)->bus;
+if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus->name)) {
+object_property_set_link(OBJECT(d), "primary-bus", OBJECT(bus),
+ &error_abort);
+/* Return non-zero as we got the bus and don't need further 
iteration.*/
+return 1;
+}
+return 0;
+}
+
  static void smmu_realize(DeviceState *d, Error **errp)
  {
  SMMUState *sys = ARM_SMMU(d);
@@ -1882,6 +1902,10 @@ static void smmu_realize(DeviceState *d, Error **errp)
  SysBusDevice *dev = SYS_BUS_DEVICE(d);
  Error *local_err = NULL;
  
+if (!object_property_get_link(OBJECT(d), "primary-bus", &error_abort)) {

+object_child_foreach_recursive(object_get_root(), smmuv3_pcie_bus, d);
+}
+
  c->parent_realize(d, &local_err);
  if (local_err) {
  error_propagate(errp, local_err);
@@ -1996,6 +2020,8 @@ static void smmuv3_class_init(ObjectClass *klass, const 
void *data)
  device_class_set_parent_realize(dc, smmu_realize,
  &c->parent_realize);
  device_class_set_props(dc, smmuv3_properties);
+dc->hotpluggable = false;
+dc->bus_type = TYPE_PCIE_BUS;

Does this force legacy SMMUv3 to be tied to a PCIe bus now?
If so, will that break some existing legacy smmuv3 configs, e.g., virtio-scsi
attached to a legacy smmuv3?


  }
  
  static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 177f3dd22c..3bae4e374f 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -56,6 +56,7 @@
  #include "qemu/cutils.h"
  #include "qemu/error-report.h"
  #include "qemu/module.h"
+#include "hw/pci/pci_bus.h"
  #include "hw/pci-host/gpex.h"
  #include "hw/virtio/virtio-pci.h"
  #include "hw/core/sysbus-fdt.h"
@@ -1442,7 +1443,7 @@ static void create_smmu(const VirtMachineState *vms,
  }
  object_property_set_link(OBJECT(dev), "primary-bus", OBJECT(bus),
   &error_abort);
-sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), &error_fatal);
+qdev_realize_and_unref(dev, &bus->qbus, &error_fatal);
  sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, base);
  for (i = 0; i < NUM_SMMU_IRQS; i++) {
  sysbus_connect_irq(SYS_BUS_DEVICE(dev), i,





Re: [PATCH v2 2/6] hw/arm/virt-acpi-build: Update IORT for multiple smmuv3 devices

2025-05-02 Thread Donald Dutile




On 5/2/25 1:13 PM, Nicolin Chen wrote:

On Fri, May 02, 2025 at 11:27:03AM +0100, Shameer Kolothum wrote:

@@ -43,6 +43,7 @@
  #include "hw/acpi/generic_event_device.h"
  #include "hw/acpi/tpm.h"
  #include "hw/acpi/hmat.h"
+#include "hw/arm/smmuv3.h"
  #include "hw/pci/pcie_host.h"
  #include "hw/pci/pci.h"
  #include "hw/pci/pci_bus.h"
@@ -266,6 +267,75 @@ static int iort_idmap_compare(gconstpointer a, 
gconstpointer b)
  return idmap_a->input_base - idmap_b->input_base;
  }
  
+struct SMMUv3Device {

+int irq;
+hwaddr base;
+GArray *smmu_idmaps;
+size_t offset;
+};
+typedef struct SMMUv3Device SMMUv3Device;


"SMMUv3Device" sounds too general, like coming from smmuv3.h :-/

Given this describes SMMUv3's IORT node, I still think we should
name it something like "IortSMMUv3Node" or so.


+1.. the more generic name had me thinking it was broader than IORT..
the IORT-related naming is an improvement.


+static int smmuv3_dev_idmap_compare(gconstpointer a, gconstpointer b)
+{
+SMMUv3Device *sdev_a = (SMMUv3Device *)a;
+SMMUv3Device *sdev_b = (SMMUv3Device *)b;
+AcpiIortIdMapping *map_a = &g_array_index(sdev_a->smmu_idmaps,
+  AcpiIortIdMapping, 0);
+AcpiIortIdMapping *map_b = &g_array_index(sdev_b->smmu_idmaps,
+  AcpiIortIdMapping, 0);
+return map_a->input_base - map_b->input_base;
+}
+
+static void
+get_smmuv3_legacy_dev(VirtMachineState *vms, GArray * smmuv3_devices)


GArray *smmuv3_devices

Or maybe align with the non-legacy path, i.e. "sdev_blob"? Or the
other way around.

Otherwise, lgtm
Reviewed-by: Nicolin Chen 






Re: [PATCH 0/5] Add support for user creatable SMMUv3 device

2025-04-28 Thread Donald Dutile




On 4/22/25 4:57 AM, Shameerali Kolothum Thodi wrote:



-Original Message-
From: Donald Dutile 
Sent: Friday, April 18, 2025 9:34 PM
To: Shameerali Kolothum Thodi
; qemu-...@nongnu.org;
qemu-devel@nongnu.org
Cc: eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com;
nicol...@nvidia.com; berra...@redhat.com; nath...@nvidia.com;
mo...@nvidia.com; smost...@google.com; Linuxarm
; Wangzhou (B) ;
jiangkunkun ; Jonathan Cameron
; zhangfei@linaro.org
Subject: Re: [PATCH 0/5] Add support for user creatable SMMUv3 device

Shameer,
Hi!

First off, like the partitioning of these pieces.

On 4/15/25 4:10 AM, Shameer Kolothum wrote:

Hi All,

This patch series introduces support for a user-creatable SMMUv3 device
(-device arm-smmuv3-dev) in QEMU.


Can we drop the '-dev', as 'arm-smmu' is sufficient, as is '-device'?

I know, internal to QEMU, there's already an ARM_SMMU, but as suggested
later,
if it uses the same struct, the qemu cmdline syntax is a bit less redundant.


Trust me I tried that😊. The problem is that the legacy one doesn't have a bus
associated with it and the moment we change that and have bus specified for the
  "-device arm-smmuv3, bus=pcie.0" the legacy smmuv3 creation in virt,

create_smmu() --> qdev_new(TYPE_ARM_SMMUV3)

will fail as it expects the bus type to be specified for it. I couldn't find a
way
to solve that.


I guessed that's why there was new syntax for what is basically an extension of 
previous syntax.
I'll dig more into it and see if there's a way to handle it.
qemu does handle varying added params (id=blah, no id=blah),
so I'm researching how that's done, to figure out how the legacy, smmu-for-all,
and the smmu-for-specific pcie can both be supported.
Given your stated trials and tribulations, this should be fun. ;-)
- Don




The implementation is based on feedback received from the RFCv2[0]:
"hw/arm/virt: Add support for user-creatable accelerated SMMUv3"

Currently, QEMU's SMMUv3 emulation (iommu=smmuv3) is tied to the

machine

and does not support instantiating multiple SMMUv3 devices—each

associated

with a separate PCIe root complex. In contrast, real-world ARM systems
often include multiple SMMUv3 instances, each bound to a different PCIe
root complex.

This also lays the groundwork for supporting accelerated SMMUv3, as
proposed in the aforementioned RFC. Please note, the

accelerated SMMUv3

support is not part of this series and will be sent out as a separate
series later on top of this one.

Summary of changes:

   -Introduces support for multiple -device arm-smmuv3-dev,bus=pcie.x
    instances.

    Example usage:

    -device arm-smmuv3-dev,bus=pcie.0
    -device virtio-net-pci,bus=pcie.0
    -device pxb-pcie,id=pcie.1,bus_nr=2
    -device arm-smmuv3-dev,bus=pcie.1
    -device pcie-root-port,id=pcie.port1,bus=pcie.1
    -device virtio-net-pci,bus=pcie.port1

    -Supports either the legacy iommu=smmuv3 option or the new
    "-device arm-smmuv3-dev" model.


nice! :)
Can it support both? i.e., some devices using a system-wide, legacy
smmuv3,
and some pcie devices using the -device arm-smmuv3 ?


Please see my reply to patch #2. In short No, it doesn't support both.

Also I think we should slowly deprecate the use of legacy SMMUv3 going
forward unless there is a strong use case/reason to support it.


If not, then add a check that at minimum warns about, or at most fails on, that config.
I can see old machines being converted/upgraded to new machines,
and they will want to switch from legacy->device smmuv3, and catching
an edit/update to a machine change (or enabling automated conversion)
would be helpful.


Please see Patch #4. It already checks for and prevents incompatible SMMUv3
types being specified. Or is the suggestion here to add something extra?
Please let me know.

Thanks,
Shameer






Re: [PATCH] pci-ids.rst: Add Red Hat pci-id for AMD IOMMU device

2025-03-04 Thread Donald Dutile

Hi Suravee!

Not your issue, but wondering if others know:

Why isn't this an issue for Intel-vtd-iommu & ARM-SMMUV3 ?
Are they instantiated as non-PCI-id (platform) devices, but AMD puts their 
IOMMU in PCI space?

Thanks in advance for the info.

- Don


On 3/4/25 1:37 PM, Suravee Suthikulpanit wrote:

The QEMU-emulated AMD IOMMU PCI device is implemented based on the AMD I/O
Virtualization Technology (IOMMU) Specification [1]. The PCI id for this
device is platform-specific.

Currently, the QEMU-emulated AMD IOMMU device is using AMD vendor id and
undefined device id.

Therefore, change the vendor id to Red Hat and request a new QEMU-specific
device id.

[1] 
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/specifications/48882_IOMMU.pdf

Cc: Gerd Hoffmann 
Signed-off-by: Suravee Suthikulpanit 
---
  docs/specs/pci-ids.rst | 2 ++
  hw/i386/amd_iommu.c| 3 ++-
  include/hw/pci/pci.h   | 1 +
  3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/specs/pci-ids.rst b/docs/specs/pci-ids.rst
index 261b0f359f..2416a70a2d 100644
--- a/docs/specs/pci-ids.rst
+++ b/docs/specs/pci-ids.rst
@@ -100,6 +100,8 @@ PCI devices (other than virtio):
PCI UFS device (``-device ufs``)
  1b36:0014
PCI RISC-V IOMMU device
+1b36:0015
+  PCI AMD IOMMU device (``-device amd-iommu``)
  
  All these devices are documented in :doc:`index`.
  
diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c

index dda1a5781f..4d8564249c 100644
--- a/hw/i386/amd_iommu.c
+++ b/hw/i386/amd_iommu.c
@@ -1766,7 +1766,8 @@ static void amdvi_pci_class_init(ObjectClass *klass, void 
*data)
  DeviceClass *dc = DEVICE_CLASS(klass);
  PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
  
-k->vendor_id = PCI_VENDOR_ID_AMD;

+k->vendor_id = PCI_VENDOR_ID_REDHAT;
+k->device_id = PCI_DEVICE_ID_REDHAT_AMD_IOMMU;
  k->class_id = 0x0806;
  k->realize = amdvi_pci_realize;
  
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h

index 4002bbeebd..da44e6673d 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -117,6 +117,7 @@ extern bool pci_available;
  #define PCI_DEVICE_ID_REDHAT_ACPI_ERST   0x0012
  #define PCI_DEVICE_ID_REDHAT_UFS 0x0013
  #define PCI_DEVICE_ID_REDHAT_RISCV_IOMMU 0x0014
+#define PCI_DEVICE_ID_REDHAT_AMD_IOMMU   0x0015
  #define PCI_DEVICE_ID_REDHAT_QXL 0x0100
  
  #define FMT_PCIBUS  PRIx64





Re: [PATCH] pci-ids.rst: Add Red Hat pci-id for AMD IOMMU device

2025-03-05 Thread Donald Dutile




On 3/4/25 9:56 PM, Suthikulpanit, Suravee wrote:



On 3/5/2025 6:02 AM, Donald Dutile wrote:

Hi Suravee!

Not your issue, but wondering if others know:

Why isn't this an issue for Intel-vtd-iommu & ARM-SMMUV3 ?
Are they instantiated as non-PCI-id (platform) devices, but AMD puts their 
IOMMU in PCI space?

Adv. thanks for the info.


Unlike AMD IOMMU, Intel VT-d IOMMU and ARM SMMUV3 are not enumerated as a PCI 
device in the system.

Thanks,
Suravee


ok. thanks for confirmation.
--dd




Re: [RFC PATCH v2 03/20] hw/arm/smmuv3-accel: Add initial infrastructure for smmuv3-accel device

2025-04-04 Thread Donald Dutile




On 3/19/25 1:00 PM, Eric Auger wrote:

Hi,


On 3/19/25 1:23 AM, Jason Gunthorpe wrote:

On Tue, Mar 18, 2025 at 05:22:51PM -0400, Donald Dutile wrote:


I agree with Eric that 'accel' isn't needed -- this should be
ascertained from the pSMMU that a physical device is attached to.

I seem to remember the point was made that we don't actually know if
accel is possible, or desired, especially in the case of hotplug.

that's why I think it would be better if we could instantiate a single
type of device that can do both accel and non accel mode.
Maybe that would be at the price of always enforcing MSI resv regions on
guest to assure MSI nesting is possible.



The accelerated mode has a number of limitations that the software
mode does not have. I think it does make sense that the user would
deliberately choose to use a more restrictive operating mode and then
would have to meet the requirements - eg by creating the required
number and configuration of vSMMUs.

To avoid any misunderstanding I am not pushing for having a single vSMMU
instance. I advocate for having several instances, each somehow
specialized for VFIO devices or emulated devices. Maybe we can opt-in
with accel=on but the default could be auto (the property can be
AUTO_ON_OFF) where the code detects if a VFIO device is translated. In
case incompatible devices are translated into the same vSMMU instance I
guess it could be detected and it will fail.
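
A minimal sketch of that auto mode (illustrative only; none of these names
come from the patches, and the real property would be a qdev OnOffAuto
property) could be:

    #include <stdbool.h>

    /* Resolve an on/off/auto "accel" setting once we know whether a host
     * (VFIO) device is attached; purely a sketch of the idea above. */
    typedef enum { ACCEL_AUTO, ACCEL_ON, ACCEL_OFF } AccelMode;

    static bool resolve_accel(AccelMode requested, bool vfio_dev_attached)
    {
        if (requested == ACCEL_AUTO) {
            /* turn the accelerated path on only when a VFIO device needs it */
            return vfio_dev_attached;
        }
        return requested == ACCEL_ON;
    }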

What I am pushing for is to have a single type of QEMU device which can
do both accel and non accel.

+1 !


In general I advocate for having several vSMMU instances, each of them


Now... how does vfio(?; why not qemu?) layer determine that? --
where are SMMUv3 'accel' features exposed either: a) in the device
struct (for the smmuv3) or (b) somewhere under sysfs? ... I couldn't
find anything under either on my g-h system, but would appreciate a
ptr if there is.

I think it is not discoverable yet other than through
try-and-fail. Discoverability would probably be some bits in an
iommufd GET_INFO ioctl or something like that.

yeah but at least we can easily detect if a VFIO device is being
translated by a vSMMU instance in which case there is no other choice to
turn accel on.

Thanks

Eric



and like Eric, although 'accel' is better than the
original 'nested', it's non-obvious what accel feature(s) are being
turned on, or not.

There is really only one accel feature - direct HW usage of the IO
Page table in the guest (no shadowing).

A secondary addon would be direct HW usage of an invalidation queue in
the guest.


a kernel boot-param will be needed; if in sysfs, a write of 1/0 to
enable (disable) it may be an alternative as well.  Bottom line: we
need a way to (a) ascertain the accel feature and (b) disable
it when it is broken, so qemu's smmuv3 spec will 'just work'.

You'd turn it off by not asking qemu to use it; that is sort of the
reasoning behind the command-line opt-in for accel or not.

Jason
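
i.e., under the command-line opt-in being described, the difference would be something like
the following -- option names as they appear later in this discussion, shown only as a sketch:

    -device arm-smmuv3,primary-bus=pcie.0               # emulated SMMUv3, no host involvement
    -device arm-smmuv3,primary-bus=pcie.0,accel=on      # requires a host SMMUv3 capable of nesting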








Re: [PATCH v2 1/6] hw/arm/smmuv3: Add support to associate a PCIe RC

2025-05-08 Thread Donald Dutile




On 5/7/25 4:50 AM, Shameerali Kolothum Thodi wrote:




-Original Message-
From: Markus Armbruster 
Sent: Wednesday, May 7, 2025 8:17 AM
To: Donald Dutile 
Cc: Shameer Kolothum via ; qemu-
a...@nongnu.org; Shameerali Kolothum Thodi
; eric.au...@redhat.com;
peter.mayd...@linaro.org; j...@nvidia.com; nicol...@nvidia.com;
berra...@redhat.com; nath...@nvidia.com; mo...@nvidia.com;
smost...@google.com; Linuxarm ; Wangzhou (B)
; jiangkunkun ;
Jonathan Cameron ;
zhangfei@linaro.org
Subject: Re: [PATCH v2 1/6] hw/arm/smmuv3: Add support to associate a
PCIe RC

Donald Dutile  writes:

[...]


In this series, an iommu/smmu needs to be placed -BETWEEN- a sysbus

and a PCIe-tree,

or step-wise, plug an smmuv3 into a sysbus, and a pcie tree/domain/RC

into an SMMUv3.

RC = root complex?


Yes.


+1.




So, an smmu needs to be associated with a bus (tree), i.e., pcie.0, pcie.1...
One could model it as a PCIe device, attached at the pcie-RC ... but that's

not how it's modelled in ARM hw.

Physical ARM hardware?


yes, physical hw.


Assuming the virtual devices and buses we're discussing model physical
devices and buses:

* What are the physical devices of interest?

* How are they wired together?  Which of the wires are buses, in
   particular PCI buses?


SMMUv3 is a platform device and its placement in a system is typically as below
for PCI devices,

+--+
|   PCIe Devices   |
+--+
  |
  v
   +-+  +---+
   |  PCIe RC A  |<>| Interconnect  |
   +-+  +---+
|
|
 +--v---+
 | SMMUv3.A |
 | (IOMMU)  |
 ++-+
  |
  v
  +---++
  |   System RAM   |
  ++
  
This patch is attempting to establish that association between the PCIe RC and

the SMMUv3 device so that Qemu can build the ACPI tables/DT iommu mappings
for the SMMUv3 device.


I would refer to the ARM SMMU spec, Figure 2.3 in the G.a version, where
it's slightly different; more like:

 +--+
 |   PCIe Devices   | (one device, unless a PCIe switch is btwn the RC & 
'Devices';
 +--+   or, see more typical expansion below)
   |
+-+
|  PCIe RC A  |
+-+
   |
+--v---++---+
| SMMUv3.A || Wide assortment of other platform |
| (IOMMU)  ||   devices not using SMMU  |
++-++---+
 |  |   |   |
  +--+--+---+---+-+
  | System Interconnect   |
  +---+
   |
  +---++ +-+-+
  |   System RAM   |<--->| CPU (NUMA socket) |
  ++ +---+

In fact, the PCIe can be quite complex with PCIe bridges, and multiple Root 
Ports (RP's),
and multiple SMMU's:

+--+   +--+   +--+
| PCIe Device  |   | PCIe Device  |   | PCIe Device  |
+--+   +--+   +--+
  |  |  |<--- PCIe bus
 +--+   +--+  +--+
 | PCIe RP  |   | PCIe RP  |  | PCIe RP  |  <- may be PCI Bridge, 
may not
 +--+   +--+  +--+
 |  |  |
 +--+   +--+   +--+
 |  SMMU|   |  SMMU|   |  SMMU|
 +--+   +--+   +--+
 |  |  |   <- may be a bus, may 
not(hidden from OS)
 +--+--+
|
+--+
|  PCI RC A|
+--+
 
where PCIe RP's could be represented (even virtually) in -hw-

as a PCIe bridge, each downstream being a different PCIe bus under
a single PCIe RC (A, in above pic) -domain-.
... or the RPs don't have to have a PCIe bridge, and look like
'just an RP' that provides a PCIe (pt-to-pt, serial) bus, provided
by a PCIe RC. ... the PCIe architecture allows both, and I've seen
both implementations in hw (at least from an lspci perspective).

You can see the above hw implementation by doing an lspci on most
PCIe systems (definitely common on x86), where the RP's are represented
by 'PCIe bridge' elements -- and lots of them.
In real hw, these RP's effectively become (multiple) up 

Re: [PATCH v2 1/6] hw/arm/smmuv3: Add support to associate a PCIe RC

2025-05-06 Thread Donald Dutile




On 5/6/25 7:47 AM, Markus Armbruster wrote:

Shameer Kolothum via  writes:


Although this change does not affect functionality at present, it lays
the groundwork for enabling user-created SMMUv3 devices in
future patches

Signed-off-by: Shameer Kolothum 
---
  hw/arm/smmuv3.c | 26 ++
  hw/arm/virt.c   |  3 ++-
  2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index ab67972353..605de9b721 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -24,6 +24,7 @@
  #include "hw/qdev-properties.h"
  #include "hw/qdev-core.h"
  #include "hw/pci/pci.h"
+#include "hw/pci/pci_bridge.h"
  #include "cpu.h"
  #include "exec/target_page.h"
  #include "trace.h"
@@ -1874,6 +1875,25 @@ static void smmu_reset_exit(Object *obj, ResetType type)
  smmuv3_init_regs(s);
  }
  
+static int smmuv3_pcie_bus(Object *obj, void *opaque)

+{
+DeviceState *d = opaque;
+PCIBus *bus;
+
+if (!object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) {
+return 0;
+}
+
+bus = PCI_HOST_BRIDGE(obj)->bus;
+if (d->parent_bus && !strcmp(bus->qbus.name, d->parent_bus->name)) {
+object_property_set_link(OBJECT(d), "primary-bus", OBJECT(bus),
+ &error_abort);
+/* Return non-zero as we got the bus and don't need further 
iteration.*/
+return 1;
+}
+return 0;
+}
+
  static void smmu_realize(DeviceState *d, Error **errp)
  {
  SMMUState *sys = ARM_SMMU(d);
@@ -1882,6 +1902,10 @@ static void smmu_realize(DeviceState *d, Error **errp)
  SysBusDevice *dev = SYS_BUS_DEVICE(d);
  Error *local_err = NULL;
  
+if (!object_property_get_link(OBJECT(d), "primary-bus", &error_abort)) {

+object_child_foreach_recursive(object_get_root(), smmuv3_pcie_bus, d);
+}
+
  c->parent_realize(d, &local_err);
  if (local_err) {
  error_propagate(errp, local_err);
@@ -1996,6 +2020,8 @@ static void smmuv3_class_init(ObjectClass *klass, const 
void *data)
  device_class_set_parent_realize(dc, smmu_realize,
  &c->parent_realize);
  device_class_set_props(dc, smmuv3_properties);
+dc->hotpluggable = false;
+dc->bus_type = TYPE_PCIE_BUS;


This is very, very wrong.

The function serves as .class_init() for QOM type "arm-smmuv3", defined
as:

static const TypeInfo smmuv3_type_info = {
.name  = TYPE_ARM_SMMUV3,
.parent= TYPE_ARM_SMMU,

Subtype of "arm-smmuv3".

.instance_size = sizeof(SMMUv3State),
.instance_init = smmuv3_instance_init,
.class_size= sizeof(SMMUv3Class),
.class_init= smmuv3_class_init,
};


 static const TypeInfo smmu_base_info = {
 .name  = TYPE_ARM_SMMU,
 .parent= TYPE_SYS_BUS_DEVICE,

Subtype of "sys-bus-device".

 .instance_size = sizeof(SMMUState),
 .class_data= NULL,
 .class_size= sizeof(SMMUBaseClass),
 .class_init= smmu_base_class_init,
 .abstract  = true,
 };

Have a look at the instance struct:

struct SMMUv3State {
SMMUState smmu_state;

Starts with the supertype's instance struct, as is proper.

uint32_t features;
[more ...]
};

Here's the supertype's instance struct:

struct SMMUState {
/*  */
SysBusDevice  dev;

Again, starts with the supertype's instance struct.

const char *mrtypename;
[more...]
};

This is a sysbus device, not a PCI device.  Monkey-patching dc->bus_type
from TYPE_SYSTEM_BUS to TYPE_PCIE_BUS won't change that.  All it
accomplishes is making the qdev core believe it plugs into a PCIE bus.
This can only end in tears.

In fact, when I build with the entire series applied (so the device can
actually be used with -device), the result dies within seconds of my ad
hoc testing:

 $ qemu-system-aarch64 -nodefaults -S -display none -monitor stdio -M virt 
-device pxb-pcie,id=pcie.1,bus_nr=2 -device arm-smmuv3,bus=pcie.1
 QEMU 10.0.50 monitor - type 'help' for more information
 qemu-system-aarch64: -device arm-smmuv3,bus=pcie.1: warning: SMMUv3 device 
only supported with pcie.0 for DT
 (qemu) info qtree
 bus: main-system-bus
   type System
   dev: pxb-host, id ""
 x-config-reg-migration-enabled = true
 bypass-iommu = false
 bus: pcie.1
   type pxb-pcie-bus
   dev: arm-smmuv3, id ""
 gpio-out "sysbus-irq" 4
 stage = "nested"
 bus_num = 0 (0x0)
 Segmentation fault (core dumped)

Backtrace:

 #0  0x557d8521 in pcibus_dev_print
 (mon=0x5826d0e0, dev=0x590ad360, indent=8)
 at ../hw/pci/pci-hmp-cmds.c:140
 #1  0x55eac0a0 in bus_print_dev
 (bus=, mon=, dev=0x590ad360, 
indent=8)
 at ../system/qdev-monitor.c:773
 #2  qdev_print (mon=, dev=, indent=8)
 at ../system/q

Re: [PATCH v2 1/6] hw/arm/smmuv3: Add support to associate a PCIe RC

2025-05-16 Thread Donald Dutile




On 5/8/25 9:57 AM, Peter Maydell wrote:

On Thu, 8 May 2025 at 14:46, Donald Dutile  wrote:


I would refer to the ARM SMMU spec, Figure 2.3 in the G.a version, where
it's slightly different; more like:

   +--+
   |   PCIe Devices   | (one device, unless a PCIe switch is btwn the RC & 
'Devices';
   +--+   or, see more typical expansion below)
 |
  +-+
  |  PCIe RC A  |
  +-+
 |
  +--v---++---+
  | SMMUv3.A || Wide assortment of other platform |
  | (IOMMU)  ||   devices not using SMMU  |
  ++-++---+
   |  |   |   |
+--+--+---+---+-+
| System Interconnect   |
+---+
 |
+---++ +-+-+
|   System RAM   |<--->| CPU (NUMA socket) |
++ +---+

In fact, the PCIe can be quite complex with PCIe bridges, and multiple Root 
Ports (RP's),
and multiple SMMU's:

  +--+   +--+   +--+
  | PCIe Device  |   | PCIe Device  |   | PCIe Device  |
  +--+   +--+   +--+
|  |  |<--- PCIe bus
   +--+   +--+  +--+
   | PCIe RP  |   | PCIe RP  |  | PCIe RP  |  <- may be PCI Bridge, 
may not
   +--+   +--+  +--+
   |  |  |
   +--+   +--+   +--+
   |  SMMU|   |  SMMU|   |  SMMU|
   +--+   +--+   +--+
   |  |  |   <- may be a bus, may 
not(hidden from OS)
   +--+--+
  |
  +--+
  |  PCI RC A|
  +--+





The final take away: the (QEMU) SMMU/IOMMU must be associated with a PCIe bus
OR, the format has to be something like:
-device smmuv3, id=smmuv3.1
-device <device>, smmu=smmuv3.1
where the device <-> SMMU (or if extended to x86, iommu) associativity is set 
w/o bus associativity.


The problem here seems to me to be that in the hardware we're
modelling the SMMU always exists, because it's in the SoC,
but you're trying to arrange for it to be created on the
command line, via -device.

We don't have any of these problems with the current 'virt'
board code, because we have the board code create the iommu
(if the user asks for it via the iommu machine property),
and it can wire it up to the PCI root complex as needed.

thanks
-- PMM


Peter,
Hey!
Learning more about the use-of/focus/expectation of '-device', I see your point.
The issue is that there isn't just one SMMU in a system (as the diagram above illustrates).
SMMUs have associativity to (pcie) devices (more accurately, to a pcie (sub-)tree, for the focus
of this discussion; it could also be non-pcie devices), so we need an option format to associate
an smmu with pcie (domain/subtree (RPs), devices), and that typically references busses (pxbs?)
&/or pcie devices (-device). ... or do we only need to add iommu associativity to devices,
i.e., the second option line above, -device <device>,iommu=smmuv3.1 ... which should enable
the right IORT(s) to be made for multiple smmus/iommus.
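
Spelled out, the two option formats being weighed would look roughly like this -- ids, the
host BDF, and the 'iommu=' property are only for illustration, not existing options:

    # bus associativity (what this series does):
    -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0
    -device arm-smmuv3,id=smmuv3.1,primary-bus=pcie.1
    -device pcie-root-port,id=rp1,bus=pcie.1,chassis=2
    -device vfio-pci,host=0000:75:00.0,bus=rp1

    # device associativity (the alternative sketched above):
    -device arm-smmuv3,id=smmuv3.1
    -device vfio-pci,host=0000:75:00.0,iommu=smmuv3.1    # hypothetical 'iommu=' property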

- Don





Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device

2025-05-20 Thread Donald Dutile




On 5/20/25 5:13 AM, Duan, Zhenzhong wrote:




-Original Message-
From: Donald Dutile 
Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
passthrough device

Hey Zhenzhong,
Thanks for feedback. replies below.
- Don

On 5/19/25 4:37 AM, Duan, Zhenzhong wrote:

Hi Donald,


-Original Message-
From: Donald Dutile 
Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
passthrough device

Zhenzhong,

Hi!
Eric asked me to review this series.
Since it's been rather a while since you posted, I will summarize review feedback
below/bottom.

- Don

On 2/19/25 3:22 AM, Zhenzhong Duan wrote:

...


Did you ever put some tracing in to capture avg hits in cache? ... if so, add as a comment.
Otherwise, looks good.

Patch 11: Apologies, I don't know what 'flts' stands for, and why it is relative to 2-stage
mapping, or SIOV.  Could you add verbiage to explain the use of it, as the rest of this patch
doesn't make any sense to me without the background.
The patch introduces hw-info-type (none or intel), and then goes on to add a large set of
checks; seems like the caps & this checking should go together (split for each cap; put all
caps together & the check...).


OK, will do. There are some explanations in the cover letter.
For historical reasons, the old VT-d spec defined stage-1 as 'first level', then switched to
'first stage'.



So 'flts' is 'first level then switch' .


Sorry for confusion, it stands for 'first level translation support'.


Thanks.





Patch 12: Why isn't HostIOMMUDevice extended to have another iommu-specific
element, opaque in HostIOMMUDevice, but set to the specific IOMMU in use?
e.g.
    void *hostiommustate;


Yes, that's possible, but we want to make a generic interface between VFIO/VDPA and vIOMMU.



ok. I don't understand how VFIO & VDPA complicate that add.


IIUC, the hostiommustate provided by VFIO and VDPA may be in different formats.
By using a general interface like .get_cap(), we hide the resolving under the VFIO and
VDPA backends. This is like the KVM extension checking between QEMU and KVM.

FYI, there was some discussion on the interface before,
see https://lists.gnu.org/archive/html/qemu-devel/2024-04/msg02658.html
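
Roughly the shape being described -- a sketch, not the exact series code, and the capability
name at the end is hypothetical:

    /* vIOMMU asks a capability question; the backend (VFIO or VDPA) answers
     * from whatever format its host-provided data happens to be in. */
    struct HostIOMMUDeviceClass {
        ObjectClass parent_class;
        int (*realize)(HostIOMMUDevice *hiod, void *opaque, Error **errp);
        int (*get_cap)(HostIOMMUDevice *hiod, int cap, Error **errp);
    };

    if (hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_NESTING /* hypothetical */, &local_err) > 0) {
        /* vIOMMU may set up a nested (stage-1) configuration */
    }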


Good analogy, thanks. I'll reach out to Cedric on the above discussion as well.





Patch 13: Isn't PASID just an extension/addition of BDF id? and doesn't each
PASID have its own address space?


Yes, it is.


So, why isn't it handled via a unique AS cache like 'any other device'?  Maybe I'm
thinking too much in SMMU-StreamID terms, which can be of varying length, depending on
subsystem support.  I see what appears to be sid+pasid calls to do the AS lookups;
hmm, so maybe this is the generalized BDF+pasid AS lookup?  If so, maybe a
better description stating this transition to a wider stream-id would set the code
context better.


Not quite get..


I'm looking for a better description that states the AS cache lookup is broadened
from bdf to bdf+pasid.


Guess you mean vtd_as_from_iommu_pasid(), it's a variant of vtd_find_add_as().
We support AS cache lookup by bdf+pasid for a long time, see vtd_find_add_as().
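
Conceptually, the widening is just the lookup key growing from bdf to bdf+pasid -- a generic
sketch, not the intel_iommu code itself; the names here are made up:

    #include <glib.h>
    #include <stdint.h>

    struct as_key {
        uint8_t  bus;
        uint8_t  devfn;
        uint32_t pasid;    /* a 'no pasid' sentinel value when none is carried */
    };

    static guint as_key_hash(gconstpointer v)
    {
        const struct as_key *k = v;
        return ((guint)k->bus << 8 | k->devfn) ^ k->pasid;
    }

    static gboolean as_key_equal(gconstpointer a, gconstpointer b)
    {
        const struct as_key *ka = a, *kb = b;
        return ka->bus == kb->bus && ka->devfn == kb->devfn && ka->pasid == kb->pasid;
    }

    /* one AddressSpace per key: g_hash_table_new(as_key_hash, as_key_equal) */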


Thanks for clarification.




As for the rest of the (400 intel-iommu) code, I'm not that in-depth in intel-iommu
to determine if it's all correct or not.

Patch 14: Define PGTT; the second paragraph seems self-contradicting -- it says it
uses a 2-stage page table in each case, but it implies it should be different.  At
580 lines of code changes, you win! ;-)
lines of code changes, you win! ;-)


Whether the host side uses a nested or only a stage-2 page table depends on the PGTT
setting in the guest.
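
i.e., roughly -- terminology per this thread, not a quote of the spec:

    guest PGTT = first-stage               -> host installs nested: guest stage-1 over host stage-2
    guest PGTT = second-stage/pass-through -> host installs stage-2 only (unmanaged domain)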



Thanks for clarification.



Patch 15: Read-only and Read/write areas have different IOMMUFDs?  Is that an
intel-iommu requirement?
At least this intel-iommu-errata code is only in hw/i386/<> modules.


No, if ERRATA_772415, read-only areas should not be mapped, so we allocate a
new VTDIOASContainer to hold only the read/write areas' mappings.

We can use the same IOMMUFD for different VTDIOASContainers.


ah yes; I got hung-up on different mappings, and didn't back up to AS-container
split & same IOMMUFD.



Patch 16: Looks reasonable.  What does the 'SI' mean after "CACHE_DEV",
"CACHE_DOM" & "CACHE_PASID" ? -- stream-invalidation?


VTD_PASID_CACHE_DEVSI stands for 'pasid cache device selective invalidation',
VTD_PASID_CACHE_DOMSI means 'pasid cache domain selective invalidation'.


That explanation helps. :)  maybe put a short blurb in the commit log, or code,
so one doesn't have to be a ninja-VTD spec consumer to comprehend those
(important) diffs.


Good idea, will do.




Thanks
Zhenzhong


Again, thanks for the reply.
Looking fwd to the rfcv3 (on list) or move to v1-POST.


Thanks for your comments!

BRs,
Zhenzhong


Thanks for the added clarifications.
- Don




Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device

2025-05-19 Thread Donald Dutile

Hey Zhenzhong,
Thanks for feedback. replies below.
- Don

On 5/19/25 4:37 AM, Duan, Zhenzhong wrote:

Hi Donald,


-Original Message-
From: Donald Dutile 
Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
passthrough device

Zhenzhong,

Hi!
Eric asked me to review this series.
Since it's been rather a while since you posted, I will summarize review feedback
below/bottom.

- Don

On 2/19/25 3:22 AM, Zhenzhong Duan wrote:

Hi,

Per Jason Wang's suggestion, iommufd nesting series[1] is split into
"Enable stage-1 translation for emulated device" series and
"Enable stage-1 translation for passthrough device" series.

This series is 2nd part focusing on passthrough device. We don't do
shadowing of guest page table for passthrough device but pass stage-1
page table to host side to construct a nested domain. There was some
effort to enable this feature in old days, see [2] for details.

The key design is to utilize the dual-stage IOMMU translation
(also known as IOMMU nested translation) capability in host IOMMU.
As the below diagram shows, guest I/O page table pointer in GPA
(guest physical address) is passed to host and be used to perform
the stage-1 address translation. Along with it, modifications to
present mappings in the guest I/O page table should be followed
with an IOTLB invalidation.

  .-.  .---.
  |   vIOMMU|  | Guest I/O page table  |
  | |  '---'
  ./
  | PASID Entry |--- PASID cache flush --+
  '-'|
  | |V
  | |   I/O page table pointer in GPA
  '-'
  Guest
  --| Shadow |---|
vv   v
  Host
  .-.  ..
  |   pIOMMU|  |  FS for GIOVA->GPA |
  | |  ''
  ./  |
  | PASID Entry | V (Nested xlate)
  '\.--.
  | |   | SS for GPA->HPA, unmanaged domain|
  | |   '--'
  '-'
Where:
   - FS = First stage page tables
   - SS = Second stage page tables



I'd prefer the use of 's1' for stage1/First stage, and 's2' for stage2/second 
stage.
We don't need different terms for the same technology in the iommu/iommufd
space(s).


OK, then I'd like to use stage1 and stage2 everywhere which is more verbose.


your choice; in other kernel & qemu code I've seen, 's1' and 's2' are used quite frequently;
that's why I recommended it -- call it plagiarizing! ;-)




There are some interactions between VFIO and vIOMMU
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
subsystem. VFIO calls them to register/unregister HostIOMMUDevice
instance to vIOMMU at vfio device realize stage.
* vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
to bind/unbind device to IOMMUFD backed domains, either nested
domain or not.

See below diagram:

  VFIO Device Intel IOMMU
  .-. .---.
  | | |   |
  |   .-|PCIIOMMUOps  |.-.|
  |   | IOMMUFD |(set_iommu_device)   || Host IOMMU  ||
  |   | Device  |>|| Device list ||
  |   .-|(unset_iommu_device) |.-.|
  | | |   |   |
  | | |   V   |
  |   .-|  HostIOMMUDeviceIOMMUFD |  .-.  |
  |   | IOMMUFD |(attach_hwpt)|  | Host IOMMU  |  |
  |   | link|<|  |   Device|  |
  |   .-|(detach_hwpt)|  .-.  |
  | | |   |   |
  | | |   ... |
  .-. .---.

Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
whenever possible and create new one on demand, also supports multiple
iommufd objects and ERRATA_772415.

E.g., Stage-2 page table could be shared by different devices if there
is no conflict and devices link to same iommufd object, i.e. devices
under same host IOMMU can share same stage-2 page table. If there is

and

Re: [PATCH] hw/arm/virt: Check bypass iommu is not set for iommu-map DT property

2025-06-02 Thread Donald Dutile




On 6/2/25 7:46 AM, Shameer Kolothum wrote:

default_bus_bypass_iommu tells us whether the bypass_iommu is set
for the default PCIe root bus. Make sure we check that before adding
the "iommu-map" DT property.

Fixes: 6d7a85483a06 ("hw/arm/virt: Add default_bus_bypass_iommu machine option")
Suggested-by: Eric Auger 
Signed-off-by: Shameer Kolothum 
---
  hw/arm/virt.c | 15 ++-
  1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 9a6cd085a3..99fde5836c 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1487,9 +1487,12 @@ static void 
create_virtio_iommu_dt_bindings(VirtMachineState *vms)
  qemu_fdt_setprop_cell(ms->fdt, node, "phandle", vms->iommu_phandle);
  g_free(node);
  
-qemu_fdt_setprop_cells(ms->fdt, vms->pciehb_nodename, "iommu-map",

-   0x0, vms->iommu_phandle, 0x0, bdf,
-   bdf + 1, vms->iommu_phandle, bdf + 1, 0x - bdf);
+if (!vms->default_bus_bypass_iommu) {
+qemu_fdt_setprop_cells(ms->fdt, vms->pciehb_nodename, "iommu-map",
+   0x0, vms->iommu_phandle, 0x0, bdf,
+   bdf + 1, vms->iommu_phandle, bdf + 1,
+   0x - bdf);
+}
  }
  
  static void create_pcie(VirtMachineState *vms)

@@ -1612,8 +1615,10 @@ static void create_pcie(VirtMachineState *vms)
  switch (vms->iommu) {
  case VIRT_IOMMU_SMMUV3:
  create_smmu(vms, vms->bus);
-qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map",
-   0x0, vms->iommu_phandle, 0x0, 0x1);
+if (!vms->default_bus_bypass_iommu) {
+qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map",
+   0x0, vms->iommu_phandle, 0x0, 0x1);
+}
  break;
  default:
  g_assert_not_reached();


Reviewed-by: Donald Dutile 
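
For the record, the configuration this guards against is along the lines of:

    -machine virt,iommu=smmuv3,default_bus_bypass_iommu=on

with bypass enabled for the default bus, pcie.0 must not get an "iommu-map" DT property.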




Re: [PATCH v4 7/7] qemu-options.hx: Document the arm-smmuv3 device

2025-06-16 Thread Donald Dutile




On 6/16/25 7:12 AM, Jonathan Cameron wrote:

On Fri, 13 Jun 2025 15:44:49 +0100
Shameer Kolothum  wrote:


Now that arm,virt can have user-creatable smmuv3 devices, document it.

Signed-off-by: Shameer Kolothum 
---
  qemu-options.hx | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/qemu-options.hx b/qemu-options.hx
index 7eb8e02b4b..3edbde45bb 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1226,6 +1226,12 @@ SRST
  ``aw-bits=val`` (val between 32 and 64, default depends on machine)
  This decides the address width of the IOVA address space.
  
+``-device arm-smmuv3,primary-bus=id``

+This is only supported by ``-machine virt`` (ARM).
+
+``primary-bus=id``
+The PCIe Root Complex to be associated with.


Hmm.  Root complex or host bridge?
I think an RC is allowed to have multiple hierarchies and hence multiple
host bridges. Figure 1.2 in the PCI spec. So my gut feeling is this
should be host bridge.


+1.
the key word-hint: 'complex' -- a RP (a Root *Port*) can only host a single 
PCI(e) (sub-)tree,
but a RC can have multiple PCI domains, not to mention a bunch of 
platform-level,
acpi-defined PCI(e) devices.





+
  ERST
  
  DEF("name", HAS_ARG, QEMU_OPTION_name,







Re: [PATCH v4 1/7] hw/arm/smmu-common: Check SMMU has PCIe Root Complex association

2025-06-17 Thread Donald Dutile




On 6/17/25 12:52 PM, Jonathan Cameron wrote:

On Tue, 17 Jun 2025 09:49:54 +0200
Eric Auger  wrote:


On 6/16/25 12:20 PM, Jonathan Cameron wrote:

On Fri, 13 Jun 2025 15:44:43 +0100
Shameer Kolothum  wrote:
  

Although this change does not affect functionality at present, it is

Patch title says PCIe.  This check is vs PCI host bridge.

No idea which one you wanted, but if it is PCIe needs to be
TYPC_PCIE_HOST_BRIDGE from pcie_host.h not the pci_host.h one
I think.

I think we need TYPE_PCI_HOST_BRIDGE as we want to check against pxb

pci-bridge/pci_expander_bridge.c:    .parent    = TYPE_PCI_HOST_BRIDGE,


Hmm. That's awkward and I'd forgotten that wrinkle.
Need a stronger test but which one?  The PXB root bus has a parent of
TYPE_PCIE_BUS.  Maybe we can check that?


Yeah, rooted in the problem that we should have pci-switches or pci-rp's,
not just 'pxb'.
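
Something along the lines of the stronger test being floated -- a sketch only; the helpers all
exist today but this particular combination is hypothetical:

    PCIBus *bus = s->primary_bus;

    if (!bus || !pci_bus_is_express(bus) || !pci_bus_is_root(bus) ||
        !object_dynamic_cast(OBJECT(bus)->parent, TYPE_PCI_HOST_BRIDGE)) {
        error_setg(errp, "SMMU is not attached to a PCIe root bus");
        return;
    }
    pci_setup_iommu(bus, &smmu_ops, s);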






Besides the commit title/desc suggested by Jonathan you may also refine
the error msg which can stem now from 2 different causes

Thanks

Eric
  

required when we add support for user-creatable SMMUv3 devices in
future patches.

Tested-by: Nathan Chen 
Signed-off-by: Shameer Kolothum 
---
  hw/arm/smmu-common.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index f39b99e526..7890aa12c1 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -20,6 +20,7 @@
  #include "trace.h"
  #include "exec/target_page.h"
  #include "hw/core/cpu.h"
+#include "hw/pci/pci_bridge.h"
  #include "hw/qdev-properties.h"
  #include "qapi/error.h"
  #include "qemu/jhash.h"
@@ -937,7 +938,8 @@ static void smmu_base_realize(DeviceState *dev, Error 
**errp)
   g_free, g_free);
  s->smmu_pcibus_by_busptr = g_hash_table_new(NULL, NULL);
  
-if (s->primary_bus) {

+if (s->primary_bus && object_dynamic_cast(OBJECT(s->primary_bus)->parent,
+  TYPE_PCI_HOST_BRIDGE)) {
  pci_setup_iommu(s->primary_bus, &smmu_ops, s);
  } else {
  error_setg(errp, "SMMU is not attached to any PCI bus!");










Re: [PATCH v3 0/6] hw/arm/virt: Add support for user creatable SMMUv3 device

2025-06-04 Thread Donald Dutile




On 6/4/25 10:02 PM, Nathan Chen wrote:

On 6/2/2025 8:41 AM, Shameer Kolothum wrote:

This patch series introduces support for a user-creatable SMMUv3 device
(-device arm-smmuv3) in QEMU.


Tested-by: Nathan Chen 

I am able to create 16 SMMUv3 devices in a qemu VM with emulated devices 
properly associated with the guest SMMUs in guest sysfs - verified with some 
guest SMMUs having two or three emulated NICs assigned to them while other 
guest SMMUs have a minimum of one assigned.

Thanks,
Nathan


Nathan,
Great test!
Can you share the xml &/or qemu cmdline, to demonstrate this broad capability?
Also would be a good reference when (we know it isn't 'if') a new/different 
config
that wasn't tested shows a bug, and we'll have more info to work with.
Thanks,
 Don




Re: [PATCH v3 0/6] hw/arm/virt: Add support for user creatable SMMUv3 device

2025-06-05 Thread Donald Dutile




On 6/5/25 1:58 PM, Nathan Chen wrote:



On 6/4/2025 7:34 PM, Donald Dutile wrote:

On 6/4/25 10:02 PM, Nathan Chen wrote:

On 6/2/2025 8:41 AM, Shameer Kolothum wrote:

This patch series introduces support for a user-creatable SMMUv3 device
(-device arm-smmuv3) in QEMU.


Tested-by: Nathan Chen 

I am able to create 16 SMMUv3 devices in a qemu VM with emulated devices 
properly associated with the guest SMMUs in guest sysfs - verified with some 
guest SMMUs having two or three emulated NICs assigned to them while other 
guest SMMUs have a minimum of one assigned.

Thanks,
Nathan


Nathan,
Great test!
Can you share the xml &/or qemu cmdline, to demonstrate this broad capability?
Also would be a good reference when (we know it isn't 'if') a new/ different 
config
that wasn't tested shows a bug, and we'll have more info to work with.


Yes, here is the qemu command line I used. smmuv3.15 is associated with two 
devices, smmuv3.0 is associated with two devices, and smmuv3.1 is associated 
with three devices:

qemu-system-aarch64 \
    -machine hmat=on -machine virt,accel=kvm,gic-version=3,ras=on \
    -cpu host -smp cpus=4 -m size=16G,slots=4,maxmem=32G -nographic \
    -bios /usr/share/AAVMF/AAVMF_CODE.fd \
    -device nvme,drive=nvme0,serial=deadbeaf1,bus=pcie.0 \
    -drive file=/home/nvidia/nathanc/noble-server-cloudimg-arm64.qcow2,index=0,media=disk,format=qcow2,if=none,id=nvme0 \
    -device e1000,romfile=/root/nathanc/efi-e1000.rom,netdev=net0,bus=pcie.0 \
    -netdev user,id=net0,hostfwd=tcp::5558-:22,hostfwd=tcp::5586-:5586 \
    -netdev user,id=net1 \
    -netdev user,id=net2 \
    -netdev user,id=net3 \
    -netdev user,id=net4 \
    -netdev user,id=net5 \
    -netdev user,id=net6 \
    -netdev user,id=net7 \
    -netdev user,id=net8 \
    -netdev user,id=net9 \
    -netdev user,id=net10 \
    -netdev user,id=net11 \
    -netdev user,id=net12 \
    -netdev user,id=net13 \
    -netdev user,id=net14 \
    -netdev user,id=net15 \
    -netdev user,id=net16 \
    -netdev user,id=net17 \
    -netdev user,id=net18 \
    -netdev user,id=net19 \
    -netdev user,id=net20 \
    -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.0 \
    -device virtio-net-pci,bus=pcie.0,netdev=net1 \
    -device virtio-net-pci,bus=pcie.0,netdev=net20 \
    -device pxb-pcie,id=pcie.1,bus_nr=200,bus=pcie.0 \
    -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.1 \
    -device pcie-root-port,id=pcie.port1,bus=pcie.1,slot=1,chassis=1 \
    -device virtio-net-pci,bus=pcie.port1,netdev=net2 \
    -device pcie-root-port,id=pcie.port17,bus=pcie.1,slot=17,chassis=17 \
    -device virtio-net-pci,bus=pcie.port17,netdev=net18 \
    -device pcie-root-port,id=pcie.port18,bus=pcie.1,slot=18,chassis=18 \
    -device virtio-net-pci,bus=pcie.port18,netdev=net19 \
    -device pxb-pcie,id=pcie.2,bus_nr=196,bus=pcie.0 \
    -device arm-smmuv3,primary-bus=pcie.2,id=smmuv3.2 \
    -device pcie-root-port,id=pcie.port2,bus=pcie.2,slot=2,chassis=2 \
    -device virtio-net-pci,bus=pcie.port2,netdev=net3 \
    -device pxb-pcie,id=pcie.3,bus_nr=192,bus=pcie.0 \
    -device arm-smmuv3,primary-bus=pcie.3,id=smmuv3.3 \
    -device pcie-root-port,id=pcie.port3,bus=pcie.3,slot=3,chassis=3 \
    -device virtio-net-pci,bus=pcie.port3,netdev=net4 \
    -device pxb-pcie,id=pcie.4,bus_nr=188,bus=pcie.0 \
    -device arm-smmuv3,primary-bus=pcie.4,id=smmuv3.4 \
    -device pcie-root-port,id=pcie.port4,bus=pcie.4,slot=4,chassis=4 \
    -device virtio-net-pci,bus=pcie.port4,netdev=net5 \
    -device pxb-pcie,id=pcie.5,bus_nr=184,bus=pcie.0 \
    -device arm-smmuv3,primary-bus=pcie.5,id=smmuv3.5 \
    -device pcie-root-port,id=pcie.port5,bus=pcie.5,slot=5,chassis=5 \
    -device virtio-net-pci,bus=pcie.port5,netdev=net6 \
    -device pxb-pcie,id=pcie.6,bus_nr=180,bus=pcie.0 \
    -device arm-smmuv3,primary-bus=pcie.6,id=smmuv3.6 \
    -device pcie-root-port,id=pcie.port6,bus=pcie.6,slot=6,chassis=6 \
    -device virtio-net-pci,bus=pcie.port6,netdev=net7 \
    -device pxb-pcie,id=pcie.7,bus_nr=176,bus=pcie.0 \
    -device arm-smmuv3,primary-bus=pcie.7,id=smmuv3.7 \
    -device pcie-root-port,id=pcie.port7,bus=pcie.7,slot=7,chassis=7 \
    -device virtio-net-pci,bus=pcie.port7,netdev=net8 \
    -device pxb-pcie,id=pcie.8,bus_nr=172,bus=pcie.0 \
    -device arm-smmuv3,primary-bus=pcie.8,id=smmuv3.8 \
    -device pcie-root-port,id=pcie.port8,bus=pcie.8,slot=8,chassis=8 \
    -device virtio-net-pci,bus=pcie.port8,netdev=net9 \
    -device pxb-pcie,id=pcie.9,bus_nr=168,bus=pcie.0 \
    -device arm-smmuv3,primary-bus=pcie.9,id=smmuv3.9 \
    -device pcie-root-port,id=pcie.port9,bus=pcie.9,slot=9,chassi

Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-07-09 Thread Donald Dutile




On 7/9/25 3:20 PM, Nicolin Chen wrote:

On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:

+enum {
+VIOMMU_CAP_STAGE1 = BIT_ULL(0),  /* stage1 page table supported */
+};


Thanks for this work. I am happy to see that we can share the
common code that allocates a NESTING_PARENT in the core using
this flag.

Yet on ARM, a STAGE1 page table isn't always a nested S1, the
hardware accelerated one. More often, it can be just a regular
1-stage translation table via emulated translation code and an
emulated iotlb.


Because the user-created smmuv3 started as 'accelerated smmuv3',
and had been 'de-accelerated' to simply 'user created smmuv3',
I'm looking for some clarification in the above statement/request.

Is the above supposed to reflect that a nested IOMMU has some hw-acceleration
in its Stage1 implementation?
If so, then call it that: STAGE1_ACCEL.
If it's supposed to represent that an IOMMU has nested/2-stage support,
then the above is a valid cap;  -but-, having a nested/2-stage support IOMMU
doesn't necessarily mean it's accelerated.


Well, there are an emulated "nested" mode and an hw-accelerated
"nested" mode in the smmuv3 code, so we had to choose something
like "accel" over "nested".

Here, on the other hand, I think the core using this CAP would
unlikely care about an emulated "nested" mode in the individual
vIOMMU..

So I suggested:
  /* hardware-accelerated nested stage-1 page table support */
 VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),

which it should be clear IMHO.

If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?

Thanks
Nicolin


If the distinction is hw-based s1 vs emulated-based s1, then
I'd prefer the use of VIOMMU_CAP_HW_NESTED_S1, and avoid the use
of 'accel'/'ACCEL' unless it is an explicitly stated 'acceleration'
feature/option in the SMMU spec.
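
i.e., something like the following -- a sketch of the naming only, not the final series code:

    enum {
        /* vIOMMU can use HW-accelerated nested (stage-1 over stage-2) translation
         * for passthrough devices */
        VIOMMU_CAP_HW_NESTED_S1 = BIT_ULL(0),
    };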

Thanks,
- Don




Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-07-09 Thread Donald Dutile




On 7/8/25 8:39 PM, Nicolin Chen wrote:

On Tue, Jul 08, 2025 at 07:05:43AM -0400, Zhenzhong Duan wrote:

diff --git a/include/hw/iommu.h b/include/hw/iommu.h
new file mode 100644
index 00..e80aaf4431
--- /dev/null
+++ b/include/hw/iommu.h
@@ -0,0 +1,16 @@
+/*
+ * General vIOMMU capabilities, flags, etc
+ *
+ * Copyright (C) 2025 Intel Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_IOMMU_H
+#define HW_IOMMU_H
+
+enum {
+VIOMMU_CAP_STAGE1 = BIT_ULL(0),  /* stage1 page table supported */
+};


Thanks for this work. I am happy to see that we can share the
common code that allocates a NESTING_PARENT in the core using
this flag.

Yet on ARM, a STAGE1 page table isn't always a nested S1, the
hardware accelerated one. More often, it can be just a regular
1-stage translation table via emulated translation code and an
emulated iotlb.


Because the user-created smmuv3 started as 'accelerated smmuv3',
and had been 'de-accelerated' to simply 'user created smmuv3',
I'm looking for some clarification in the above statement/request.

Is the above supposed to reflect that a nested IOMMU has some hw-acceleration
in its Stage1 implementation?
If so, then call it that: STAGE1_ACCEL.
If it's supposed to represent that an IOMMU has nested/2-stage support,
then the above is a valid cap;  -but-, having a nested/2-stage support IOMMU
doesn't necessarily mean it's accelerated.

So, let's ensure terminology, esp. bit-macro names(pace) reflects
the (exact) meaning, and not any indirect meaning.

A recent kvm-arm email thread had such indirect naming cleaned up
when referring to pfn-map/device-map/struct-page-mapped pages, which
I'd like not to start down a similar mis-naming/indirect-naming path here.

Look forward to the clarification.
- Don


I think this flag should indicate that the vIOMMU supports a
HW-accelerated nested S1 HWPT allocation/invalidation.

So, perhaps:
 /* hardware-accelerated nested stage-1 page table support */
 VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
?

Nicolin






Re: [PATCH v7 07/12] hw/pci: Introduce pci_setup_iommu_per_bus() for per-bus IOMMU ops retrieval

2025-07-10 Thread Donald Dutile




On 7/10/25 12:21 PM, Shameerali Kolothum Thodi wrote:




-Original Message-
From: Donald Dutile 
Sent: Thursday, July 10, 2025 5:00 PM
To: Shameerali Kolothum Thodi
; Nicolin Chen

Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org;
eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com;
berra...@redhat.com; imamm...@redhat.com; nath...@nvidia.com;
mo...@nvidia.com; smost...@google.com; gustavo.rom...@linaro.org;
m...@redhat.com; marcel.apfelb...@gmail.com; Linuxarm
; Wangzhou (B) ;
jiangkunkun ; Jonathan Cameron
; zhangfei@linaro.org
Subject: Re: [PATCH v7 07/12] hw/pci: Introduce
pci_setup_iommu_per_bus() for per-bus IOMMU ops retrieval



On 7/10/25 3:37 AM, Shameerali Kolothum Thodi wrote:




-Original Message-
From: Nicolin Chen 
Sent: Thursday, July 10, 2025 1:07 AM
To: Shameerali Kolothum Thodi



Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org;
eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com;
ddut...@redhat.com; berra...@redhat.com; imamm...@redhat.com;
nath...@nvidia.com; mo...@nvidia.com; smost...@google.com;
gustavo.rom...@linaro.org; m...@redhat.com;
marcel.apfelb...@gmail.com; Linuxarm ;

Wangzhou

(B) ; jiangkunkun

;

Jonathan Cameron ;
zhangfei@linaro.org
Subject: Re: [PATCH v7 07/12] hw/pci: Introduce
pci_setup_iommu_per_bus() for per-bus IOMMU ops retrieval

On Wed, Jul 09, 2025 at 08:20:35AM +, Shameerali Kolothum Thodi
wrote:

On Tue, Jul 08, 2025 at 04:40:50PM +0100, Shameer Kolothum wrote:

@@ -2909,6 +2909,19 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
   }
   }

+/*
+ * When multiple PCI Express Root Buses are defined using pxb-pcie,
+ * the IOMMU configuration may be specific to each root bus. However,
+ * pxb-pcie acts as a special root complex whose parent is effectively
+ * the default root complex(pcie.0). Ensure that we retrieve the
+ * correct IOMMU ops(if any) in such cases.
+ */
+if (pci_bus_is_express(iommu_bus) && pci_bus_is_root(iommu_bus)) {
+if (!iommu_bus->iommu_per_bus && parent_bus->iommu_per_bus) {
+break;


Mind elaborating why the current bus must unset iommu_per_bus while
its parent sets iommu_per_bus?

My understanding is that for a pxb-pcie we should set iommu_per_bus
but for its parent (the default root complex) we should unset its
iommu_per_bus?


Well, for the new arm-smmuv3 dev you need an associated pcie root
complex. Either the default pcie.0 or a pxb-pcie one. And as I
mentioned in my reply to the other thread (patch #2) and in the commit log here,
the pxb-pcie is a special extra root complex in Qemu which has pcie.0
as its parent bus.

The above pci_device_get_iommu_bus_devfn(), at present, iterates over the
parent_dev if it is set and returns the parent_bus IOMMU ops even if
the associated pxb-pcie bus doesn't have any IOMMU. This creates a
problem for the case that is described in the cover letter here,
https://lore.kernel.org/qemu-devel/20250708154055.101012-1-
shameerali.kolothum.th...@huawei.com/

(Please see "Major changes from v4:" section)

To address that issue, this patch introduces a new helper function to specify that
the IOMMU ops are specific to the associated root complex (iommu_per_bus) and
uses that to return the correct IOMMU ops.

Hope with that context it is clear now.


Hmm, I was not questioning the context, I get what the patch is
supposed to do.

I was asking about the logic; it is unclear to me why it breaks when:
  !pxb-pcie->iommu_per_bus && pcie.0->iommu_per_bus

Or in which case pcie.0 would be set to iommu_per_bus=true?


Yes. Consider the example I gave in cover  letter,

-device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1 \
-device virtio-net-pci,bus=pcie.0,netdev=net0,id=virtionet.0 \
-device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \
-device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.2 \
-device pcie-root-port,id=pcie.port1,chassis=2,bus=pcie.1 \
-device virtio-net-pci,bus=pcie.port1,netdev=net1,id=virtionet.1

pcie.0 is behind the new SMMUv3 dev (smmuv3.1) and has iommu_per_bus set.

pcie.1 has no SMMUv3 and iommu_per_bus is not set for it.

pcie.1 doesn't?   then what is this line saying/meaning?:
   -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.2 \

I read that as an smmuv3 attached to pcie.1, with an id of smmuv3.2; just
as I read this config:
   -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1 \
as an smmuv3 attached to pcie.0 with id smmuv3.1


Oops..I forgot to delete that from the config:
This is what I meant,

-device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1 \
-device virtio-net-pci,bus=pcie.0,netdev=net0,id=virtionet.0 \
-device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \
-device pcie-root-port,id=pcie.port1,chassis=2,bus=pcie.1 \
-device virtio-net-pci,bus=pcie.port1,netdev=net1,id=virtionet.1 \

Thanks,
Shameer


the dirt is in the details... thank

Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-07-10 Thread Donald Dutile




On 7/10/25 4:11 AM, Shameerali Kolothum Thodi wrote:




-Original Message-
From: Nicolin Chen 
Sent: Wednesday, July 9, 2025 8:20 PM
To: Donald Dutile 
Cc: Zhenzhong Duan ; qemu-
de...@nongnu.org; alex.william...@redhat.com; c...@redhat.com;
eric.au...@redhat.com; m...@redhat.com; jasow...@redhat.com;
pet...@redhat.com; j...@nvidia.com; Shameerali Kolothum Thodi
; joao.m.mart...@oracle.com;
clement.mathieu--d...@eviden.com; kevin.t...@intel.com;
yi.l@intel.com; chao.p.p...@intel.com
Subject: Re: [PATCH v3 02/20] hw/pci: Introduce
pci_device_get_viommu_cap()

On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:

+enum {
+VIOMMU_CAP_STAGE1 = BIT_ULL(0),  /* stage1 page table

supported */

+};


Thanks for this work. I am happy to see that we can share the
common code that allocates a NESTING_PARENT in the core using
this flag.

Yet on ARM, a STAGE1 page table isn't always a nested S1, the
hardware accelerated one. More often, it can be just a regular
1-stage translation table via emulated translation code and an
emulated iotlb.


Because the user-created smmuv3 started as 'accelerated smmuv3',
and had been 'de-accelerated' to simply 'user created smmuv3',
I'm looking for some clarification in the above statement/request.

Is the above suppose to reflect that a nested IOMMU has some hw-

acceleration

in its Stage1 implementation?
If so, then call it that: STAGE1_ACCEL.
If it's suppose to represent that an IOMMU has nested/2-stage support,
then the above is a valid cap;  -but-, having a nested/2-stage support

IOMMU

doesn't necessarily mean its accelerated.


Well, there are an emulated "nested" mode and an hw-accelerated
"nested" mode in the smmuv3 code, so we had to choose something
like "accel" over "nested".

Here, on the other hand, I think the core using this CAP would
unlikely care about an emulated "nested" mode in the individual
vIOMMU..

So I suggested:
  /* hardware-accelerated nested stage-1 page table support */
 VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),

which it should be clear IMHO.

If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?


I am not sure the _S1 part makes much sense in ARM case. It doesn't matter
whether the Guest SMMUv3 is configured in s1/s2 or nested for this CAP.
With the new SMMUv3 dev support, the user can pretty much specify,

-device 
arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1,accel=on,stage={stage1|stage2|nested}

And I think it will work with a host SMMUv3 nested configuration in all the
above cases. Unless I am missing something and we need to restrict its
use with stage=stage1 only.

Thanks,
Shameer






Re: [PATCH v7 07/12] hw/pci: Introduce pci_setup_iommu_per_bus() for per-bus IOMMU ops retrieval

2025-07-10 Thread Donald Dutile




On 7/10/25 3:37 AM, Shameerali Kolothum Thodi wrote:




-Original Message-
From: Nicolin Chen 
Sent: Thursday, July 10, 2025 1:07 AM
To: Shameerali Kolothum Thodi 
Cc: qemu-...@nongnu.org; qemu-devel@nongnu.org;
eric.au...@redhat.com; peter.mayd...@linaro.org; j...@nvidia.com;
ddut...@redhat.com; berra...@redhat.com; imamm...@redhat.com;
nath...@nvidia.com; mo...@nvidia.com; smost...@google.com;
gustavo.rom...@linaro.org; m...@redhat.com;
marcel.apfelb...@gmail.com; Linuxarm ;
Wangzhou (B) ; jiangkunkun
; Jonathan Cameron
; zhangfei@linaro.org
Subject: Re: [PATCH v7 07/12] hw/pci: Introduce
pci_setup_iommu_per_bus() for per-bus IOMMU ops retrieval

On Wed, Jul 09, 2025 at 08:20:35AM +, Shameerali Kolothum Thodi
wrote:

On Tue, Jul 08, 2025 at 04:40:50PM +0100, Shameer Kolothum wrote:

@@ -2909,6 +2909,19 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
  }
  }

+/*
+ * When multiple PCI Express Root Buses are defined using pxb-pcie,
+ * the IOMMU configuration may be specific to each root bus. However,
+ * pxb-pcie acts as a special root complex whose parent is effectively
+ * the default root complex(pcie.0). Ensure that we retrieve the
+ * correct IOMMU ops(if any) in such cases.
+ */
+if (pci_bus_is_express(iommu_bus) && pci_bus_is_root(iommu_bus)) {
+if (!iommu_bus->iommu_per_bus && parent_bus->iommu_per_bus) {
+break;


Mind elaborating why the current bus must unset iommu_per_bus while
its parent sets iommu_per_bus?

My understanding is that for a pxb-pcie we should set iommu_per_bus
but for its parent (the default root complex) we should unset its
iommu_per_bus?


Well, for the new arm-smmuv3 dev you need an associated pcie root
complex. Either the default pcie.0 or a pxb-pcie one. And as I
mentioned in my reply to the other thread (patch #2) and in the commit log here,
the pxb-pcie is a special extra root complex in Qemu which has pcie.0
as its parent bus.

The above pci_device_get_iommu_bus_devfn(), at present, iterates over the
parent_dev if it is set and returns the parent_bus IOMMU ops even if the
associated pxb-pcie bus doesn't have any IOMMU. This creates a problem
for a case that is described in the cover letter here,
https://lore.kernel.org/qemu-devel/20250708154055.101012-1-
shameerali.kolothum.th...@huawei.com/

(Please see "Major changes from v4:" section)

To address that issue, this patch introduces a new helper function to specify that
the IOMMU ops are specific to the associated root complex (iommu_per_bus) and
uses that to return the correct IOMMU ops.

Hope with that context it is clear now.


Hmm, I was not questioning the context, I get what the patch is
supposed to do.

I was asking about the logic; it is unclear to me why it breaks when:
 !pxb-pcie->iommu_per_bus && pcie.0->iommu_per_bus

Or in which case pcie.0 would be set to iommu_per_bus=true?


Yes. Consider the example I gave in cover  letter,

-device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1 \
-device virtio-net-pci,bus=pcie.0,netdev=net0,id=virtionet.0 \
-device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \
-device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.2 \
-device pcie-root-port,id=pcie.port1,chassis=2,bus=pcie.1 \
-device virtio-net-pci,bus=pcie.port1,netdev=net1,id=virtionet.1

pcie.0 is behind new SMMUv3 dev(smmuv3.1) and has iommu_per_bus set.
pcie.1 has no SMMUv3 and iommu_per_bus is not set for it.

pcie.1 doesn't?   then what is this line saying/meaning?:
 -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.2 \

I read that as an smmuv3 attached to pcie.1, with an id of smmuv3.2;
just as I read this config:
 -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1 \
as an smmuv3 attached to pcie.0 with id smmuv3.1



And we don't want pci_device_get_iommu_bus_devfn() to return pcie.0's
IOMMU ops for virtionet.1. Hence the break.

Thanks,
Shameer






Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-07-10 Thread Donald Dutile

oops, inadvertently hit send before I could add reply... apologies... reply 
below...

On 7/10/25 1:01 PM, Donald Dutile wrote:



On 7/10/25 4:11 AM, Shameerali Kolothum Thodi wrote:




-Original Message-
From: Nicolin Chen 
Sent: Wednesday, July 9, 2025 8:20 PM
To: Donald Dutile 
Cc: Zhenzhong Duan ; qemu-
de...@nongnu.org; alex.william...@redhat.com; c...@redhat.com;
eric.au...@redhat.com; m...@redhat.com; jasow...@redhat.com;
pet...@redhat.com; j...@nvidia.com; Shameerali Kolothum Thodi
; joao.m.mart...@oracle.com;
clement.mathieu--d...@eviden.com; kevin.t...@intel.com;
yi.l@intel.com; chao.p.p...@intel.com
Subject: Re: [PATCH v3 02/20] hw/pci: Introduce
pci_device_get_viommu_cap()

On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:

+enum {
+    VIOMMU_CAP_STAGE1 = BIT_ULL(0),  /* stage1 page table

supported */

+};


Thanks for this work. I am happy to see that we can share the
common code that allocates a NESTING_PARENT in the core using
this flag.

Yet on ARM, a STAGE1 page table isn't always a nested S1, the
hardware accelerated one. More often, it can be just a regular
1-stage translation table via emulated translation code and an
emulated iotlb.


Because the user-created smmuv3 started as 'accelerated smmuv3',
and had been 'de-accelerated' to simply 'user created smmuv3',
I'm looking for some clarification in the above statement/request.

Is the above suppose to reflect that a nested IOMMU has some hw-

acceleration

in its Stage1 implementation?
If so, then call it that: STAGE1_ACCEL.
If it's suppose to represent that an IOMMU has nested/2-stage support,
then the above is a valid cap;  -but-, having a nested/2-stage support

IOMMU

doesn't necessarily mean its accelerated.


Well, there are an emulated "nested" mode and an hw-accelerated
"nested" mode in the smmuv3 code, so we had to choose something
like "accel" over "nested".

Here, on the other hand, I think the core using this CAP would
unlikely care about an emulated "nested" mode in the individual
vIOMMU..

So I suggested:
  /* hardware-accelerated nested stage-1 page table support */
 VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),

which it should be clear IMHO.

If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?


I am not sure the _S1 part makes much sense in ARM case. It doesn't matter
whether the Guest SMMUv3 is configured in s1/s2 or nested for this CAP.
With the new SMMUv3 dev support, the user can pretty much specify,

-device 
arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1,accel=on,stage={stage1|stage2|nested}

And I think it will work with a host SMMUv3 nested configuration in all the
above cases. Unless I am missing something and we need to restrict its
use with stage=stage1 only.

Thanks,
Shameer


Shameer,
I didn't think of these config options this way... so is this CAP always 0 for 
smmuv3's associated VIOMMU?

or ... should intel-iommu follow this same/smmuv3 config option(s) and avoid 
this VIOMMU CAP req. ?

- Don




Re: [PATCH v6 01/12] hw/arm/virt-acpi-build: Don't create ITS id mappings by default

2025-07-03 Thread Donald Dutile




On 7/3/25 4:46 AM, Shameer Kolothum wrote:

Commit d6afe18b7242 ("hw/arm/virt-acpi-build: Fix ACPI IORT and MADT tables
when its=off") moved ITS group node generation under the its=on condition.
However, it still creates rc_its_idmaps unconditionally, which results in
duplicate ID mappings in the IORT table.

Fixes:d6afe18b7242 ("hw/arm/virt-acpi-build: Fix ACPI IORT and MADT tables when 
its=off")
Signed-off-by: Shameer Kolothum 
---
  hw/arm/virt-acpi-build.c | 6 --
  1 file changed, 6 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index cd90c47976..724fad5ffa 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -329,12 +329,6 @@ build_iort(GArray *table_data, BIOSLinker *linker, 
VirtMachineState *vms)
  /* Sort the smmu idmap by input_base */
  g_array_sort(rc_smmu_idmaps, iort_idmap_compare);
  
-/*

- * Knowing the ID ranges from the RC to the SMMU, it's possible to
- * determine the ID ranges from RC that are directed to the ITS.
- */
-create_rc_its_idmaps(rc_its_idmaps, rc_smmu_idmaps);
-
  nb_nodes = 2; /* RC and SMMUv3 */
  rc_mapping_count = rc_smmu_idmaps->len;
  

Reviewed-by: Donald Dutile 




Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()

2025-07-15 Thread Donald Dutile




On 7/15/25 11:28 AM, Eric Auger wrote:

Hi,

On 7/10/25 3:22 AM, Donald Dutile wrote:



On 7/9/25 3:20 PM, Nicolin Chen wrote:

On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:

+enum {
+    VIOMMU_CAP_STAGE1 = BIT_ULL(0),  /* stage1 page table
supported */
+};


Thanks for this work. I am happy to see that we can share the
common code that allocates a NESTING_PARENT in the core using
this flag.

Yet on ARM, a STAGE1 page table isn't always a nested S1, the
hardware accelerated one. More often, it can be just a regular
1-stage translation table via emulated translation code and an
emulated iotlb.


Because the user-created smmuv3 started as 'accelerated smmuv3',
and had been 'de-accelerated' to simply 'user created smmuv3',
I'm looking for some clarification in the above statement/request.

Is the above suppose to reflect that a nested IOMMU has some
hw-acceleration
in its Stage1 implementation?
If so, then call it that: STAGE1_ACCEL.
If it's suppose to represent that an IOMMU has nested/2-stage support,
then the above is a valid cap;  -but-, having a nested/2-stage
support IOMMU
doesn't necessarily mean its accelerated.


Well, there are an emulated "nested" mode and an hw-accelerated
"nested" mode in the smmuv3 code, so we had to choose something
like "accel" over "nested".

Here, on the other hand, I think the core using this CAP would
unlikely care about an emulated "nested" mode in the individual
vIOMMU..

So I suggested:
   /* hardware-accelerated nested stage-1 page table support */
  VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),

which it should be clear IMHO.

If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?

Thanks
Nicolin


If the distinction is hw-based s1 vs emulated-based s1, then
I'd prefer the use of VIOMMU_CAP_HW_NESTED_S1, and avoid the use

VIOMMU_CAP_HW_NESTED_S1 or even VIOMMU_CAP_HW_NESTED look good to me too

Thanks

Eric

+1 to VIOMMU_CAP_HW_NESTED ; the S1 is unnecessary if nested IOMMU is the info 
being conveyed.
- Don


of 'accel'/'ACCEL' unless it is an explicitly stated 'acceleration'
feature/option in the SMMU spec.

Thanks,
- Don