On 03/11/2025 13:16, Milan Djokic wrote:
Hello Volodymyr, Julien
Hi Milan,
Thanks for the new update. For the future, can you trim your reply?
Sorry for the delayed follow-up on this topic.
We have changed the vIOMMU design from a 1-N to an N-N mapping between vIOMMU
and pIOMMU. Considering the single-vIOMMU model limitation pointed out by
Volodymyr (SID overlaps), a vIOMMU-per-pIOMMU model turned out to be the
only proper solution.
I am not sure I fully understand. My assumption with the single vIOMMU
is you have a virtual SID that would be mapped to a (pIOMMU, physical
SID). Does this mean in your solution you will end up with multiple
vPCI as well and then map pBDF == vBDF? (this is because the SID has to be
fixed at boot)
Following is the updated design document.
I have added additional details to the design and performance impact
sections, and also indicated future improvements. The security
considerations section is unchanged apart from some minor details updated
according to review comments.
Let me know what you think about the updated design. Once approved, I
will send the updated vIOMMU patch series.
==========================================================
Design Proposal: Add SMMUv3 Stage-1 Support for XEN Guests
==========================================================
:Author: Milan Djokic <[email protected]>
:Date: 2025-11-03
:Status: Draft
Introduction
============
The SMMUv3 supports two stages of translation. Each stage of translation
can be
independently enabled. An incoming address is logically translated from
VA to
IPA in stage 1, then the IPA is input to stage 2 which translates the
IPA to
the output PA. Stage 1 translation support is required to provide
isolation between different devices within the OS. Xen already supports
Stage 2 translation, but there is no support for Stage 1 translation.
This design proposal outlines the introduction of Stage-1 SMMUv3 support
in Xen for ARM guests.
Motivation
==========
ARM systems utilizing SMMUv3 require stage-1 address translation to
ensure secure DMA and guest-managed I/O memory mappings.
With stage-1 enabled, the guest manages IOVA to IPA mappings through its
own IOMMU driver.
This feature enables:
- Stage-1 translation in guest domain
- Safe device passthrough with per-device address translation table
I find this misleading. Even without this feature, device passthrough is
still safe in the sense a device will be isolated (assuming all the DMA
goes through the IOMMU) and will not be able to DMA outside of the guest
memory. What the stage-1 is doing is providing an extra layer to control
what each device can see. This is useful if you don't trust your devices
or you want to assign a device to userspace (e.g. for DPDK).
Design Overview
===============
These changes provide emulated SMMUv3 support:
If my understanding is correct, there are also some implications in how
we create the PCI topology. It would be good to spell them out.
- **SMMUv3 Stage-1 Translation**: stage-1 and nested translation support
in SMMUv3 driver.
- **vIOMMU Abstraction**: Virtual IOMMU framework for guest stage-1
handling.
- **Register/Command Emulation**: SMMUv3 register emulation and command
queue handling.
- **Device Tree Extensions**: Adds `iommus` and virtual SMMUv3 nodes to
device trees for dom0 and dom0less scenarios.
What about ACPI?
- **Runtime Configuration**: Introduces a `viommu` boot parameter for
dynamic enablement.
A separate vIOMMU device is exposed to the guest for every physical IOMMU
in the system.
The vIOMMU feature is designed to provide a generic vIOMMU framework and a
backend implementation for the target IOMMU as separate components.
The backend implementation contains the IOMMU-specific structures and
command handling (only SMMUv3 is currently supported).
This structure allows potential reuse of the stage-1 feature for other
IOMMU types.
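
As an illustration only (names and layout are hypothetical, not the actual
interface in the series), the split between the generic framework and a
per-IOMMU backend could look roughly like this::

    /* One set of callbacks per IOMMU type; vSMMUv3 would be one backend. */
    #include <stdint.h>

    struct domain;                      /* guest domain, opaque here */

    struct viommu_ops {
        int  (*init)(struct domain *d, uint64_t gpa_base, uint64_t size);
        int  (*mmio_read)(struct domain *d, uint64_t offset, uint32_t *val);
        int  (*mmio_write)(struct domain *d, uint64_t offset, uint32_t val);
        void (*teardown)(struct domain *d);
    };

    /* One virtual IOMMU instance, exposed per physical IOMMU. */
    struct viommu {
        const struct viommu_ops *ops;   /* backend, e.g. vSMMUv3 */
        void *backend_priv;             /* backend-specific state */
        uint64_t gpa_base;              /* guest-physical MMIO window */
        uint64_t size;
    };
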
Security Considerations
=======================
**viommu security benefits:**
- Stage-1 translation ensures guest devices cannot perform unauthorized
DMA (device I/O address mappings are managed by the guest).
- An emulated IOMMU removes the guest's direct dependency on the IOMMU
hardware, while maintaining domain isolation.
Sorry, I don't follow this argument. Are you saying that it would be
possible to emulate a SMMUv3 vIOMMU on top of the IPMMU?
1. Observation:
---------------
Support for Stage-1 translation in SMMUv3 introduces new data structures
(`s1_cfg` alongside `s2_cfg`)
and logic to write both Stage-1 and Stage-2 entries in the Stream Table
Entry (STE), including an `abort`
field to handle partial configuration states.
**Risk:**
Without proper handling, a partially applied Stage-1 configuration might
leave guest DMA mappings in an
inconsistent state, potentially enabling unauthorized access or causing
cross-domain interference.
How so? Even if you misconfigure the S1, the S2 would still be properly
configured (you just mention partially applied stage-1).
**Mitigation:** *(Handled by design)*
This feature introduces logic that writes both `s1_cfg` and `s2_cfg` to the
STE and manages the `abort` field, only considering the Stage-1
configuration once it is fully attached. This ensures incomplete or invalid
guest configurations are safely ignored by the hypervisor.
Can you clarify what you mean by invalid guest configurations?
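
For illustration, a rough sketch of the behaviour described in the
mitigation (field names and encodings below are placeholders, not the real
STE layout) could be::

    #include <stdbool.h>
    #include <stdint.h>

    enum ste_config { STE_ABORT, STE_S2_ONLY, STE_NESTED };

    struct s1_cfg { bool attached; bool abort; uint64_t cd_table_gpa; };
    struct s2_cfg { uint64_t vttbr; };
    struct ste    { enum ste_config config; uint64_t s1_ctx; uint64_t s2_ttb; };

    static void write_ste(struct ste *ste, const struct s1_cfg *s1,
                          const struct s2_cfg *s2)
    {
        /* The Xen-owned stage-2 translation is always present. */
        ste->s2_ttb = s2->vttbr;

        if (s1 && s1->abort) {
            ste->config = STE_ABORT;        /* guest asked to abort transactions */
        } else if (!s1 || !s1->attached) {
            ste->config = STE_S2_ONLY;      /* incomplete stage-1: ignore it */
        } else {
            ste->config = STE_NESTED;       /* stage-1 + stage-2 */
            ste->s1_ctx = s1->cd_table_gpa; /* guest CD table (IPA) for nested walk */
        }
    }
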
2. Observation:
---------------
Guests can now invalidate Stage-1 caches; these invalidations need to be
forwarded to the SMMUv3 hardware to maintain coherence.
**Risk:**
Failing to propagate cache invalidations could leave stale mappings in
place, enabling access through old mappings and possibly data leakage or
misrouting.
You are referring to data leakage/misrouting between two devices owned by
the same guest, right? Xen would still be in charge of flushing when the
stage-2 is updated.
**Mitigation:** *(Handled by design)*
This feature ensures that guest-initiated invalidations are correctly
forwarded to the hardware,
preserving IOMMU coherency.
How is this a mitigation? You have to properly handle commands. If you
don't properly handle them, then yes it will break.
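
As an illustration of what "correctly forwarded" could mean in practice
(the helpers and field offsets below are hypothetical, not the proposed
implementation), a guest stage-1 TLB invalidation would be rewritten before
being queued on the physical SMMU::

    #include <stdint.h>

    struct domain;                           /* opaque guest handle */
    struct smmu_cmd { uint64_t w[2]; };      /* 16-byte SMMUv3 command */

    /* Hypothetical lookups maintained by the vIOMMU framework. */
    extern uint16_t domain_to_s2_vmid(struct domain *d);
    extern int psmmu_queue_cmd(const struct smmu_cmd *cmd);

    static int forward_tlbi(struct domain *d, const struct smmu_cmd *vcmd)
    {
        struct smmu_cmd pcmd = *vcmd;

        /* Replace the guest-visible VMID with the one Xen assigned at stage-2,
         * so the guest can only invalidate translations of its own devices. */
        pcmd.w[0] &= ~(0xffffULL << 32);                 /* placeholder field */
        pcmd.w[0] |= (uint64_t)domain_to_s2_vmid(d) << 32;

        return psmmu_queue_cmd(&pcmd);       /* push to the physical queue */
    }
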
4. Observation:
---------------
The code includes transformations to handle nested translation versus
standard modes and uses guest-configured
command queues (e.g., `CMD_CFGI_STE`) and event notifications.
**Risk:**
Malicious or malformed queue commands from guests could bypass
validation, manipulate SMMUv3 state,
or cause system instability.
**Mitigation:** *(Handled by design)*
Built-in validation of command queue entries and sanitization mechanisms
ensure only permitted configurations
are applied.
This is true as long as we didn't make a mistake in the configurations ;).
This is supported via additions in `vsmmuv3` and `cmdqueue`
handling code.
5. Observation:
---------------
Device Tree modifications enable device assignment and configuration
through guest DT fragments (e.g., `iommus`), which are added via `libxl`.
**Risk:**
Erroneous or malicious Device Tree injection could result in device
misbinding or guest access to unauthorized
hardware.
The DT fragments are not security supported and will never be, at least
until we have a libfdt that is able to detect a malformed Device-Tree
(I haven't checked if this has changed recently).
**Mitigation:**
- `libxl` performs checks of the guest configuration and parses only
predefined DT fragments and nodes, reducing risk.
- The system integrator must ensure correct resource mapping in the
guest Device Tree (DT) fragments.
6. Observation:
---------------
Introducing optional per-guest enabled features (`viommu` argument in xl
guest config) means some guests
may opt-out.
**Risk:**
Differences between guests with and without `viommu` may cause
unexpected behavior or privilege drift.
I don't understand this risk. Can you clarify?
**Mitigation:**
Verify that downgrade paths are safe and well-isolated; ensure missing
support doesn't cause security issues.
Additional audits on emulation paths and domain interference need to be
performed in a multi-guest environment.
7. Observation:
---------------
This observation, together with 8 and 9, is the most important, but it
seems to be missing some details on how this will be implemented. I will
try to provide some questions that should help fill the gaps.
Guests have the ability to issue Stage-1 IOMMU commands like cache
invalidation, stream table entry configuration, etc. An adversarial guest
may issue a high volume of commands in rapid succession.
**Risk:**
Excessive command requests can cause high hypervisor CPU consumption
and disrupt scheduling, leading to degraded system responsiveness and
potential denial-of-service scenarios.
**Mitigation:**
- The Xen scheduler limits guest vCPU execution time, providing basic
guest rate-limiting.
This really depends on your scheduler. Some schedulers (e.g. NULL) will
not do any scheduling at all. Furthermore, the scheduler only preempts
EL1/EL0. It doesn't preempt EL2, so any long-running operation needs
manual preemption. Therefore, I wouldn't consider this as a mitigation.
- Batch multiple commands of the same type to reduce overhead on the
virtual SMMUv3 hardware emulation.
The guest can send commands in any order. So can you expand how this
would work? Maybe with some example.
- Implement vIOMMU command execution restart and continuation support
(a possible shape is sketched after this observation).
This needs a bit more details. How will you decide whether to restart
and what would be the action? (I guess it will be re-executing the
instruction to write to the CWRITER).
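
One possible shape for such bounded processing with a continuation
(structures and names below are hypothetical, this is not the proposed
implementation) would be to emulate at most a fixed number of virtual
command-queue entries per trap and restart the kicking write later if
entries remain::

    #include <stdbool.h>
    #include <stdint.h>

    #define VSMMU_CMDS_PER_PASS 32u          /* illustrative per-trap budget */

    struct vsmmu_cmdq {
        uint64_t *base;                      /* guest queue, two 64-bit words/entry */
        uint32_t  prod, cons;                /* producer/consumer indices */
        uint32_t  ents;                      /* number of entries, power of two */
    };

    /* Stub: validate the command and emulate/forward it as appropriate. */
    static int handle_one_cmd(uint64_t w0, uint64_t w1)
    {
        (void)w0; (void)w1;
        return 0;
    }

    /* Returns true when the queue is drained, false if a continuation is needed. */
    static bool vsmmu_cmdq_process(struct vsmmu_cmdq *q)
    {
        unsigned int budget = VSMMU_CMDS_PER_PASS;

        while (q->cons != q->prod && budget--) {
            uint64_t *cmd = &q->base[(q->cons & (q->ents - 1)) * 2];

            if (handle_one_cmd(cmd[0], cmd[1]))
                break;                       /* invalid command: stop and report */
            q->cons++;
        }

        return q->cons == q->prod;
    }
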
8. Observation:
---------------
Some guest commands issued towards the vIOMMU are propagated to the
pIOMMU command queue (e.g. TLB invalidate).
**Risk:**
Excessive command requests from an abusive guest can flood the physical
IOMMU command queue, leading to degraded pIOMMU responsiveness for
commands issued by other guests.
**Mitigation:**
- The Xen credit scheduler limits guest vCPU execution time, providing
basic guest rate-limiting.
Same as above. This mitigation cannot be used.
- Batch commands which should be propagated towards the pIOMMU command
queue, and enable support for batch execution pause/continuation.
Can this be expanded?
- If possible, implement domain penalization by adding a per-domain cost
counter for vIOMMU/pIOMMU usage (see the sketch after this observation).
Can this be expanded?
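
One possible shape for such a per-domain cost counter (purely illustrative,
not existing Xen code): each command forwarded to the physical queue
charges the issuing domain, and once the budget for the current period is
spent, further forwarding is deferred until a periodic refill, penalising
only the abusive domain::

    #include <stdbool.h>
    #include <stdint.h>

    struct viommu_budget {
        uint32_t credit;                    /* commands left this period */
        uint32_t credit_per_period;         /* refill value */
    };

    static bool viommu_charge(struct viommu_budget *b, uint32_t cost)
    {
        if (b->credit < cost)
            return false;                   /* over budget: defer the command */
        b->credit -= cost;
        return true;
    }

    static void viommu_replenish(struct viommu_budget *b)
    {
        b->credit = b->credit_per_period;   /* called from a periodic timer */
    }
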
9. Observation:
---------------
The vIOMMU feature includes an event queue used for forwarding IOMMU
events to the guest (e.g. translation faults, invalid stream IDs,
permission errors).
A malicious guest can misconfigure its SMMU state or intentionally
trigger faults at high frequency.
**Risk:**
Occurance of IOMMU events with high frequency can cause Xen to flood the
s/occurance/occurrence/
event queue and disrupt scheduling with
high hypervisor CPU load for event handling.
**Mitigation:**
- Implement a fail-safe state by disabling event forwarding when faults
occur at high frequency and are not processed by the guest (a possible
shape is sketched after this observation).
I am not sure to understand how this would work. Can you expand?
- Batch multiple events of the same type to reduce overhead on the
virtual SMMUv3 hardware emulation.
Ditto.
- Consider disabling event queue for untrusted guests
My understanding is there is only a single physical event queue. Xen
would be responsible for handling the events in the queue and forwarding
them to the respective guests. If so, it is not clear what you mean by
"disable event queue".
Performance Impact
==================
With IOMMU stage-1 and nested translation included, performance overhead
is introduced compared to the existing, stage-2-only usage in Xen. Once
mappings are established, translations should not introduce significant
overhead.
Emulated paths may introduce moderate overhead, primarily affecting
device initialization and event handling.
Performance impact highly depends on target CPU capabilities.
Testing is performed on QEMU virt and Renesas R-Car (QEMU emulated)
platforms.
I am afraid QEMU is not a reliable platform to do performance testing.
Don't you have real HW with vIOMMU support?
[...]
References
==========
- Original feature implemented by Rahul Singh:
https://patchwork.kernel.org/project/xen-devel/cover/[email protected]/
- SMMUv3 architecture documentation
- Existing vIOMMU code patterns
I am not sure what this is referring to?
Cheers,
--
Julien Grall