** Description changed:
[ Impact ]
iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
The iotlb_sync_map iommu op allows drivers to perform the necessary cache
flushes when new mappings are established. For the Intel IOMMU driver,
this callback serves two purposes:
- To flush caches when a second-stage page table is attached to a device
whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-
resident remapping structures are visible to hardware (CAP_REG.RWBF==1).
However, in scenarios where neither caching mode nor the RWBF flag is
active, the cache_tag_flush_range_np() helper, which is called in the
iotlb_sync_map path, effectively becomes a no-op.
Despite being a no-op, cache_tag_flush_range_np() still iterates through
all cache tags of the IOMMUs attached to the domain, protected by a
spinlock. This unnecessary execution path introduces overhead, leading to
a measurable I/O performance regression. On systems with NVMes under the
same bridge, performance was observed to drop from approximately
6150 MiB/s down to 4985 MiB/s.
Introduce a flag in the dmar_domain structure. This flag is set only when
iotlb_sync_map is required (i.e., when CM or RWBF is set), and
cache_tag_flush_range_np() is called only for domains where this flag is
set. Once set, the flag is immutable, given that there won't be mixed
configurations in real-world scenarios where some IOMMUs in a system
operate in caching mode while others do not. Theoretically, the
immutability of this flag does not impact functionality.
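Whether a given system needs these flushes at all can be checked from the
VT-d capability register that the intel-iommu driver exposes in sysfs. A
rough check, assuming the standard intel-iommu sysfs attributes (per the
VT-d spec, CM is bit 7 and RWBF is bit 4 of CAP_REG):
# Assumes /sys/class/iommu/dmarN/intel-iommu/cap exists and prints the
# capability register as hex without a leading 0x.
for cap in /sys/class/iommu/dmar*/intel-iommu/cap; do
    val=$(cat "$cap")
    echo "$cap: CM=$(( (0x$val >> 7) & 1 )) RWBF=$(( (0x$val >> 4) & 1 ))"
done
If both bits read 0 on every IOMMU, iotlb_sync_map has no real work to do
and the patched kernel skips cache_tag_flush_range_np() for such domains.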
[ Fix ]
Backport the following commits:
- 12724ce3fe1a iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF
modes
+ - b9434ba97c44 iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
+ - b33125296b50 iommu/vt-d: Create unique domain ops for each stage
+ - 0fa6f0893466 iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
+ - 85cfaacc9937 iommu/vt-d: Split paging_domain_compatible()
- cee686775f9c iommu/vt-d: Make iotlb_sync_map a static property of
dmar_domain
to Plucky.
[ Test Plan ]
Run fio against two NVMes under the same PCI bridge (dual port NVMe):
$ sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8
--time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting
--new_group --name=job1 --filename=/dev/nvmeXnY --new_group --name=job2
--filename=/dev/nvmeWnZ
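To pick nvmeXnY and nvmeWnZ, group the NVMe controllers by their PCI
parent: controllers that report the same parent sit under the same bridge.
A rough sketch, assuming the usual sysfs layout (lspci -t shows the same
topology):
# Assumes /sys/class/nvme/nvmeN/device resolves to the controller's PCI
# function; the parent directory is its upstream bridge/port.
for c in /sys/class/nvme/nvme*; do
    echo "$(basename "$c") -> $(basename "$(dirname "$(readlink -f "$c/device")")")"
done
A dual port NVMe typically shows up as two functions of the same PCI
device (e.g. 0000:3d:00.0 and 0000:3d:00.1) behind a single bridge.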
Verify that the speed reached with the two NVMes under the same bridge is
the same as would have been reached if the two NVMes were not under the
same bridge.
[ Regression Potential ]
This fix affects the Intel IOMMU (VT-d) driver.
An issue with this fix may introduce problems such as
incorrect omission of required IOTLB cache or write buffer flushes
when attaching devices to a domain.
This could result in memory remapping structures not being visible
to hardware in configurations that actually require synchronization.
As a consequence, devices performing DMA may exhibit data corruption,
access violations, or inconsistent behavior due to stale or incomplete
translations being used by the hardware.
---
[Description]
A performance regression has been reported when running fio against two
NVMe devices under the same PCI bridge (dual port NVMe).
The issue was initially reported for the 6.11-hwe kernel for Noble.
The performance regression was introduced in the 6.10 upstream kernel and is
still present in 6.16 (build at commit e540341508ce2f6e27810106253d5).
Bisection pointed to commit 129dab6e1286 ("iommu/vt-d: Use
cache_tag_flush_range_np() in iotlb_sync_map").
In our tests we observe ~6150 MiB/s when the NVMe devices are on
different bridges and ~4985 MiB/s when under the same bridge.
Before the offending commit we observe ~6150 MiB/s, regardless of NVMe
device placement.
[Test Case]
We can reproduce the issue on GCP on the Z3 metal instance type
(z3-highmem-192-highlssd-metal) [1].
You need to have 2 NVMe devices under the same bridge, e.g.:
# nvme list -v
...
Device   SN                   MN            FR        TxPort  Address       Slot  Subsystem      Namespaces
-------- -------------------- ------------- --------- ------- ------------- ----- -------------- ------------
nvme0    nvme_card-pd         nvme_card-pd  (null)    pcie    0000:05:00.1        nvme-subsys0   nvme0n1
nvme1    3DE4D285C21A7C001.0  nvme_card     00000000  pcie    0000:3d:00.0        nvme-subsys1   nvme1n1
nvme10   3DE4D285C21A7C001.1  nvme_card     00000000  pcie    0000:3d:00.1        nvme-subsys10  nvme10n1
nvme11   3DE4D285C2027C000.0  nvme_card     00000000  pcie    0000:3e:00.0        nvme-subsys11  nvme11n1
nvme12   3DE4D285C2027C000.1  nvme_card     00000000  pcie    0000:3e:00.1        nvme-subsys12  nvme12n1
nvme2    3DE4D285C2368C001.0  nvme_card     00000000  pcie    0000:b7:00.0        nvme-subsys2   nvme2n1
nvme3    3DE4D285C22A74001.0  nvme_card     00000000  pcie    0000:86:00.0        nvme-subsys3   nvme3n1
nvme4    3DE4D285C22A74001.1  nvme_card     00000000  pcie    0000:86:00.1        nvme-subsys4   nvme4n1
nvme5    3DE4D285C2368C001.1  nvme_card     00000000  pcie    0000:b7:00.1        nvme-subsys5   nvme5n1
nvme6    3DE4D285C21274000.0  nvme_card     00000000  pcie    0000:87:00.0        nvme-subsys6   nvme6n1
nvme7    3DE4D285C21094000.0  nvme_card     00000000  pcie    0000:b8:00.0        nvme-subsys7   nvme7n1
nvme8    3DE4D285C21274000.1  nvme_card     00000000  pcie    0000:87:00.1        nvme-subsys8   nvme8n1
nvme9    3DE4D285C21094000.1  nvme_card     00000000  pcie    0000:b8:00.1        nvme-subsys9   nvme9n1
...
For the output above, drives nvme1n1 and nvme10n1 are under the same
bridge, and judging by the SN it appears to be a dual port NVMe.
- Under the same bridge
Run fio against nvme1n1 and nvme10n1; observe ~4897 MiB/s after a short
initial spike at ~6150 MiB/s.
# sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8
--time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting
--new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2
--filename=/dev/nvme10n1
...
Jobs: 16 (f=16): [r(16)][100.0%][r=4897MiB/s][r=1254k IOPS][eta 00m:00s]
...
- Under different bridges
Run fio against nvme1n1 and nvme11n1; observe:
# sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8
--time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting
--new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2
--filename=/dev/nvme11n1
...
Jobs: 16 (f=16): [r(16)][100.0%][r=6153MiB/s][r=1575k IOPS][eta 00m:00s]
...
** So far, we haven't been able to reproduce it on another machine, but
we suspect it will be reproducible on any machine with a dual port NVMe.
[Other]
In spreadsheet [2], there is some profiling data for different kernel
versions, showing a consistent performance difference between kernel
versions.
Offending commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=129dab6e1286525fe5baed860d3dfcd9c6b4b327
The issue was reported upstream [3].
[1]
https://cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types
[2]
https://docs.google.com/spreadsheets/d/19F0Vvgz0ztFpDX4E37E_o8JYrJ04iYJz-1cqU-j4Umk/edit?gid=1544333169#gid=1544333169
[3]
https://lore.kernel.org/regressions/[email protected]/
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2115738
Title:
I/O performance regression on NVMes under same bridge (dual port nvme)
Status in linux package in Ubuntu:
In Progress
Status in linux source package in Oracular:
Won't Fix
Status in linux source package in Plucky:
In Progress
Status in linux source package in Questing:
In Progress
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738/+subscriptions
--
Mailing list: https://launchpad.net/~kernel-packages
Post to : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp