** Description changed:
[ Impact ]
iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
The iotlb_sync_map iommu op allows drivers to perform the necessary cache
flushes when new mappings are established. For the Intel IOMMU driver,
this callback serves two purposes:
- To flush caches when a second-stage page table is attached to a device
whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-
resident remapping structures are visible to hardware (CAP_REG.RWBF==1).
However, in scenarios where neither caching mode nor the RWBF flag is
active, the cache_tag_flush_range_np() helper, which is called in the
iotlb_sync_map path, effectively becomes a no-op.
Despite being a no-op, cache_tag_flush_range_np() still iterates through
all cache tags of the IOMMUs attached to the domain, protected by a
spinlock. This unnecessary execution path introduces overhead, leading to
a measurable I/O performance regression. On systems with NVMes under the
same bridge, performance was observed to drop from approximately
6150 MiB/s down to 4985 MiB/s.
Introduce a flag in the dmar_domain structure. This flag is set only when
iotlb_sync_map is required (i.e., when CM or RWBF is set), and
cache_tag_flush_range_np() is called only for domains where this flag is
set. Once set, the flag is immutable, given that there won't be mixed
configurations in real-world scenarios where some IOMMUs in a system
operate in caching mode while others do not. Theoretically, the
immutability of this flag does not impact functionality.
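Whether a given system needs these flushes at all can be checked from the
VT-d capability register that the intel-iommu driver exposes in sysfs. A
rough check, assuming the standard intel-iommu sysfs attributes (per the
VT-d spec, CM is bit 7 and RWBF is bit 4 of CAP_REG):
# Assumes /sys/class/iommu/dmarN/intel-iommu/cap exists and prints the
# capability register as hex without a leading 0x.
for cap in /sys/class/iommu/dmar*/intel-iommu/cap; do
    val=$(cat "$cap")
    echo "$cap: CM=$(( (0x$val >> 7) & 1 )) RWBF=$(( (0x$val >> 4) & 1 ))"
done
If both bits read 0 on every IOMMU, iotlb_sync_map has no real work to do
and the patched kernel skips cache_tag_flush_range_np() for such domains.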
[ Fix ]
Backport the following commits:
- 12724ce3fe1a iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF
modes
+ - b9434ba97c44 iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
+ - b33125296b50 iommu/vt-d: Create unique domain ops for each stage
+ - 0fa6f0893466 iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
+ - 85cfaacc9937 iommu/vt-d: Split paging_domain_compatible()
- cee686775f9c iommu/vt-d: Make iotlb_sync_map a static property of
dmar_domain
to Plucky.
[ Test Plan ]
Run fio against two NVMes under the same PCI bridge (dual port NVMe):
$ sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8
--time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting
--new_group --name=job1 --filename=/dev/nvmeXnY --new_group --name=job2
--filename=/dev/nvmeWnZ
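To pick nvmeXnY and nvmeWnZ, group the NVMe controllers by their PCI
parent: controllers that report the same parent sit under the same bridge.
A rough sketch, assuming the usual sysfs layout (lspci -t shows the same
topology):
# Assumes /sys/class/nvme/nvmeN/device resolves to the controller's PCI
# function; the parent directory is its upstream bridge/port.
for c in /sys/class/nvme/nvme*; do
    echo "$(basename "$c") -> $(basename "$(dirname "$(readlink -f "$c/device")")")"
done
A dual port NVMe typically shows up as two functions of the same PCI
device (e.g. 0000:3d:00.0 and 0000:3d:00.1) behind a single bridge.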
Verify that the speed reached with the two NVMes under the same bridge is
the same as would have been reached if the two NVMes were not under the
same bridge.
[ Regression Potential ]
This fix affects the Intel IOMMU (VT-d) driver.
An issue with this fix may introduce problems such as
incorrect omission of required IOTLB cache or write buffer flushes
when attaching devices to a domain.
This could result in memory remapping structures not being visible
to hardware in configurations that actually require synchronization.
As a consequence, devices performing DMA may exhibit data corruption,
access violations, or inconsistent behavior due to stale or incomplete
translations being used by the hardware.
---
[Description]
A performance regression has been reported when running fio against two
NVMe devices under the same PCI bridge (dual port NVMe).
The issue was initially reported for the 6.11-hwe kernel for Noble.
The performance regression was introduced in the 6.10 upstream kernel and is
still present in 6.16 (build at commit e540341508ce2f6e27810106253d5).
Bisection pointed to commit 129dab6e1286 ("iommu/vt-d: Use
cache_tag_flush_range_np() in iotlb_sync_map").
In our tests we observe ~6150 MiB/s when the NVMe devices are on
different bridges and ~4985 MiB/s when under the same bridge.
Before the offending commit we observe ~6150 MiB/s, regardless of NVMe
device placement.
[Test Case]
We can reproduce the issue on GCP on the Z3 metal instance type
(z3-highmem-192-highlssd-metal) [1].
You need to have 2 NVMe devices under the same bridge, e.g.:
# nvme list -v
...
Device   SN                   MN            FR        TxPort  Address       Slot  Subsystem      Namespaces
-------- -------------------- ------------- --------- ------- ------------- ----- -------------- ------------
nvme0    nvme_card-pd         nvme_card-pd  (null)    pcie    0000:05:00.1        nvme-subsys0   nvme0n1
nvme1    3DE4D285C21A7C001.0  nvme_card     00000000  pcie    0000:3d:00.0        nvme-subsys1   nvme1n1
nvme10   3DE4D285C21A7C001.1  nvme_card     00000000  pcie    0000:3d:00.1        nvme-subsys10  nvme10n1
nvme11   3DE4D285C2027C000.0  nvme_card     00000000  pcie    0000:3e:00.0        nvme-subsys11  nvme11n1
nvme12   3DE4D285C2027C000.1  nvme_card     00000000  pcie    0000:3e:00.1        nvme-subsys12  nvme12n1
nvme2    3DE4D285C2368C001.0  nvme_card     00000000  pcie    0000:b7:00.0        nvme-subsys2   nvme2n1
nvme3    3DE4D285C22A74001.0  nvme_card     00000000  pcie    0000:86:00.0        nvme-subsys3   nvme3n1
nvme4    3DE4D285C22A74001.1  nvme_card     00000000  pcie    0000:86:00.1        nvme-subsys4   nvme4n1
nvme5    3DE4D285C2368C001.1  nvme_card     00000000  pcie    0000:b7:00.1        nvme-subsys5   nvme5n1
nvme6    3DE4D285C21274000.0  nvme_card     00000000  pcie    0000:87:00.0        nvme-subsys6   nvme6n1
nvme7    3DE4D285C21094000.0  nvme_card     00000000  pcie    0000:b8:00.0        nvme-subsys7   nvme7n1
nvme8    3DE4D285C21274000.1  nvme_card     00000000  pcie    0000:87:00.1        nvme-subsys8   nvme8n1
nvme9    3DE4D285C21094000.1  nvme_card     00000000  pcie    0000:b8:00.1        nvme-subsys9   nvme9n1
...
For the output above, drives nvme1n1 and nvme10n1 are under the same
bridge, and judging by the SN it appears to be a dual port NVMe.
- Under the same bridge
Run fio against nvme1n1 and nvme10n1; observe ~4897 MiB/s after a short
initial spike at ~6150 MiB/s.
# sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8
--time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting
--new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2
--filename=/dev/nvme10n1
...
Jobs: 16 (f=16): [r(16)][100.0%][r=4897MiB/s][r=1254k IOPS][eta 00m:00s]
...
- Under different bridges
Run fio against nvme1n1 and nvme11n1; observe:
# sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8
--time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting
--new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2
--filename=/dev/nvme11n1
...
Jobs: 16 (f=16): [r(16)][100.0%][r=6153MiB/s][r=1575k IOPS][eta 00m:00s]
...
** So far, we haven't been able to reproduce it on another machine, but
we suspect it will be reproducible on any machine with a dual port NVMe.
[Other]
In spreadsheet [2], there is some profiling data for different kernel
versions, showing a consistent performance difference between kernel
versions.
Offending commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=129dab6e1286525fe5baed860d3dfcd9c6b4b327
The issue was reported upstream [3].
[1]
https://cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types
[2]
https://docs.google.com/spreadsheets/d/19F0Vvgz0ztFpDX4E37E_o8JYrJ04iYJz-1cqU-j4Umk/edit?gid=1544333169#gid=1544333169
[3]
https://lore.kernel.org/regressions/[email protected]/
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2115738
Title:
I/O performance regression on NVMes under same bridge (dual port nvme)
Status in linux package in Ubuntu:
In Progress
Status in linux source package in Oracular:
Won't Fix
Status in linux source package in Plucky:
In Progress
Status in linux source package in Questing:
In Progress
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738/+subscriptions
--
Mailing list: https://launchpad.net/~kernel-packages
Post to : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp