https://bugs.launchpad.net/bugs/2115738

Title:
  I/O performance regression on NVMes under same bridge (dual port nvme)

Status in linux package in Ubuntu:
  New
Status in linux source package in Oracular:
  New
Status in linux source package in Plucky:
  New
Status in linux source package in Questing:
  New

Bug description:
  [Description]
  A performance regression has been reported when running fio against two
  NVMe devices under the same PCI bridge (dual-port NVMe).
  The issue was initially reported for the 6.11 HWE kernel for Noble.
  The performance regression was introduced in the 6.10 upstream kernel
  and is still present in 6.16 (built at commit e540341508ce2f6e27810106253d5).
  Bisection pointed to commit 129dab6e1286 ("iommu/vt-d: Use
  cache_tag_flush_range_np() in iotlb_sync_map").
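
  For reference (not part of the original report), one quick way to confirm
  that Intel VT-d DMA translation is actually active on the affected
  machine, since the bisected commit is in the intel-iommu driver (sysfs
  paths may vary slightly with kernel version):

  # Kernel messages from DMAR/IOMMU initialisation
  sudo dmesg | grep -i -e DMAR -e iommu | head
  # Boot parameters (look for intel_iommu= / iommu= options)
  cat /proc/cmdline
  # Default domain type of each IOMMU group (DMA, DMA-FQ or identity)
  cat /sys/kernel/iommu_groups/*/type | sort | uniq -c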

  In our tests we observe ~6150 MiB/s when the NVMe devices are on
  different bridges and ~4985 MiB/s when they are under the same bridge.

  Before the offending commit we observe ~6150 MiB/s, regardless of NVMe
  device placement.

  [Test Case]

  We can reproduce the issue on GCP on the Z3 metal instance type
  (z3-highmem-192-highlssd-metal) [1].
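
  For illustration only, an instance of this machine type can be created
  roughly as follows (instance name, zone and image family below are
  placeholders; Z3 metal shapes are only available in selected zones):

  gcloud compute instances create z3-metal-repro \
      --zone=us-central1-a \
      --machine-type=z3-highmem-192-highlssd-metal \
      --image-family=ubuntu-2404-lts-amd64 \
      --image-project=ubuntu-os-cloud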

  You need to have two NVMe devices under the same bridge, e.g.:

  # nvme list -v
  ...
  Device   SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces
  -------- -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
  nvme0    nvme_card-pd         nvme_card-pd                             (null)   pcie   0000:05:00.1          nvme-subsys0 nvme0n1
  nvme1    3DE4D285C21A7C001.0  nvme_card                                00000000 pcie   0000:3d:00.0          nvme-subsys1 nvme1n1
  nvme10   3DE4D285C21A7C001.1  nvme_card                                00000000 pcie   0000:3d:00.1          nvme-subsys10 nvme10n1
  nvme11   3DE4D285C2027C000.0  nvme_card                                00000000 pcie   0000:3e:00.0          nvme-subsys11 nvme11n1
  nvme12   3DE4D285C2027C000.1  nvme_card                                00000000 pcie   0000:3e:00.1          nvme-subsys12 nvme12n1
  nvme2    3DE4D285C2368C001.0  nvme_card                                00000000 pcie   0000:b7:00.0          nvme-subsys2 nvme2n1
  nvme3    3DE4D285C22A74001.0  nvme_card                                00000000 pcie   0000:86:00.0          nvme-subsys3 nvme3n1
  nvme4    3DE4D285C22A74001.1  nvme_card                                00000000 pcie   0000:86:00.1          nvme-subsys4 nvme4n1
  nvme5    3DE4D285C2368C001.1  nvme_card                                00000000 pcie   0000:b7:00.1          nvme-subsys5 nvme5n1
  nvme6    3DE4D285C21274000.0  nvme_card                                00000000 pcie   0000:87:00.0          nvme-subsys6 nvme6n1
  nvme7    3DE4D285C21094000.0  nvme_card                                00000000 pcie   0000:b8:00.0          nvme-subsys7 nvme7n1
  nvme8    3DE4D285C21274000.1  nvme_card                                00000000 pcie   0000:87:00.1          nvme-subsys8 nvme8n1
  nvme9    3DE4D285C21094000.1  nvme_card                                00000000 pcie   0000:b8:00.1          nvme-subsys9 nvme9n1

  ...

  In the output above, drives nvme1n1 and nvme10n1 are under the same
  bridge, and judging by the SN they appear to be the two ports of a
  dual-port NVMe.
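
  One way to double-check that two NVMe controllers really share an
  upstream bridge (controller names below follow the listing above) is to
  compare the parent elements of their PCI sysfs paths:

  # Both paths should show the same parent bridge element immediately
  # before the endpoint functions 0000:3d:00.0 and 0000:3d:00.1
  readlink -f /sys/class/nvme/nvme1/device
  readlink -f /sys/class/nvme/nvme10/device
  # Alternatively, inspect the PCI topology as a tree
  lspci -tv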

  - Under the same bridge
  Run fio against nvme1n1 and nvme10n1 and observe ~4897 MiB/s after a
  short initial spike at ~6150 MiB/s.

  # sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8 --time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting --new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2 --filename=/dev/nvme10n1
  ...
  Jobs: 16 (f=16): [r(16)][100.0%][r=4897MiB/s][r=1254k IOPS][eta 00m:00s]
  ...

  - Under different bridges
  Run fio against nvme1n1 and nvme11n1 and observe ~6153 MiB/s:

  # sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8 --time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting --new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2 --filename=/dev/nvme11n1
  ...
  Jobs: 16 (f=16): [r(16)][100.0%][r=6153MiB/s][r=1575k IOPS][eta 00m:00s]
  ...

  ** So far, we haven't been able to reproduce it on another machine, but
  we suspect it will be reproducible on any machine with a dual-port
  NVMe.

  [Other]

  Spreadsheet [2] contains profiling data for different kernel versions,
  showing a consistent performance difference between them.
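
  As a rough way to collect comparable data (not necessarily the method
  used for [2]), a system-wide profile can be taken while the same-bridge
  fio job is running and then searched for the VT-d cache-tag / IOTLB
  flush paths touched by the offending commit:

  # Sample all CPUs with call graphs for 30 seconds while fio is running
  sudo perf record -a -g -- sleep 30
  # Look for time spent in the intel-iommu flush code
  sudo perf report --stdio | grep -i -e cache_tag -e iotlb | head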

  Offending commit:
  
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=129dab6e1286525fe5baed860d3dfcd9c6b4b327
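
  To check whether a given kernel tree already carries the offending
  commit (assuming a local clone of the upstream repository):

  # Exits 0 if the commit is an ancestor of the currently checked-out HEAD
  git merge-base --is-ancestor 129dab6e1286525fe5baed860d3dfcd9c6b4b327 HEAD \
      && echo "offending commit present" || echo "not present"
  # Release tags that already contain it
  git tag --contains 129dab6e1286525fe5baed860d3dfcd9c6b4b327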

  [1] https://cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types
  [2] https://docs.google.com/spreadsheets/d/19F0Vvgz0ztFpDX4E37E_o8JYrJ04iYJz-1cqU-j4Umk/edit?gid=1544333169#gid=1544333169
