Title: raid10: Block discard is very slow, causing severe delays for
mkfs and fstrim operations

Status in linux package in Ubuntu: In Progress
Status in linux source package in Bionic: In Progress
Status in linux source package in Focal: In Progress
Status in linux source package in Groovy: In Progress

Bug description:

BugLink: https://bugs.launchpad.net/bugs/1896578

[Impact]

Block discard is very slow on Raid10, which causes common use cases
which invoke block discard, such as mkfs and fstrim operations, to
take a very long time.

For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe
devices which support block discard, a mkfs.xfs operation on Raid10
takes between 8 and 11 minutes, while the same mkfs.xfs operation on
Raid0 takes 4 seconds. The bigger the devices, the longer it takes.

The cause is that Raid10 currently uses a 512k chunk size, and uses
this for the discard_max_bytes value. If we need to discard 1.9TB, the
kernel splits the request into millions of 512k bio requests, even if
the underlying device supports larger requests.
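To put that into numbers, using the figures above: with
discard_max_bytes at 512k (524288 bytes), a single 1.9TB discard is
split into roughly 3.6 million bios, as a quick shell calculation
shows:

$ echo $((1900000000000 / 524288))
3623962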
For example, the NVMe devices on i3.8xlarge support 2.2TB of discard
at once:

$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

The Raid10 md device, however, only supports 512k:

$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288

If we perform a mkfs.xfs operation on the /dev/md array, it takes over
11 minutes, and if we examine the stack, it is stuck in
blkdev_issue_discard():

$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>] blkdev_issue_discard+0x5d/0xc0
[<0>] blk_ioctl_discard+0xc4/0x110
[<0>] blkdev_common_ioctl+0x56c/0x840
[<0>] blkdev_ioctl+0xeb/0x270
[<0>] block_ioctl+0x3d/0x50
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x38/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
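For anyone reproducing this, the PID of the blocked process (1626
above) can be found with pgrep while the mkfs.xfs command is still
running, for example:

$ pgrep mkfs.xfs
1626
$ sudo cat /proc/1626/stack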
[Fix]

Xiao Ni has developed a patchset which resolves the block discard
performance problems. These commits have now landed in 5.10-rc1.

commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
Author: Xiao Ni <x...@redhat.com>
Date: Tue Aug 25 13:42:59 2020 +0800
Subject: md: add md_submit_discard_bio() for submitting discard bio
Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0

commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
Author: Xiao Ni <x...@redhat.com>
Date: Tue Aug 25 13:43:00 2020 +0800
Subject: md/raid10: extend r10bio devs to raid disks
Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3

commit f046f5d0d79cdb968f219ce249e497fd1accf484
Author: Xiao Ni <x...@redhat.com>
Date: Tue Aug 25 13:43:01 2020 +0800
Subject: md/raid10: pull codes that wait for blocked dev into one function
Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484

commit bcc90d280465ebd51ab8688be86e1f00c62dccf9
Author: Xiao Ni <x...@redhat.com>
Date: Wed Sep 2 20:00:22 2020 +0800
Subject: md/raid10: improve raid10 discard request
Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9

commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359
Author: Xiao Ni <x...@redhat.com>
Date: Wed Sep 2 20:00:23 2020 +0800
Subject: md/raid10: improve discard request for far layout
Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359

There are also two additional commits which are required; these were
merged after "md/raid10: improve raid10 discard request". The
following commits enable Raid10 to use large discards, instead of
splitting into many bios, since the technical hurdles have now been
removed.

commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512
Author: Mike Snitzer <snit...@redhat.com>
Date: Thu Sep 24 13:14:52 2020 -0400
Subject: dm raid: fix discard limits for raid1 and raid10
Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512

commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28
Author: Mike Snitzer <snit...@redhat.com>
Date: Thu Sep 24 16:40:12 2020 -0400
Subject: dm raid: remove unnecessary discard limits for raid10
Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28

All the commits mentioned follow a strategy similar to the one
implemented for Raid0 in the commit below, which was merged in 4.12-rc2
and fixed the same block discard performance issue for Raid0:

commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
Author: Shaohua Li <s...@fb.com>
Date: Sun May 7 17:36:24 2017 -0700
Subject: md/md0: optimize raid0 discard handling
Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0

[Testcase]

You will need a machine with at least 4x NVMe drives which support
block discard. I used an i3.8xlarge instance on AWS, since it has all
of these things.
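To confirm that a drive supports block discard before starting, lsblk
can print the discard limits; non-zero DISC-GRAN and DISC-MAX values
mean discard is supported. The output below is illustrative:

$ lsblk -D /dev/nvme0n1
NAME    DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
nvme0n1        0      512B       2T         0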
$ lsblk
xvda     202:0    0    8G  0 disk
└─xvda1  202:1    0    8G  0 part /
nvme0n1  259:2    0  1.7T  0 disk
nvme1n1  259:0    0  1.7T  0 disk
nvme2n1  259:1    0  1.7T  0 disk
nvme3n1  259:3    0  1.7T  0 disk

Create a Raid10 array:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
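Optionally, confirm the array was assembled correctly with
/proc/mdstat; the initial resync runs in the background and does not
prevent running the test. The sizes below are elided since they vary:

$ cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 nvme3n1[3] nvme2n1[2] nvme1n1[1] nvme0n1[0]
      ... blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]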
Format the array with XFS:

$ time sudo mkfs.xfs /dev/md0
real 11m14.734s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk

Optionally, do an fstrim:

$ time sudo fstrim /mnt/disk
real 11m37.643s

I built a test kernel based on 5.9-rc6 with the above patches, and we
can see that performance dramatically improves:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
$ time sudo mkfs.xfs /dev/md0
real 0m4.226s
user 0m0.020s
sys 0m0.148s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk
real 0m1.991s
user 0m0.020s
sys 0m0.000s

The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and
fstrim from 11 minutes down to 2 seconds. The test kernel also raises
discard_max_bytes to the underlying hardware limit:

$ cat /sys/block/md0/queue/discard_max_bytes
2199023255040

[Regression Potential]

If a regression were to occur, it would affect operations which
trigger block discard, such as mkfs and fstrim, on Raid10 only. Other
Raid levels would not be affected, although I should note there is a
small risk of regression to Raid0, due to one of its functions being
re-factored and split out for use in both Raid0 and Raid10.

The changes only affect block discard, so only Raid10 arrays backed by
SSD or NVMe devices which support block discard will be affected.
Traditional hard disks, or SSD devices which do not support block
discard, would not be affected.

If a regression were to occur, users could work around the issue by
running "mkfs.xfs -K <device>", which skips block discard entirely.
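For example, using the array device from the testcase above:

$ sudo mkfs.xfs -K /dev/md0

The -K option tells mkfs.xfs not to attempt to discard blocks at mkfs
time; fstrim has no equivalent option, but it is an optional
maintenance operation and can simply be skipped.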