** Summary changed:

- superblock checksum mismatch in resize2fs
+ Resizing cloud-images occasionally fails due to superblock checksum mismatch in resize2fs

** Description changed:

- Hi,
- We run ext4 on EBS volumes on EC2. During provisioning, cloud-init will occasionally report that resize2fs has failed due to a superblock checksum mismatch. We debugged this internally, and were able to come up with the following reproducer:
+ [Impact]
+ 
+ This is a long-running bug plaguing cloud-images, where on a rare
+ occasion resize2fs would fail and the image would not resize to fit the
+ entire disk.
+ 
+ Online resizes would fail due to a superblock checksum mismatch, where
+ the superblock in memory differs from what is currently on disk due to
+ changes made to the image.
+ 
+ Changing the read of the superblock to Direct I/O solves the issue.
+ 
+ [Testcase]
+ 
+ Start a c5.large instance on AWS, and attach a 60gb gp3 volume for use
+ as a scratch disk.
+ 
+ Run the following script, courtesy of Krister Johansen and his team:
  
  #!/usr/bin/bash
  set -euxo pipefail
  
  while true
  do
          parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
          sleep .5
          mkfs.ext4 /dev/nvme1n1p1
          mount -t ext4 /dev/nvme1n1p1 /mnt
          stress-ng --temp-path /mnt -D 4 &
          STRESS_PID=$!
          sleep 1
          growpart /dev/nvme1n1 1
          resize2fs /dev/nvme1n1p1
          kill $STRESS_PID
          wait $STRESS_PID
          umount /mnt
          wipefs -a /dev/nvme1n1p1
          wipefs -a /dev/nvme1n1
  done
  
- (This was on a 60gb gp3 volume attached to a c5.4xlarge)
+ Test packages are available in the following PPA:
  
- We were able to find a fix that works and get the patch accepted
- upstream. The short explanation is that by switching the superblock
- read to direct io, we no longer see the problem.
+ https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test
  
- The patch is available here, but hasn't been published in a released
- version of e2fsprogs:
+ If you install the test packages, the race no longer occurs.
  
- https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84
+ [Where problems could occur]
  
- A longer thread with the maintainer is available here:
+ We are changing how resize2fs reads the superblock from underlying
+ disks.
+ If a regression were to occur, resize2fs could fail to resize offline or
+ online volumes. As all cloud-images are online resized during their
+ initial boot, this could have a large impact on public and private
+ clouds should a regression occur.
+ 
+ [Other info]
+ 
+ Upstream mailing list discussion:
+ https://lore.kernel.org/linux-ext4/20230605225221.ga5...@templeofstupid.com/
  https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/
  
- This bug report is to request that Ubuntu backport this patch to the
- versions of e2fsprogs that are in releases that are available in images
- on AWS, preferably Focal and Jammy.
+ This was fixed in the below commit upstream:
+ 
+ commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
+ Author: Theodore Ts'o <ty...@mit.edu>
+ Date: Thu, 15 Jun 2023 00:17:01 -0400
+ Subject: resize2fs: use Direct I/O when reading the superblock for
+ online resizes
+ Link: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84
+ 
+ The commit has not been tagged to any release. All supported Ubuntu
+ releases require this fix, and it needs to be published in the standard
+ non-ESM archives to be picked up in cloud images.

** Tags added: sts

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467

Title:
  Resizing cloud-images occasionally fails due to superblock checksum
  mismatch in resize2fs

Status in cloud-images:
  New
Status in e2fsprogs package in Ubuntu:
  In Progress
Status in e2fsprogs source package in Trusty:
  In Progress
Status in e2fsprogs source package in Xenial:
  In Progress
Status in e2fsprogs source package in Bionic:
  In Progress
Status in e2fsprogs source package in Focal:
  In Progress
Status in e2fsprogs source package in Jammy:
  In Progress
Status in e2fsprogs source package in Lunar:
  In Progress
Status in e2fsprogs source package in Mantic:
  In Progress

Bug description:
  [Impact]

  This is a long-running bug plaguing cloud-images, where on a rare
  occasion resize2fs would fail and the image would not resize to fit
  the entire disk.

  Online resizes would fail due to a superblock checksum mismatch, where
  the superblock in memory differs from what is currently on disk due to
  changes made to the image.

  Changing the read of the superblock to Direct I/O solves the issue.

  [Testcase]

  Start a c5.large instance on AWS, and attach a 60gb gp3 volume for use
  as a scratch disk.

  Run the following script, courtesy of Krister Johansen and his team:

  #!/usr/bin/bash
  set -euxo pipefail

  while true
  do
          parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
          sleep .5
          mkfs.ext4 /dev/nvme1n1p1
          mount -t ext4 /dev/nvme1n1p1 /mnt
          stress-ng --temp-path /mnt -D 4 &
          STRESS_PID=$!
          sleep 1
          growpart /dev/nvme1n1 1
          resize2fs /dev/nvme1n1p1
          kill $STRESS_PID
          wait $STRESS_PID
          umount /mnt
          wipefs -a /dev/nvme1n1p1
          wipefs -a /dev/nvme1n1
  done

  Test packages are available in the following PPA:

  https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test

  If you install the test packages, the race no longer occurs.

  [Where problems could occur]

  We are changing how resize2fs reads the superblock from underlying
  disks.

  If a regression were to occur, resize2fs could fail to resize offline or
  online volumes.
  As all cloud-images are online resized during their initial boot, this
  could have a large impact on public and private clouds should a
  regression occur.

  [Other info]

  Upstream mailing list discussion:
  https://lore.kernel.org/linux-ext4/20230605225221.ga5...@templeofstupid.com/
  https://lore.kernel.org/linux-ext4/20230609042239.ga1436...@mit.edu/

  This was fixed in the below commit upstream:

  commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
  Author: Theodore Ts'o <ty...@mit.edu>
  Date: Thu, 15 Jun 2023 00:17:01 -0400
  Subject: resize2fs: use Direct I/O when reading the superblock for online resizes
  Link: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84

  The commit has not been tagged to any release. All supported Ubuntu
  releases require this fix, and it needs to be published in the standard
  non-ESM archives to be picked up in cloud images.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions

-- 
Mailing list: https://launchpad.net/~touch-packages
Post to     : touch-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~touch-packages
More help   : https://help.launchpad.net/ListHelp
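
Illustrative note on the [Impact] section: the description attributes the fix to reading the superblock with Direct I/O rather than through the page cache. The sketch below is not taken from the resize2fs patch and does not validate the superblock checksum. It only shows, under stated assumptions, how the two kinds of read can be compared from a shell. It assumes GNU coreutils, a logical block size of at most 4096 bytes, and the standard ext2/3/4 layout (primary superblock of 1024 bytes at byte offset 1024, s_magic at offset 56 within it). The function names are hypothetical.

```shell
#!/usr/bin/bash
# Sketch only: helper names are hypothetical and not part of e2fsprogs.

# read_superblock SOURCE [direct]
# Writes the 1024-byte primary superblock of SOURCE to stdout.
# With "direct" as the second argument, dd opens SOURCE with O_DIRECT
# (iflag=direct), bypassing the page cache. The read is a full 4096-byte
# block so that it stays aligned for Direct I/O; bytes 1024..2047 of that
# block are then extracted.
read_superblock() {
    local src=$1 mode=${2:-buffered}
    if [ "$mode" = "direct" ]; then
        dd if="$src" bs=4096 count=1 iflag=direct 2>/dev/null
    else
        dd if="$src" bs=4096 count=1 2>/dev/null
    fi | tail -c +1025 | head -c 1024
}

# superblock_magic SOURCE [direct]
# Prints the little-endian s_magic field as hex (ef53 for ext2/3/4).
superblock_magic() {
    read_superblock "$@" | od -An -tx1 -j56 -N2 | awk '{print $2 $1}'
}

# superblock_views_match SOURCE
# Returns 0 if the buffered and Direct I/O reads yield identical bytes.
superblock_views_match() {
    local buffered direct
    buffered=$(read_superblock "$1" | cksum)
    direct=$(read_superblock "$1" direct | cksum)
    [ "$buffered" = "$direct" ]
}
```

On the scratch volume from the testcase, something like `superblock_views_match /dev/nvme1n1p1 || echo "views differ"` could be run while stress-ng is active. Whether a difference is observed at any given moment depends on timing, so this is a diagnostic aid rather than a reproducer. Direct I/O is not supported on every filesystem (tmpfs, for example), so on plain image files the buffered mode may be the only one that works.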