Hi Bhaskara,
Can you provide some more information? You mentioned you were on focal with
arm64, but can you also let us know:
* What kernel version you see the problem on?
* What filesystem you are using (ext4, xfs, btrfs, ...)?
* Are you resizing during boot, or later in the system lifecycle?
* Can you reproduce the issue consistently? I recall this bug triggering once every
10,000 to 20,000 boots or so. It was a very rare race condition that only really
shows up at massive scale. How many boots are you doing? I recall that Krister,
in this bug, was starting 100,000+ to 120,000+ VMs a day and was seeing it
4-5 times a day.
* Why now? This has been out for a whole year, and the only other issue we had
was with three swiotlb SAUCE patches on -azure kernels that were broken by design,
where this particular e2fsprogs SRU triggered the bug. We solved that issue by
reverting those patches:
  linux-azure-5.15 (5.15.0-1080.89~20.04.1) focal; urgency=medium
    * Miscellaneous upstream changes
      - Revert "UBUNTU: SAUCE: swiotlb: Split up single swiotlb lock"
      - Revert "UBUNTU: SAUCE: swiotlb: allocate memory in a cache-friendly way"
      - Revert "UBUNTU: SAUCE: swiotlb: use bitmap to track free slots"
   -- John Cabaj <[email protected]> Mon, 27 Jan 2025 12:07:25 -0600
More details: https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/2096813
Are you on -azure? Can you install the latest focal kernel and retest?
* Does it also reproduce on jammy? Noble? Questing?
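As a rough sketch, most of the details asked for above could be collected in one go with something like the following. The function name collect_resize_info is made up for illustration, and the sketch assumes findmnt (util-linux) and dpkg-query are installed; any tool that is missing is simply skipped.

```shell
# collect_resize_info is a hypothetical helper for this thread, not an
# existing tool. It prints the kernel, architecture, release codename,
# filesystem type of the given mount point, and the e2fsprogs version.
collect_resize_info() {
    local target=${1:-/}
    echo "kernel: $(uname -r)"
    echo "arch: $(uname -m)"
    if [ -r /etc/os-release ]; then
        # VERSION_CODENAME is set on Ubuntu (e.g. focal, jammy).
        echo "release: $(. /etc/os-release; echo "${VERSION_CODENAME:-unknown}")"
    fi
    if command -v findmnt >/dev/null 2>&1; then
        echo "fstype: $(findmnt -n -o FSTYPE --target "$target" 2>/dev/null)"
    fi
    if command -v dpkg-query >/dev/null 2>&1; then
        echo "e2fsprogs: $(dpkg-query -W -f='${Version}' e2fsprogs 2>/dev/null)"
    fi
}
```

Calling `collect_resize_info /` and pasting the output into the bug would cover the kernel and filesystem questions. Whether the resize happens at boot and how often it fails still need answering by hand.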
If you have access to the Support Portal, or know someone who does, please file
a Support Case and I will dive into this much deeper.
Thanks,
Matthew
--
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to e2fsprogs in Ubuntu.
https://bugs.launchpad.net/bugs/2036467
Title:
Resizing cloud-images occasionally fails due to superblock checksum
mismatch in resize2fs
Status in cloud-images:
New
Status in e2fsprogs package in Ubuntu:
Fix Released
Status in e2fsprogs source package in Trusty:
Won't Fix
Status in e2fsprogs source package in Xenial:
Won't Fix
Status in e2fsprogs source package in Bionic:
Won't Fix
Status in e2fsprogs source package in Focal:
Fix Released
Status in e2fsprogs source package in Jammy:
Fix Released
Status in e2fsprogs source package in Lunar:
Won't Fix
Status in e2fsprogs source package in Mantic:
Won't Fix
Status in e2fsprogs source package in Noble:
Fix Released
Status in e2fsprogs source package in Oracular:
Fix Released
Bug description:
[Impact]
This is a long-running bug plaguing cloud-images, where on rare
occasions resize2fs would fail and the image would not resize to fit
the entire disk.
Online resizes would fail due to a superblock checksum mismatch, where
the superblock in memory differs from what is currently on disk due to
changes made to the image.
$ resize2fs /dev/nvme1n1p1
resize2fs 1.47.0 (5-Feb-2023)
resize2fs: Superblock checksum does not match superblock while trying to open
/dev/nvme1n1p1
Couldn't find valid filesystem superblock.
Changing the read of the superblock to Direct I/O solves the issue.
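To illustrate the difference between the two kinds of read, here is a minimal sketch that fetches the superblock magic from a device either through the page cache or with Direct I/O. It is not how resize2fs implements the fix. It only shows what a buffered versus an O_DIRECT read of the same block looks like from a shell. The function name read_sb_magic is hypothetical, and the sketch assumes GNU dd (for iflag=direct) and the standard ext4 layout, with the primary superblock at byte offset 1024 and the 16-bit magic 0xEF53 at offset 0x38 inside it.

```shell
# read_sb_magic is a hypothetical helper, not part of e2fsprogs.
# Usage: read_sb_magic DEVICE_OR_IMAGE [buffered|direct]
# Prints the two magic bytes as hex, in on-disk (little-endian) order.
read_sb_magic() {
    local dev=$1
    local mode=${2:-buffered}
    # Read one aligned 4 KiB block; O_DIRECT requires aligned I/O, so the
    # superblock is not read at its byte offset directly.
    if [ "$mode" = direct ]; then
        dd if="$dev" bs=4096 count=1 iflag=direct 2>/dev/null
    else
        dd if="$dev" bs=4096 count=1 2>/dev/null
    fi |
        # The magic sits at 1024 + 0x38 = 1080 bytes into the block.
        od -A n -t x1 -j 1080 -N 2 | tr -d ' \n'
}
```

On an ext4 filesystem both modes should print 53ef. Reading a real block device such as /dev/nvme1n1p1 needs root, and the direct mode may not work on every backing store.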
[Testcase]
Start a c5.large instance on AWS, and attach a 60 GB gp3 volume for
use as a scratch disk.
Run the following script, courtesy of Krister Johansen and his team:
#!/usr/bin/bash
set -euxo pipefail
while true
do
    parted /dev/nvme1n1 mklabel gpt mkpart primary 2048s 2099200s
    sleep .5
    mkfs.ext4 /dev/nvme1n1p1
    mount -t ext4 /dev/nvme1n1p1 /mnt
    stress-ng --temp-path /mnt -D 4 &
    STRESS_PID=$!
    sleep 1
    growpart /dev/nvme1n1 1
    resize2fs /dev/nvme1n1p1
    kill $STRESS_PID
    wait $STRESS_PID
    umount /mnt
    wipefs -a /dev/nvme1n1p1
    wipefs -a /dev/nvme1n1
done
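Because the script above runs with set -e, it stops at the first resize2fs failure but does not say how many iterations it took. As a sketch only, a small counter wrapper could put a number on how rare the failure is. The helper name run_until_failure is hypothetical and not part of the original test case. It takes a maximum iteration count followed by the command to repeat.

```shell
# run_until_failure is a hypothetical helper, not part of the original script.
# Usage: run_until_failure MAX_ITERATIONS command [args...]
# Runs the command repeatedly and reports the first iteration that fails.
run_until_failure() {
    local max=$1
    shift
    local i=0
    while [ "$i" -lt "$max" ]; do
        i=$((i + 1))
        if ! "$@"; then
            echo "failed on iteration $i"
            return 1
        fi
    done
    echo "no failure in $max iterations"
    return 0
}
```

For example, if the body of the loop above were moved into a function or script of its own, `run_until_failure 50000 ./one-iteration.sh` would report the first failing pass. The script name here is a placeholder.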
Test packages are available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/lp2036467-test
If you install the test packages, the race no longer occurs.
[Where problems could occur]
We are changing how resize2fs reads the superblock from underlying
disks.
If a regression were to occur, resize2fs could fail to resize offline
or online volumes. As all cloud-images are online resized during their
initial boot, this could have a large impact on public and private
clouds should a regression occur.
[Other info]
Upstream mailing list discussion:
https://lore.kernel.org/linux-ext4/[email protected]/
https://lore.kernel.org/linux-ext4/[email protected]/
This was fixed in the below commit upstream:
commit 43a498e938887956f393b5e45ea6ac79cc5f4b84
Author: Theodore Ts'o <[email protected]>
Date: Thu, 15 Jun 2023 00:17:01 -0400
Subject: resize2fs: use Direct I/O when reading the superblock for
online resizes
Link:
https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=43a498e938887956f393b5e45ea6ac79cc5f4b84
The commit has not been tagged to any release. All supported Ubuntu
releases require this fix, and it needs to be published in the standard
non-ESM archives to be picked up in cloud images.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2036467/+subscriptions