On 2025-05-05 22:36:07, Salvatore Bonaccorso wrote:
> Hi Antoine,
>
> On Mon, May 05, 2025 at 02:50:32PM -0400, Antoine Beaupré wrote:
>> On 2025-05-05 18:02:37, Salvatore Bonaccorso wrote:
>> > On Mon, May 05, 2025 at 04:00:31PM +0200, Salvatore Bonaccorso wrote:
>> >> Hi Moritz,
>> >>
>> >> On Mon, May 05, 2025 at 01:47:15PM +0200, Moritz Mühlenhoff wrote:
>> >> > On Wed, Apr 30, 2025 at 05:55:20PM +0200, Salvatore Bonaccorso wrote:
>> >> > > Hi
>> >> > >
>> >> > > We got a regression report in Debian after the update from 6.1.133 to
>> >> > > 6.1.135. Melvin is reporting that discard/trim through a RAID10 array
>> >> > > stalls indefinitely. The full report is inlined below and originates
>> >> > > from https://bugs.debian.org/1104460 .
>> >> >
>> >> > JFTR, we ran into the same problem with a few Wikimedia servers running
>> >> > 6.1.135 and RAID 10: The servers started to lock up once fstrim.service
>> >> > got started. Full oops messages are available at
>> >> > https://phabricator.wikimedia.org/P75746
>> >>
>> >> Thanks for these additional data points. Assuming you won't be able to
>> >> test the other stable series where the commit d05af90d6218
>> >> ("md/raid10: fix missing discard IO accounting") went in, might you at
>> >> least be able to test the 6.1.y branch with the commit reverted again
>> >> and manually trigger the issue?
>> >>
>> >> If needed I can provide a test Debian package of 6.1.135 (or 6.1.137)
>> >> with the patch reverted.
>> >
>> > So one additional data point as several Debian users were reporting
>> > back being affected: One user did upgrade to 6.12.25 (where the
>> > commit was backported as well) and is not able to reproduce the issue
>> > there.
>>
>> That would be me.
>>
>> I can reproduce the issue as outlined by Moritz above fairly reliably in
>> 6.1.135 (debian package 6.1.0-34-amd64). The reproducer is simple, on a
>> RAID-10 host:
>>
>> 1. reboot
>> 2. systemctl start fstrim.service
>>
>> We're tracking the issue internally in:
>>
>> https://gitlab.torproject.org/tpo/tpa/team/-/issues/42146
>>
>> I've managed to work around the issue by upgrading to the Debian package
>> from testing/unstable (6.12.25), as Salvatore indicated above. There,
>> fstrim doesn't cause any crash and completes successfully. In stable, it
>> just hangs there forever. The kernel doesn't completely panic and the
>> machine is otherwise somewhat still functional: my existing SSH
>> connection keeps working, for example, but new ones fail. And an `apt
>> install` of another kernel hangs forever.
>
> So likely at least in 6.1.y there are missing pre-requisites causing
> the behaviour.
>
> If you can test 6.1.135-1 with the commit
> 4a05f7ae33716d996c5ce56478a36a3ede1d76f2 reverted then you can fetch
> built packages at:
>
> https://people.debian.org/~carnil/tmp/linux/1104460/

I can confirm this kernel does not crash when running fstrim.service,
which seems to confirm the bisect.

A.

-- 
Drowning people
Sometimes die
Fighting their rescuers.
                        - Octavia Butler