Still trying to track this issue down. It's not just one partition;
often the entire disk IO locks up with processes stuck. The CF comes up
as ada0 and I don't see any commits that have touched that. The box is a
single GEODE CPU, but I tried both SMP and UP kernels and it still seems
to happen. If I play with rtprio on some processes, that *seems* to
trigger the issue more often. I did try a RELENG_14 image on a couple
of test boxes and so far those seem to have survived the weekend without
lockups. It doesn't seem to be memory pressure, as available RAM holds
steady from bootup to lockup.
---Mike
On 1/16/2024 9:48 AM, mike tancsa wrote:
Not sure exactly where to start, but I noticed this recently on an
i386 nanobsd image running on old PC Engines Alix devices that had
been rock solid for years. We have a few dozen in the field running
RELENG_13 from August, and they have been very stable on STABLE over
the years. However, somewhere between Aug 2023 and now I started getting
some lockups that are difficult to diagnose as the devices are
remote. I did manage to find one odd thing on a local test unit, where
a remount of a backup partition is hung.
# ps -auxwwwwp 3443
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 3443 3.3 0.9 4708 2320 - D< 20:18 34:55.20 /sbin/mount -ur /dev/ada0s4 /logs
I don't have truss on the box to attach to the process, and ktrace
doesn't seem to show anything either. Does this sort of hang ring a
bell for anyone? Looking back at the git logs, a coarse search for
anything to do with mount doesn't come up with much (2 below). Also,
a new version of clang has come in since then, so I am not quite sure
where to start.
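On the truss/ktrace point: a minimal sketch of something that needs neither, assuming the image still has ps and awk. In the ps output above, D in the STAT column marks uninterruptible (disk) wait and < marks raised priority. The helper name dstate_filter is made up for illustration; it only keeps the ps lines whose state starts with D, so the wait channel of every stuck process can be recorded.

```shell
#!/bin/sh
# Hypothetical helper: print processes in uninterruptible (disk) wait.
# Reads `ps` output on stdin with columns PID STAT WCHAN COMMAND
# and skips the header line.
dstate_filter() {
    awk 'NR > 1 && $2 ~ /^D/ { print }'
}

# Intended use on the box (pid, state, wchan, command are ps(1) keywords):
#   ps -axo pid,state,wchan,command | dstate_filter
```

If procstat(1) happens to be on the image, `procstat -kk 3443` should additionally print the kernel stack of the stuck mount, which is usually more telling than the wait channel alone.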
Any guidance appreciated. Testing is difficult as the hang doesn't
always happen -- sometimes within a day, sometimes 5 days. ssh is
usually borked, as are some other processes. I have a scaled-down
telegraf agent collecting some basic stats, and the CPU is pegged at
100%. These are single core devices, so I am not sure what is pegging
the CPU. RAM still shows some available, so it doesn't seem to be
memory pressure.
commit 71fceff2480999b3fc921f47ec9adea9eff32041
Author: Andrew Gierth <[email protected]>
Date: Sun Dec 24 14:04:21 2023 +0200
vfs_domount_update(): correct fsidcmp() usage
(cherry picked from commit 2a1d50fc12f6e604da834fbaea961d412aae6e85)
and
commit 608ccfc29fb48d8edc59a97382936790c02d27f3
Author: Konstantin Belousov <[email protected]>
Date: Thu Nov 9 22:18:47 2023 +0200
vfs_domount_update(): ensure that 'goto end' works
PR: 274992
(cherry picked from commit ede4c412b3ea9289ef42c664b01b6b5ff7eac434)
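The coarse search that turned up the two commits above could be narrowed by path, roughly as sketched here. The function name mount_commits and the choice of paths are assumptions for illustration (vfs_domount_update() lives in sys/kern/vfs_mount.c), and the tree is assumed to be checked out on the stable branch in question.

```shell
#!/bin/sh
# Hypothetical helper: list commits since a given date that touch the
# mount code paths. $1 = path to the src checkout, $2 = start date.
# The two paths below are an assumed starting point, not a complete list.
mount_commits() {
    git -C "$1" log --oneline --since="$2" -- \
        sys/kern/vfs_mount.c sbin/mount
}

# Intended use:
#   mount_commits /usr/src 2023-08-01
```

Widening the path list (for example to the ATA/CAM or UFS directories) is the obvious next step if nothing under mount looks suspicious.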