On 08/19/2022 10:36 am, Rick Macklem wrote:
On 08/18/2022 9:49 am, Larry Rosenman wrote:
I didn't get all my mail on my bacula backups today (they backup to
NFS mounted TrueNAS).
Also a df hangs.
Here are procstat -kk's for all:
ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1)
❯ ps auxxxwww|grep bacula
bacula 2067 0.0 0.0 63188 13652 - Is 11:30
0:17.49 /usr/local/sbin/bacula-sd -u bacula -g bacula -v -c
/usr/local/etc/bacula/bacula-sd.conf
root 2072 0.0 0.0 59280 31276 - Is 11:30
0:00.31 /usr/local/sbin/bacula-fd -u root -g wheel -v -c
/usr/local/etc/bacula/bacula-fd.conf
bacula 2075 0.0 0.0 86992 19352 - Is 11:30
0:56.95 /usr/local/sbin/bacula-dir -u bacula -g bacula -v -c
/usr/local/etc/bacula/bacula-dir.conf
postgres 50241 0.0 0.1 285764 160244 - Is 23:05
0:00.38 postgres: bacula bacula [local] (postgres)
postgres 50244 0.0 0.1 298784 74448 - Ds 23:05
0:00.67 postgres: bacula bacula [local] (postgres)
ler 66595 0.0 0.0 12888 2600 3 S+ 09:46
0:00.00 grep --color=auto bacula
At the end, I'll list what options are needed for ps and all of its
output is needed. See the end of the email..
ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1)
❯ sudo procstat -kk 2067
PID TID COMM TDNAME KSTACK
2067 100742 bacula-sd - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
_cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
amd64_syscall+0x12e fast_syscall_common+0xf8
2067 101036 bacula-sd - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
2067 101038 bacula-sd - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
2067 124485 bacula-sd - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
fast_syscall_common+0xf8
ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1)
❯ sudo procstat -kk 2072
PID TID COMM TDNAME KSTACK
2072 100677 bacula-fd - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
_cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
amd64_syscall+0x12e fast_syscall_common+0xf8
2072 101039 bacula-fd - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
2072 101040 bacula-fd - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
2072 124490 bacula-fd - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
fast_syscall_common+0xf8
ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1)
❯ sudo procstat -kk 2075
PID TID COMM TDNAME KSTACK
2075 101007 bacula-dir - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
_sleep+0x29b umtxq_sleep+0x242 do_wait+0x26b __umtx_op_wait+0x53
sys__umtx_op+0x7e amd64_syscall+0x12e fast_syscall_common+0xf8
2075 101041 bacula-dir - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
2075 101045 bacula-dir - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
_cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
amd64_syscall+0x12e fast_syscall_common+0xf8
2075 101046 bacula-dir - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
2075 101047 bacula-dir - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
2075 124479 bacula-dir - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
_cv_wait_sig+0x137 kern_poll_kfds+0x48c kern_poll+0x9f sys_poll+0x50
amd64_syscall+0x12e fast_syscall_common+0xf8
2075 124480 bacula-dir - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
2075 124481 bacula-dir - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
2075 124489 bacula-dir - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
fast_syscall_common+0xf8
2075 124506 bacula-dir - mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1)
❯ sudo procstat -kk 66390
PID TID COMM TDNAME KSTACK
66390 101514 df - mi_switch+0x157
sleepq_switch+0x107 sleepq_timedwait+0x4b _sleep+0x28e
clnt_reconnect_call+0x809 newnfs_request+0xa95 nfscl_request+0x5a
nfsrpc_statfs+0x19d nfs_statfs+0x148 vfs_statfs_sigdefer+0x2e
kern_getfsstat+0x1f1 sys_getfsstat+0x22 amd64_syscall+0x12e
fast_syscall_common+0xf8
This says that the kernel RPC is trying to establish a new TCP
connection
to the server. Now why the old connection stopped working
(which should only happen if there is a network partitioning or
client/server crash), we do not know, from the limited info.
ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1)
❯
this was built yesterday:
❯ uname -a
FreeBSD borg.lerctr.org 14.0-CURRENT FreeBSD 14.0-CURRENT #142
ler/freebsd-main-changes-n257453-175a127a72f: Wed Aug 17 09:23:32 CDT
2022
r...@borg.lerctr.org:/usr/obj/usr/src/amd64.amd64/sys/LER-MINIMAL
amd64
ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1)
❯
What else do we need?
Typically, the output of the following commands (all of it, not just
some
processes/threads) can be useful:
ps axHlw
- The MWCHAN field is by far the most useful, since it indicates where
the process/thread is sleeping. "H" matters since there are nfsiod
threads
that do asynchronous I/O for readaheads/writebehinds and knowing
what
they are up to can be useful.
netstat -a
- To see what state the TCP connection(s) being used by the mount are
up to.
(Since it was trying to do a new connection, this one might have
been useful.)
nfsstat -m
- To show exactly what NFS mount options are being used.
--> For example, use of "intr" and/or "soft" for an NFSv4 mount is
guaranteed to break it. (See BUGS section of "man mount_nfs".)
vmstat -m
vmstat -z
- To see what kernel memory allocations look like. (Basically looking
for
a very large allocation that might indicate a leak or resource
exhaustion.)
procstat -kk
- To be honest, for me, this does not usually give me any useful
information
compared with "ps", but it might be useful.
The rest require going into the debugger (if compiled into your
kernel).
sysctl debug.kdb.enter=1
- type <ctrl><alt><escape>
then in the debugger, do "show alllocks" and the other commands
listed here:
https://docs.freebsd.org/doc/6.4-RELEASE/usr/share/doc/en/books/developers-handbook/kerneldebug-deadlocks.html
You can type "continue" to get out of the debugger.
Since it was trying to establish a new TCP connection, all I can say is
that
something broke the old TCP connection and that could suggest a
problem with your network fabric, such as a broken TSO implementation.
(NFS traffic is really good at generating weird sizes of TSO segments
that,
in turn, are very good at finding flaws in TSO implementations that
other
TCP traffic never finds.)
rick
MarkJ pointed out there was a kernel bug that I was probably hitting,
upgrading past that point fixed it.
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106