On 08/19/2022 10:36 am, Rick Macklem wrote:
On 08/18/2022 9:49 am, Larry Rosenman wrote:
I didn't get all my mail on my bacula backups today (they backup to
NFS mounted TrueNAS).
Also a df hangs.

Here are procstat -kk's for all:
ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
❯ ps auxxxwww|grep bacula
bacula       2067    0.0  0.0  63188  13652  -  Is   11:30
0:17.49 /usr/local/sbin/bacula-sd -u bacula -g bacula -v -c
/usr/local/etc/bacula/bacula-sd.conf
root         2072    0.0  0.0  59280  31276  -  Is   11:30
0:00.31 /usr/local/sbin/bacula-fd -u root -g wheel -v -c
/usr/local/etc/bacula/bacula-fd.conf
bacula       2075    0.0  0.0  86992  19352  -  Is   11:30
0:56.95 /usr/local/sbin/bacula-dir -u bacula -g bacula -v -c
/usr/local/etc/bacula/bacula-dir.conf
postgres    50241    0.0  0.1 285764 160244  -  Is   23:05
0:00.38 postgres: bacula bacula [local]  (postgres)
postgres    50244    0.0  0.1 298784  74448  -  Ds   23:05
0:00.67 postgres: bacula bacula [local]  (postgres)
ler         66595    0.0  0.0  12888   2600  3  S+   09:46
0:00.00 grep --color=auto bacula

At the end, I'll list what options are needed for ps and all of its
output is needed. See the end of the email..

ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
❯ sudo procstat -kk 2067
  PID    TID COMM                TDNAME              KSTACK
 2067 100742 bacula-sd           -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
_cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
amd64_syscall+0x12e fast_syscall_common+0xf8
 2067 101036 bacula-sd           -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
 2067 101038 bacula-sd           -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
 2067 124485 bacula-sd           -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
fast_syscall_common+0xf8

ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
❯ sudo procstat -kk 2072
  PID    TID COMM                TDNAME              KSTACK
 2072 100677 bacula-fd           -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
_cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
amd64_syscall+0x12e fast_syscall_common+0xf8
 2072 101039 bacula-fd           -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
 2072 101040 bacula-fd           -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
 2072 124490 bacula-fd           -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
fast_syscall_common+0xf8

ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
❯ sudo procstat -kk 2075
  PID    TID COMM                TDNAME              KSTACK
 2075 101007 bacula-dir          -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
_sleep+0x29b umtxq_sleep+0x242 do_wait+0x26b __umtx_op_wait+0x53
sys__umtx_op+0x7e amd64_syscall+0x12e fast_syscall_common+0xf8
 2075 101041 bacula-dir          -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
 2075 101045 bacula-dir          -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
_cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
amd64_syscall+0x12e fast_syscall_common+0xf8
 2075 101046 bacula-dir          -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
__umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
fast_syscall_common+0xf8
 2075 101047 bacula-dir          -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
 2075 124479 bacula-dir          -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
_cv_wait_sig+0x137 kern_poll_kfds+0x48c kern_poll+0x9f sys_poll+0x50
amd64_syscall+0x12e fast_syscall_common+0xf8
 2075 124480 bacula-dir          -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
 2075 124481 bacula-dir          -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
 2075 124489 bacula-dir          -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
fast_syscall_common+0xf8
 2075 124506 bacula-dir          -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_catch_signals+0x266
sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8

ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
❯ sudo procstat -kk 66390
  PID    TID COMM                TDNAME              KSTACK
66390 101514 df                  -                   mi_switch+0x157
sleepq_switch+0x107 sleepq_timedwait+0x4b _sleep+0x28e
clnt_reconnect_call+0x809 newnfs_request+0xa95 nfscl_request+0x5a
nfsrpc_statfs+0x19d nfs_statfs+0x148 vfs_statfs_sigdefer+0x2e
kern_getfsstat+0x1f1 sys_getfsstat+0x22 amd64_syscall+0x12e
fast_syscall_common+0xf8

This says that the kernel RPC is trying to establish a new TCP connection
to the server. Now why the old connection stopped working
(which should only happen if there is a network partitioning or
client/server crash), we do not know, from the limited info.

ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
❯

this was built yesterday:
❯ uname -a
FreeBSD borg.lerctr.org 14.0-CURRENT FreeBSD 14.0-CURRENT #142
ler/freebsd-main-changes-n257453-175a127a72f: Wed Aug 17 09:23:32 CDT
2022
r...@borg.lerctr.org:/usr/obj/usr/src/amd64.amd64/sys/LER-MINIMAL
amd64

ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
❯

What else do we need?
Typically, the output of the following commands (all of it, not just some
processes/threads) can be useful:

ps axHlw
- The MWCHAN field is by far the most useful, since it indicates where
the process/thread is sleeping. "H" matters since there are nfsiod threads that do asynchronous I/O for readaheads/writebehinds and knowing what
   they are up to can be useful.

netstat -a
- To see what state the TCP connection(s) being used by the mount are up to.
   (Since it was trying to do a new connection, this one might have
been useful.)

nfsstat -m
- To show exactly what NFS mount options are being used.
--> For example, use of "intr" and/or "soft" for an NFSv4 mount is
      guaranteed to break it. (See BUGS section of "man mount_nfs".)

vmstat -m
vmstat -z
- To see what kernel memory allocations look like. (Basically looking for a very large allocation that might indicate a leak or resource exhaustion.)

procstat -kk
- To be honest, for me, this does not usually give me any useful information
   compared with "ps", but it might be useful.

The rest require going into the debugger (if compiled into your kernel).
sysctl debug.kdb.enter=1
- type <ctrl><alt><escape>
then in the debugger, do "show alllocks" and the other commands
listed here:
https://docs.freebsd.org/doc/6.4-RELEASE/usr/share/doc/en/books/developers-handbook/kerneldebug-deadlocks.html

You can type "continue" to get out of the debugger.

Since it was trying to establish a new TCP connection, all I can say is that
something broke the old TCP connection and that could suggest a
problem with your network fabric, such as a broken TSO implementation.
(NFS traffic is really good at generating weird sizes of TSO segments that, in turn, are very good at finding flaws in TSO implementations that other
 TCP traffic never finds.)

rick


MarkJ pointed out there was a kernel bug that I was probably hitting, upgrading past that point fixed it.

--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106

Reply via email to