Hi Byron,

We ran into this with 20.02 and mitigated it with some kernel tuning. From our sysctl.conf:
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
# prevent neighbour (aka ARP) table overflow...
net.ipv4.neigh.default.gc_thresh1 = 30000
net.ipv4.neigh.default.gc_thresh2 = 32000
net.ipv4.neigh.default.gc_thresh3 = 32768
net.ipv4.neigh.default.mcast_solicit = 9
net.ipv4.neigh.default.ucast_solicit = 9
net.ipv4.neigh.default.gc_stale_time = 86400
net.ipv4.neigh.eth0.mcast_solicit = 9
net.ipv4.neigh.eth0.ucast_solicit = 9
net.ipv4.neigh.eth0.gc_stale_time = 86400
# enable selective ack algorithm
net.ipv4.tcp_sack = 1
# workaround TIME_WAIT
net.ipv4.tcp_tw_reuse = 1
# and since all traffic is local
net.ipv4.tcp_fin_timeout = 20

We have a 16-bit (/16) cluster network, so the ARP settings date from that. tcp_sack is more of a legacy setting from when some kernels didn't enable it by default.

You would likely see tons of connections in TIME_WAIT if you ran "netstat -a" during the periods when you're seeing the hangs (a couple of quick checks are sketched below the quoted message). Our workaround settings seem to have mitigated that.

On Thu, Jul 28, 2022 at 9:29 AM byron <lbgpub...@gmail.com> wrote:

> Hi
>
> We recently upgraded Slurm from 19.05.7 to 20.11.9 and now we occasionally
> (3 times in 2 months) have slurmctld hanging, so we get the following
> message when running sinfo:
>
> “slurm_load_jobs error: Socket timed out on send/recv operation”
>
> It only seems to happen when one of our users runs a job that submits a
> short-lived job every second for 5 days (up to 90,000 in a day), although
> that could be a red herring.
>
> There is nothing to be found in the slurmctld log.
>
> Can anyone suggest how to even start troubleshooting this? Without
> anything in the logs I don't know where to start.
>
> Thanks
>
>
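For reference, the quick checks we run on the controller node, just a rough sketch assuming a reasonably recent Linux with iproute2's ss available:

  # count sockets stuck in TIME_WAIT (ss is the modern netstat replacement)
  ss -tan state time-wait | wc -l

  # or the netstat equivalent
  netstat -an | grep -c TIME_WAIT

  # see how full the ARP/neighbour table is relative to gc_thresh3
  ip neigh show | wc -l

  # apply the sysctl.conf changes without a reboot, then spot-check one
  sysctl -p
  sysctl net.ipv4.tcp_tw_reuse

If the TIME_WAIT count climbs into the tens of thousands during one of the hangs, that would match what we were seeing before the tuning.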