If it's an Ethernet problem, shouldn't there be a kernel message (dmesg) showing either a link/carrier change or a driver reset?
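Something along these lines on the affected nodes (j1608/j1802 from the log below) might show it; just a rough sketch, and the exact messages depend on the NIC driver:

    # look for link/carrier flaps or NIC driver resets in the kernel log
    dmesg -T | grep -iE 'link (is )?(up|down)|carrier|reset' | tail -n 50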
OP's problem could also have been caused by excessive paging; have you checked the -M flag of slurmd? https://slurm.schedmd.com/slurmd.html (A quick way to check is sketched at the end of this message.)

Regards,
Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)

> On 2020/07/22 19:48, Riebs, Andy <andy.ri...@hpe.com> wrote:
>
> Check for Ethernet problems. This happens often enough that I have the
> following definition in my .bashrc file to help track these down:
>
> alias flaky_eth='su -c "ssh slurmctld-node grep responding /var/log/slurm/slurmctld.log"'
>
> Andy
>
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of ???
> Sent: Tuesday, July 21, 2020 8:41 PM
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] lots of job failed due to node failure
>
> Hi, all
> We run Slurm 19.05 on a cluster of about 1k nodes. Recently, we found that lots of jobs failed due to node failure; checking slurmctld.log, we found nodes being set to the DOWN state and then resumed quickly.
> Some log info:
> [2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
> [2020-07-20T00:22:27.486] error: Nodes j1608 not responding, setting DOWN
> [2020-07-20T00:26:23.725] error: Nodes j1802 not responding
> [2020-07-20T00:26:27.323] error: Nodes j1802 not responding, setting DOWN
> [2020-07-20T00:26:46.602] Node j1608 now responding
> [2020-07-20T00:26:49.449] Node j1802 now responding
>
> Has anyone hit this issue before?
> Any suggestions will help.
>
> Regards.
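For the paging angle above, a rough sketch of what one could run on a compute node (not from the thread; it just checks whether slurmd was started with -M, i.e. with its pages locked in memory, and whether the node is swapping):

    # was slurmd started with -M (lock its pages in memory)?
    ps -o args= -C slurmd | grep -q -- '-M' \
      && echo "slurmd started with -M" \
      || echo "slurmd NOT started with -M"

    # locked memory of the slurmd process; VmLck > 0 suggests mlock is in effect
    grep VmLck /proc/$(pgrep -o slurmd)/status

    # swap-in/swap-out activity (si/so columns) around the time nodes go DOWN
    vmstat 5 3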