and see if there is more information there.
>
> --
> *From:* 肖正刚
> *Sent:* Wednesday, July 22, 2020 8:46 AM
> *To:* Sarlo, Jeffrey S
> *Subject:* Re: [slurm-users] lots of job failed due to node failure
>
> nodes not rebooted/crashed.
> and
If it's an Ethernet problem, there should be a kernel message (dmesg) showing
either a link/carrier change or a driver reset.
The OP's problem could also have been caused by excessive paging. It may be
worth checking the -M flag of slurmd, which locks slurmd's pages into memory:
https://slurm.schedmd.com/slurmd.html
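
For the dmesg check, something like the sketch below might help. The function name count_link_events and the pattern list are my own assumptions, not anything standard; the exact wording of link and reset messages varies by NIC driver, so adjust the patterns to what your hardware actually logs.

```shell
# Count kernel log lines that hint at a link flap or NIC driver reset.
# ASSUMPTION: the patterns below are only examples of common driver wording.
# Reads the files given as arguments, or stdin when none are given.
count_link_events() {
    grep -Eic 'link is (up|down)|carrier|reset adapter|NETDEV WATCHDOG' "$@"
}
```

Run it on a suspect compute node as `dmesg | count_link_events`. A non-zero count around the time of the node failures would point toward the network rather than paging.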
Regards,
Angelos
Check for Ethernet problems. This happens often enough that I have the
following definition in my .bashrc file to help track these down:
alias flaky_eth='su -c "ssh slurmctld-node grep responding /var/log/slurm/slurmctld.log"'
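
To see which nodes show up most often, the grep output can be tallied. This is only a sketch: the function name tally_unresponsive is made up, and it assumes the log lines look like "error: Nodes <nodelist> not responding", which may differ between Slurm versions.

```shell
# Tally "not responding" log lines per node list, most frequent first.
# ASSUMPTION: lines have the form "... Nodes <nodelist> not responding ...".
# Reads the files given as arguments, or stdin when none are given.
tally_unresponsive() {
    grep 'not responding' "$@" |
        sed -E 's/.*Nodes? ([^ ]+) not responding.*/\1/' |
        sort | uniq -c | sort -rn
}
```

On the controller host this could be run as `tally_unresponsive /var/log/slurm/slurmctld.log`. Nodes that share a switch or cable run and appear together near the top are the first ones to check.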
Andy
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.co