Re: [slurm-users] lots of job failed due to node failure

2020-07-22 Thread 肖正刚
and see if there is more information there.
>
> --
> *From:* 肖正刚
> *Sent:* Wednesday, July 22, 2020 8:46 AM
> *To:* Sarlo, Jeffrey S
> *Subject:* Re: [slurm-users] lots of job failed due to node failure
>
> nodes not rebooted/crashed.
> and

Re: [slurm-users] lots of job failed due to node failure

2020-07-22 Thread Angelos Ching
If it's an Ethernet problem, there should be a kernel message (dmesg) showing either a link/carrier change or a driver reset. The OP's problem could also have been caused by excessive paging; check the -M flag of slurmd: https://slurm.schedmd.com/slurmd.html

Regards, Angelos (Sent from mobile, please pardon me fo
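
A rough sketch of the dmesg check suggested above. The function name link_events and the match patterns are assumptions, not anything from Slurm or a specific driver; the wording of link and reset messages differs between NIC drivers, so adjust the pattern to what your hardware actually logs. The bare "reset" pattern is deliberately broad and may also match unrelated devices (USB resets, for example).

```shell
# Hypothetical helper: filter kernel messages on stdin for lines that look
# like link/carrier changes or driver resets. Patterns are guesses and
# vary by NIC driver.
link_events() {
    grep -iE 'link is (up|down)|link (up|down)|carrier|netdev watchdog|reset'
}

# Typical use on a suspect compute node (needs permission to read the
# kernel ring buffer):
#   dmesg -T | link_events
```

If nothing shows up around the time the jobs failed, the network link itself is probably not the culprit and the paging theory is worth a closer look.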

Re: [slurm-users] lots of job failed due to node failure

2020-07-22 Thread Riebs, Andy
Check for Ethernet problems. This happens often enough that I have the following definition in my .bashrc file to help track these down:

alias flaky_eth='su -c "ssh slurmctld-node grep responding /var/log/slurm/slurmctld.log"'

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.co
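
A possible extension of the alias above, as a sketch only: instead of printing every matching line, tally the distinct "responding" messages so the noisiest nodes come out on top. The function name responding_summary is made up for illustration, and the sed step assumes each log line starts with a bracketed timestamp such as [2020-07-22T08:46:00.000]; if your slurmctld.log uses a different prefix, drop or adapt that step. It reads a local copy of the log, so run it on the slurmctld node or copy the file over first.

```shell
# Hypothetical helper: summarize "responding" messages in a slurmctld log.
# Default path is the one used in the alias; pass another path as $1.
responding_summary() {
    log="${1:-/var/log/slurm/slurmctld.log}"
    grep 'responding' "$log" |
        sed 's/^\[[^]]*] *//' |   # strip an assumed leading [timestamp]
        sort | uniq -c | sort -rn # count identical messages, most frequent first
}

# Usage:
#   responding_summary                      # default log path
#   responding_summary /tmp/slurmctld.log   # a copied log file
```

Nodes that show up repeatedly in the summary are the ones whose cabling, switch port, or NIC are worth checking first.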