If it's an Ethernet problem, shouldn't there be a kernel message (dmesg) showing either a link/carrier change or a driver reset?
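Something along these lines on the affected nodes (j1608/j1802 from the log below) might show it; just a rough sketch, and the exact messages depend on the NIC driver:

    # look for link/carrier flaps or NIC driver resets in the kernel log
    dmesg -T | grep -iE 'link (is )?(up|down)|carrier|reset' | tail -n 50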
OP's problem could also have been caused by excessive paging; have you checked the -M flag of slurmd? https://slurm.schedmd.com/slurmd.html (A quick way to check is sketched at the end of this message.)

Regards,
Angelos
(Sent from mobile, please pardon me for typos and cursoriness.)

> On 2020/07/22 19:48, Riebs, Andy <andy.ri...@hpe.com> wrote:
>
> Check for Ethernet problems. This happens often enough that I have the
> following definition in my .bashrc file to help track these down:
>
> alias flaky_eth='su -c "ssh slurmctld-node grep responding /var/log/slurm/slurmctld.log"'
>
> Andy
>
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of ???
> Sent: Tuesday, July 21, 2020 8:41 PM
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] lots of job failed due to node failure
>
> Hi, all
> We run Slurm 19.05 on a cluster of about 1k nodes. Recently, we found that lots of jobs failed due to node failure; checking slurmctld.log, we found nodes being set to the DOWN state and then resumed quickly.
> Some log info:
> [2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
> [2020-07-20T00:22:27.486] error: Nodes j1608 not responding, setting DOWN
> [2020-07-20T00:26:23.725] error: Nodes j1802 not responding
> [2020-07-20T00:26:27.323] error: Nodes j1802 not responding, setting DOWN
> [2020-07-20T00:26:46.602] Node j1608 now responding
> [2020-07-20T00:26:49.449] Node j1802 now responding
>
> Has anyone hit this issue before?
> Any suggestions will help.
>
> Regards.
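For the paging angle above, a rough sketch of what one could run on a compute node (not from the thread; it just checks whether slurmd was started with -M, i.e. with its pages locked in memory, and whether the node is swapping):

    # was slurmd started with -M (lock its pages in memory)?
    ps -o args= -C slurmd | grep -q -- '-M' \
      && echo "slurmd started with -M" \
      || echo "slurmd NOT started with -M"

    # locked memory of the slurmd process; VmLck > 0 suggests mlock is in effect
    grep VmLck /proc/$(pgrep -o slurmd)/status

    # swap-in/swap-out activity (si/so columns) around the time nodes go DOWN
    vmstat 5 3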