We checked slurmd.log and found "error: service_connection: slurm_receive_msg: Socket timed out on send/recv operation" at the time the jobs failed, so maybe this is the cause?

Sarlo, Jeffrey S <jsa...@central.uh.edu> wrote on Wed, Jul 22, 2020 at 9:52 PM:

> OK.
>
> Though it does look like both were down for around 5 minutes
>
> [2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
> [2020-07-20T00:26:46.602] Node j1608 now responding
> [2020-07-20T00:26:49.449] Node j1802 now responding
>
> You might want to check the slurmd.log file on the compute nodes
> themselves and see if there is more information there.
>
> ------------------------------
> *From:* 肖正刚 <guru.nov...@gmail.com>
> *Sent:* Wednesday, July 22, 2020 8:46 AM
> *To:* Sarlo, Jeffrey S <jsa...@central.uh.edu>
> *Subject:* Re: [slurm-users] lots of job failed due to node failure
>
> The nodes did not reboot or crash,
> and from the log you can see that node j1802 resumed within one minute.
>
> Sarlo, Jeffrey S <jsa...@central.uh.edu> wrote on Wed, Jul 22, 2020 at 7:58 PM:
>
> If you log into a node after you see that, had the node rebooted/crashed?
> Maybe a job is crashing the node or there is a hardware issue with the node.
>
> Jeff
>
> ------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> 肖正刚 <guru.nov...@gmail.com>
> *Sent:* Tuesday, July 21, 2020 7:40 PM
> *To:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
> *Subject:* [slurm-users] lots of job failed due to node failure
>
> Hi all,
> We run Slurm 19.05 on a cluster of about 1k nodes. Recently we found that
> lots of jobs failed due to node failure; checking slurmctld.log, we found
> that nodes are set to the DOWN state and then resume quickly.
> Some log info:
> [2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
> [2020-07-20T00:22:27.486] error: Nodes j1608 not responding, setting DOWN
> [2020-07-20T00:26:23.725] error: Nodes j1802 not responding
> [2020-07-20T00:26:27.323] error: Nodes j1802 not responding, setting DOWN
> [2020-07-20T00:26:46.602] Node j1608 now responding
> [2020-07-20T00:26:49.449] Node j1802 now responding
>
> Has anyone hit this issue before?
> Any suggestions will help.
>
> Regards.
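
The "around 5 minutes" estimate in the thread can be checked mechanically. Below is a minimal Python sketch that measures, for each node, the time between the first "not responding" line and the matching "now responding" line. It assumes slurmctld.log lines look exactly like the excerpt quoted in this thread. The names expand_hostlist and unresponsive_intervals are hypothetical helpers written for this illustration, not part of Slurm, and the hostlist expansion is deliberately simplified to a single bracket group (`scontrol show hostnames` handles the general case).

```python
import re
from datetime import datetime

# Line formats assumed from the log excerpt quoted in this thread.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<msg>.*)$")
NOT_RESP_RE = re.compile(r"^error: Nodes (?P<nodes>\S+) not responding")
RESP_RE = re.compile(r"^Node (?P<node>\S+) now responding")
HOSTLIST_RE = re.compile(r"(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\]")


def expand_hostlist(expr):
    """Expand a simple hostlist such as 'j[1608,1802]' or 'j[01-03]'.

    Simplified on purpose: one prefix and one bracket group only.
    Anything else is returned unchanged as a single name.
    """
    m = HOSTLIST_RE.fullmatch(expr)
    if not m:
        return [expr]
    prefix = m.group("prefix")
    hosts = []
    for part in m.group("body").split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # keep zero padding, e.g. 01..03
            hosts.extend(f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(prefix + part)
    return hosts


def unresponsive_intervals(lines):
    """Return (node, first_not_responding, now_responding, seconds) tuples."""
    first_seen = {}  # node -> time of its first "not responding" line
    results = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        msg = m.group("msg")
        down = NOT_RESP_RE.match(msg)
        up = RESP_RE.match(msg)
        if down:
            for node in expand_hostlist(down.group("nodes")):
                first_seen.setdefault(node, ts)  # keep the earliest report
        elif up and up.group("node") in first_seen:
            node = up.group("node")
            start = first_seen.pop(node)
            results.append((node, start, ts, (ts - start).total_seconds()))
    return results


# Log lines copied from the original post.
SAMPLE = """\
[2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
[2020-07-20T00:22:27.486] error: Nodes j1608 not responding, setting DOWN
[2020-07-20T00:26:23.725] error: Nodes j1802 not responding
[2020-07-20T00:26:27.323] error: Nodes j1802 not responding, setting DOWN
[2020-07-20T00:26:46.602] Node j1608 now responding
[2020-07-20T00:26:49.449] Node j1802 now responding
""".splitlines()

for node, start, end, secs in unresponsive_intervals(SAMPLE):
    print(f"{node}: unresponsive for {secs:.1f} s ({start.time()} to {end.time()})")
```

On the six lines from the original post this gives roughly 323 s for j1608 and 326 s for j1802, which matches the "around 5 minutes" reading. To run it against a real log, pass the open file object to unresponsive_intervals instead of SAMPLE.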