Re: [slurm-users] lots of job failed due to node failure

2020-07-22 Thread 肖正刚
we checked the slurmd.log, and found "error: service_connection: slurm_receive_msg: Socket timed out on send/recv operation" when the jobs failed, so maybe this is the reason? Sarlo, Jeffrey S wrote on Wed, Jul 22, 2020 at 9:52 PM: > OK. > > Though it does look like both were down for around 5 minutes > > [2020-07-
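
For illustration, one way to correlate the two sides of this failure is to grep both daemon logs around the time the jobs died (log paths are the common defaults and are assumptions, not taken from the thread):

    # On the compute node: look for the send/recv timeout in slurmd's log
    grep "Socket timed out" /var/log/slurm/slurmd.log

    # On the controller: check whether slurmctld marked the node as not
    # responding at roughly the same time
    grep -i "not responding" /var/log/slurm/slurmctld.log

If the two sets of timestamps line up, the node was unreachable from the controller's point of view rather than failing on its own.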

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2020-07-22 Thread Ryan Cox
Angelos, I'm glad you mentioned UnkillableStepProgram.  We meant to look at that a while ago but forgot about it.  That will be very useful for us as well, though the answer for us is pretty much always Lustre problems. Ryan On 7/22/20 1:02 PM, Angelos Ching wrote: Agreed. You may also want

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2020-07-22 Thread Angelos Ching
Agreed. You may also want to write a script that gathers the list of programs in "D state" (kernel wait) and prints their stacks, and configure it as UnkillableStepProgram so that you can capture the program and relevant system calls that caused the job to become unkillable / timed out exiting for f
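
For illustration, a minimal sketch of such an UnkillableStepProgram (the install path, log location, and slurm.conf wiring are assumptions, not from the thread):

    #!/bin/bash
    # Sketch: record D-state (uninterruptible sleep) processes and their
    # kernel stacks so the cause of an unkillable step can be analysed later.
    LOG=/var/log/slurm/unkillable-$(hostname)-$(date +%s).log
    {
        echo "=== D-state processes on $(hostname) at $(date) ==="
        ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
        for pid in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
            echo "--- kernel stack of PID $pid ---"
            cat /proc/"$pid"/stack 2>/dev/null   # needs root; slurmd normally runs as root
        done
    } >> "$LOG" 2>&1

It would then be referenced from slurm.conf with something like UnkillableStepProgram=/usr/local/sbin/dump_dstate.sh (hypothetical path).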

Re: [slurm-users] Nodes going into drain because of "Kill task failed"

2020-07-22 Thread Ryan Cox
Ivan, Are you having I/O slowness? That is the most common cause for us. If it's not that, you'll want to look through all the reasons that it takes a long time for a process to actually die after a SIGKILL because one of those is the likely cause. Typically it's because the process is waiting

[slurm-users] Nodes going into drain because of "Kill task failed"

2020-07-22 Thread Ivan Kovanda
Dear slurm community, Currently running slurm version 18.08.4. We have been experiencing an issue causing any nodes a slurm job was submitted to to go into "drain". From what I've seen, it appears that there is a problem with how slurm is cleaning up the job with the SIGKILL process. I've found this
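
For illustration, the settings usually involved when "Kill task failed" drains a node, and the command to return a node to service, are sketched below (the values and the script path are examples, not recommendations from the thread):

    # slurm.conf: give slow-dying processes more time before the node is
    # drained (the default UnkillableStepTimeout is 60 seconds)
    UnkillableStepTimeout=180
    # optionally run a diagnostic script when a step cannot be killed
    UnkillableStepProgram=/usr/local/sbin/dump_dstate.sh   # hypothetical path

    # return an already-drained node to service once it is healthy
    scontrol update nodename=node01 state=resume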

[slurm-users] Specifying MPS when using GPUs

2020-07-22 Thread Sajesh Singh
We are deploying 2 compute nodes with Nvidia v100 GPUs and would like to use the CUDA MPS feature. I am not sure where to get the number to use for mps when defining the node in slurm.conf. Any advice would be greatly appreciated. Regards, SS
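
For illustration, the mps count is not read from the hardware; it is an arbitrary share denominator divided across the node's GPUs, commonly a multiple of 100 per GPU. A sketch for a two-GPU node follows (node name and device files are assumptions):

    # slurm.conf
    GresTypes=gpu,mps
    NodeName=gpu-node01 Gres=gpu:v100:2,mps:200 ...

    # gres.conf on the node
    Name=gpu Type=v100 File=/dev/nvidia0
    Name=gpu Type=v100 File=/dev/nvidia1
    Name=mps Count=100 File=/dev/nvidia0
    Name=mps Count=100 File=/dev/nvidia1

Jobs would then request a share with, e.g., --gres=mps:50 for half of one GPU.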

Re: [slurm-users] lots of job failed due to node failure

2020-07-22 Thread Angelos Ching
If it's an Ethernet problem there should be kernel messages (dmesg) showing either a link/carrier change or a driver reset. The OP's problem could also have been caused by excessive paging; check the -M flag of slurmd, which locks slurmd into memory: https://slurm.schedmd.com/slurmd.html Regards, Angelos (Sent from mobile, please pardon me fo
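
For illustration, a quick check for the kernel messages mentioned above (the grep patterns are heuristics and the interface name is an assumption):

    # timestamps help correlate link flaps with the job failure window
    dmesg -T | egrep -i 'link (is )?(up|down)|carrier|reset'

    # carrier flap counter for a given interface
    cat /sys/class/net/eth0/carrier_changes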

Re: [slurm-users] lots of job failed due to node failure

2020-07-22 Thread Riebs, Andy
Check for Ethernet problems. This happens often enough that I have the following definition in my .bashrc file to help track these down: alias flaky_eth='su -c "ssh slurmctld-node grep responding /var/log/slurm/slurmctld.log"' Andy From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.co