We checked slurmd.log and found "error: service_connection:
slurm_receive_msg: Socket timed out on send/recv operation" when the job
failed, so perhaps this is the cause?
Sarlo, Jeffrey S wrote on Wed, Jul 22, 2020 at 9:52 PM:
> OK.
>
> Though it does look like both were down for around 5 minutes
>
> [2020-07-
Angelos,
I'm glad you mentioned UnkillableStepProgram. We meant to look at that
a while ago but forgot about it. That will be very useful for us as
well, though the answer for us is pretty much always Lustre problems.
Ryan
On 7/22/20 1:02 PM, Angelos Ching wrote:
Agreed. You may also want to write a script that gathers the list of programs in
"D state" (kernel wait) and prints their stacks, and configure it as
UnkillableStepProgram so that you can capture the program and the relevant system
calls that caused the job to become unkillable / timed out exiting for f
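A minimal sketch of such a script (the log path is an assumption, and reading /proc/<pid>/stack generally requires root):

```shell
#!/bin/bash
# Hypothetical UnkillableStepProgram: record every process in D state
# (uninterruptible sleep) together with its kernel stack, so the blocking
# system call can be identified after the node drains.
out=/var/log/slurm/unkillable-$(hostname)-$(date +%s).log
{
    echo "=== D-state processes on $(hostname), $(date) ==="
    # ps prints "PID STAT"; keep rows whose state field starts with D
    ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}' | while read -r pid; do
        echo "--- pid $pid ($(cat /proc/$pid/comm 2>/dev/null)) ---"
        cat "/proc/$pid/stack" 2>/dev/null
    done
} >> "$out"
```

Point UnkillableStepProgram at the script in slurm.conf, and consider raising UnkillableStepTimeout so it has time to run.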
Ivan,
Are you having I/O slowness? That is the most common cause for us. If
it's not that, you'll want to look through all the reasons it can take
a long time for a process to actually die after a SIGKILL, because one of
those is the likely cause. Typically it's because the process is waiting
Dear slurm community,
Currently running slurm version 18.08.4
We have been experiencing an issue causing any nodes a slurm job was submitted
to to "drain".
From what I've seen, it appears that there is a problem with how slurm is
cleaning up the job with the SIGKILL process.
I've found this
We are deploying 2 compute nodes with Nvidia V100 GPUs and would like to use
the CUDA MPS feature. I am not sure where to get the number to use for mps
when defining the node in slurm.conf.
Any advice would be greatly appreciated.
Regards,
SS
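For what it's worth, the Slurm gres documentation says the MPS count is not read from the hardware: it is an arbitrary number you choose, commonly a multiple of 100 per GPU so that job requests map to percentages of a device. A minimal sketch, with node names and device paths assumed:

```conf
# slurm.conf (fragment)
GresTypes=gpu,mps
NodeName=gpu[01-02] Gres=gpu:v100:2,mps:200

# gres.conf on each node (fragment); 100 MPS shares per GPU
Name=gpu Type=v100 File=/dev/nvidia0
Name=gpu Type=v100 File=/dev/nvidia1
Name=mps Count=100 File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia1
```

A job can then request, e.g., --gres=mps:50 for half of one GPU's MPS share.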
If it's an Ethernet problem there should be a kernel message (dmesg) showing
either a link/carrier change or a driver reset.
The OP's problem could also have been caused by excessive paging; check the -M
flag of slurmd: https://slurm.schedmd.com/slurmd.html
Regards,
Angelos
(Sent from mobile, please pardon me fo
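A quick sketch of the dmesg check mentioned above (the interface name eth0 is an assumption; substitute yours):

```shell
# Scan the kernel ring buffer for NIC link flaps or driver resets,
# with human-readable timestamps; show only the most recent hits.
dmesg -T | grep -Ei 'eth0.*(link|carrier|reset)' | tail -n 20
```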
Check for Ethernet problems. This happens often enough that I have the
following definition in my .bashrc file to help track these down:
alias flaky_eth='su -c "ssh slurmctld-node grep responding /var/log/slurm/slurmctld.log"'
Andy
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.co