On 18/03/2019 23.07, Eric Rosenberg wrote:
> [2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task failed
This usually happens for me when one of the shared filesystems
is overloaded and processes are stuck in uninterruptible sleep
(D state), and thus unable to terminate.
Your reason ("Kill task failed") suggests the same situation.
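In case it helps, this is roughly what I check when it happens (the node name rn003 is taken from your log; the timeout value below is only an example):

    # On the drained node, look for processes stuck in uninterruptible sleep:
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

    # Once the hung processes clear, put the node back in service:
    scontrol update NodeName=rn003 State=RESUME

    # To give slurmstepd more time before flagging "Kill task failed",
    # raise UnkillableStepTimeout (seconds, default 60) in slurm.conf:
    #   UnkillableStepTimeout=180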
Hi,
I created two QOSes in Slurm to define two levels of priority and
preemption. I can limit the maximum number of high-priority jobs each
user may have running or pending to 5 by setting
MaxSubmitJobsPerUser=5 on the high-priority QOS.
But if I want to give some users a larger quota than others, how can I do that?
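For reference, here is roughly how my setup looks with sacctmgr (the QOS names "high" and "low" and the priority values are just placeholders; QOS preemption also requires PreemptType=preempt/qos in slurm.conf):

    # Create the two QOSes (names and values are illustrative):
    sacctmgr add qos high
    sacctmgr add qos low

    # The high-priority QOS preempts the low one and caps each user
    # at 5 running-or-pending jobs:
    sacctmgr modify qos high set Priority=1000 Preempt=low MaxSubmitJobsPerUser=5
    sacctmgr modify qos low set Priority=100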
Hello,
I've set up a few nodes with Slurm to test and am having trouble. It
seems that once a job hits its wall time, the node it ran on enters
the completing (comp) state and then remains in the drain state until
I manually set the state to resume.
Looking at the slurm log on the head node, I see the following:

[2019-03-15T09:48:43.000] update_node: node rn003 reason set to: Kill task failed
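For context, these are the commands I run to see why the node is drained (rn003 taken from the log above):

    # Show which nodes are drained and the reason recorded for each:
    sinfo -R

    # Full details for the node, including the State and Reason fields:
    scontrol show node rn003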