Re: [slurm-users] Effect of slurmctld and slurmdb going down on running/pending jobs

2021-06-23 Thread Barbara Krašovec
Just in case, increase SlurmdTimeout in slurm.conf, so that when the controller is back you have time to fix any communication issues between slurmd and slurmctld, if there are any. Otherwise it should not affect running and pending jobs. First stop the controller, then slur
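For reference, a minimal slurm.conf sketch of the timeout change suggested above; the 3600-second value is only an illustrative assumption, not taken from the thread:

  # slurm.conf on the controller and all compute nodes
  # Let slurmd stay unreachable longer before nodes are marked DOWN
  SlurmdTimeout=3600

  # Propagate the new value (or restart slurmctld if you prefer):
  #   scontrol reconfigure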

[slurm-users] Effect of slurmctld and slurmdb going down on running/pending jobs

2021-06-23 Thread Amjad Syed
Hello all, We have a cluster running CentOS 7. Our Slurm scheduler is running on a VM and we are running out of disk space on /var; the Slurm InnoDB database is taking most of the space. We intend to expand the vdisk for the Slurm server, which will require a reboot for the changes to take effect. D
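Before resizing the vdisk, a quick way to confirm the accounting database is what is filling /var (paths and the database name 'slurm_acct_db' are the common defaults and are assumptions here; adjust to your layout):

  # On the database host: how much of /var the MySQL/MariaDB datadir uses
  du -sh /var/lib/mysql

  # Per-table size of the Slurm accounting database
  mysql -e "SELECT table_name,
                   ROUND((data_length+index_length)/1024/1024) AS size_mb
            FROM information_schema.tables
            WHERE table_schema='slurm_acct_db'
            ORDER BY size_mb DESC;"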

[slurm-users] slurmd running on IBM Power9 systems

2021-06-23 Thread Karl Lovink
Hello, I have compiled version 20.11.7 for an IBM Power9 system running Ubuntu 18.04. I have slurmd running, but a recurring error shows up in slurmd.log. I have already done some research but cannot find a solution. The error is: [2021-06-23T18:02:01.550] error: all available frequencies not s
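That message comes from slurmd's CPU-frequency support probing the Linux cpufreq sysfs interface. A hedged first check is whether the cpufreq driver on the Power9 node exposes scalable frequencies at all; these are the standard sysfs paths, but which files exist depends on the platform and driver:

  # Does the kernel expose cpufreq for this CPU?
  ls /sys/devices/system/cpu/cpu0/cpufreq/ 2>/dev/null

  # Which driver, governor and frequency range are available?
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies 2>/dev/null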

[slurm-users] New node w/ 3 GPUs is not accepting GPUs tasks

2021-06-23 Thread David Henkemeyer
Hello, I just added a third node (called "hsw5") to my Slurm partition as we continue to enable Slurm in our environment. But the new node is not accepting jobs that require a GPU, despite the fact that it has 3 GPUs. The other node that has a GPU ("devops3") is accepting GPU jobs as expected. A
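As a sketch of what a GPU node usually needs before it will accept GPU jobs: the node name, GPU count and device paths below are illustrative assumptions, not the poster's actual configuration:

  # slurm.conf (controller and nodes)
  GresTypes=gpu
  NodeName=hsw5 Gres=gpu:3 ...

  # gres.conf on hsw5 (device files are illustrative)
  Name=gpu File=/dev/nvidia[0-2]

  # After restarting slurmd and slurmctld, verify the node advertises its GPUs:
  scontrol show node hsw5 | grep -i gres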

Re: [slurm-users] Slurm does not set memory.limit_in_bytes for tasks (but does for steps)

2021-06-23 Thread Jacob Chappell
Hi Marcus, That makes sense, thanks! I suppose then (for monitoring purposes, for example, without probing scontrol/sacct), if you wanted to figure out the true maximum memory limit for a task, you'd need to walk up the cgroup hierarchy and take the smallest value you find.
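A rough shell sketch of that walk for cgroup v1, assuming the usual Slurm memory-cgroup layout under /sys/fs/cgroup/memory/slurm/uid_*/job_*/step_*/task_*; the uid, job, step and task IDs are placeholders:

  # Effective limit for a task = smallest memory.limit_in_bytes on the path up
  cg=/sys/fs/cgroup/memory/slurm/uid_1000/job_12345/step_0/task_0
  min=""
  while [ "$cg" != "/sys/fs/cgroup/memory" ]; do
      lim=$(cat "$cg/memory.limit_in_bytes")
      if [ -z "$min" ] || [ "$lim" -lt "$min" ]; then
          min=$lim
      fi
      cg=$(dirname "$cg")
  done
  echo "effective limit: $min bytes"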

Re: [slurm-users] Slurm does not set memory.limit_in_bytes for tasks (but does for steps)

2021-06-23 Thread Marcus Wagner
Hi Jacob, I generally think that is the better way. If you have, e.g., tasks with different memory needs, Slurm (or the oom_killer, to be precise) would kill the job if that limit gets exceeded. If the limit is set only for the step, the tasks can "steal" memory from each other. Best, Marcus A