Hi,
Yesterday, an upgrade to Slurm from 22.05.4 to 23.11.0 went sideways and I
ended up losing a number of jobs on the compute nodes. Ultimately, the
installation seems to have been successful, but it appears I now have some
issues with job remnants. About once per minute (per job), the slurmctld
daemon…
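In case it helps with debugging, one way to watch those recurring slurmctld messages and to list the jobs the controller still tracks might be the following (this assumes a systemd-managed slurmctld; otherwise check the file named by SlurmctldLogFile in slurm.conf):
```
# Assumption: slurmctld runs under systemd; the time window is arbitrary.
journalctl -u slurmctld --since "10 minutes ago"

# List all jobs slurmctld still tracks, in any state (e.g. completing
# remnants left over from the upgrade): job id, state, node list.
squeue --states=all --format="%i %T %N"
```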
Hi Ole,
for multiple reasons we build it ourselves. I am not really involved
in that process, but I will contact the person who is. Thanks for the
recommendation! We should probably implement a regular check for whether
there is a new Slurm version. I am not 100% sure whether this will fix our
issues or not…
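As for the regular version check, a rough sketch along these lines might work; the download URL and the version-parsing pattern are assumptions and may need adjusting:
```
#!/bin/bash
# Hedged sketch: compare the installed Slurm version with the newest
# tarball listed on SchedMD's download server. Not a vetted tool.
installed=$(sinfo -V | awk '{print $2}')
latest=$(curl -s https://download.schedmd.com/slurm/ \
         | grep -oE 'slurm-[0-9]+\.[0-9]+\.[0-9]+' \
         | sort -uV | tail -1 | sed 's/^slurm-//')
if [ -n "$latest" ] && [ "$installed" != "$latest" ]; then
    echo "Newer Slurm available: $installed -> $latest"
fi
```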
On 12/6/23 11:51, Xaver Stiensmeier wrote:
> Good idea. Here's our current version:
> ```
> sinfo -V
> slurm 22.05.7
> ```
> Quick googling told me that the latest version is 23.11. Does the
> upgrade change anything in that regard? I will keep reading.
There are nice bug fixes in 23.02 mentioned in my SLU…
Hi Ole,
Good idea. Here's our current version:
```
sinfo -V
slurm 22.05.7
```
Quick googling told me that the latest version is 23.11. Does the
upgrade change anything in that regard? I will keep reading.
Xaver
On 06.12.23 11:09, Ole Holm Nielsen wrote:
> Hi Xaver,
> Your version of Slurm may matter for your power saving experience. Do you
> run an updated version?
Hi Xaver,
Your version of Slurm may matter for your power saving experience. Do you
run an updated version?
/Ole
On 12/6/23 10:54, Xaver Stiensmeier wrote:
> Hi Ole,
> I will double check, but I am very sure that giving a reason is possible
> as it has been done at least 20 other times without error during that
> exact run…
Hi Ole,
I will double check, but I am very sure that giving a reason is possible,
as it has been done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
defining the states POWER_UP and POWER_DOWN. Slurm's documentation is
not a…
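For illustration, attaching a reason to those power states could look like this (the node name and reason strings are placeholders; as noted above, the reason may be ignored for these states):
```
# Placeholders: node001 and the reason texts are examples only.
scontrol update NodeName=node001 state=POWER_DOWN reason="powering down idle node"
scontrol update NodeName=node001 state=POWER_UP reason="manual power up"
```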
Hi Xaver,
On 12/6/23 09:28, Xaver Stiensmeier wrote:
> using https://slurm.schedmd.com/power_save.html we had one case out of
> many (>242) node starts that resulted in
> `slurm_update error: Invalid node state specified`
> when we called:
> `scontrol update NodeName="$1" state=RESUME reason=FailedStartup`
Hi Joseph,
This might depend on the rest of your configuration, but in general swap
should not be needed for anything on Linux.
BUT: you might get OOM killer messages in your system logs, and SLURM
might fall victim to the OOM killer (OOM = Out Of Memory) if you run
applications on the compute nodes…
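For reference, OOM killer activity typically shows up in the kernel log; a couple of ways to check (assuming a fairly standard Linux setup):
```
# Kernel messages mentioning the OOM killer, with human-readable
# timestamps (requires util-linux dmesg):
dmesg -T | grep -i -E 'out of memory|oom-killer'

# Or via the systemd journal, kernel messages only:
journalctl -k | grep -i oom
```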
Dear Slurm User list,
using https://slurm.schedmd.com/power_save.html we had one case out of
many (>242) node starts that resulted in
`slurm_update error: Invalid node state specified`
when we called:
`scontrol update NodeName="$1" state=RESUME reason=FailedStartup`
in the Fail script. We run…
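Since this happened only once out of >242 starts, it may well be transient; a hedged sketch of a Fail script that simply retries the update could look like this (retry count and delay are arbitrary assumptions):
```
#!/bin/bash
# Hypothetical Fail script wrapper: retry the RESUME update a few
# times in case the node is momentarily in a state that rejects it.
for attempt in 1 2 3; do
    if scontrol update NodeName="$1" state=RESUME reason=FailedStartup; then
        exit 0
    fi
    sleep 10   # arbitrary back-off before retrying
done
echo "Giving up on RESUME for $1" >&2
exit 1
```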