We are pleased to announce the availability of Slurm version 22.05.4.
This includes fixes to two potential crashes in the backfill scheduler, alongside a number of other moderate severity issues.
Slurm can be downloaded from https://www.schedmd.com/downloads.php .

- Tim

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

* Changes in Slurm 22.05.4
==========================
 -- Fix return code from salloc when the job is revoked prior to executing user
    command.
 -- Fix minor memory leak when dealing with gres with multiple files.
 -- Fix printing for no_consume gres in scontrol show job.
 -- sinfo - Fix truncation of very large values when outputting memory.
 -- Fix multi-node step launch failure when nodes in the controller aren't in
    natural order. This can happen with inconsistent node naming (such as
    node15 and node052) or with dynamic nodes which can register in any order.
 -- job_container/tmpfs - Prevent reading the plugin config multiple times per
    step.
 -- Fix wrong attempt of gres binding for gres w/out cores defined.
 -- Fix build to work with '--without-shared-libslurm' configure flag.
 -- Fix power_save mode when repeatedly configuring too fast.
 -- Fix sacct -I option.
 -- Prevent jobs from being scheduled on future nodes.
 -- Fix memory leak in slurmd happening on reconfigure when CPUSpecList used.
 -- Fix sacctmgr show event [min|max]cpus.
 -- Fix regression in 22.05.0rc1 where a prolog or epilog that redirected stdout
    to a file could get erroneously killed, resulting in job launch failure
    (for the prolog) and the node being drained.
 -- cgroup/v1 - Make a static variable to remove potential redundant checking
    for if the system has swap or not.
 -- cgroup/v1 - Add check for swap when running OOM check after task
    termination.
 -- job_submit/lua - add --prefer support
 -- cgroup/v1 - fix issue where sibling steps could incorrectly be accounted as
    OOM when step memory limit was the same as the job allocation. Detect OOM
    events via memory.oom_control oom_kill when exposed by the kernel instead
    of subscribing notifications with eventfd.
 -- Fix accounting of oom_kill events in cgroup/v2 and task/cgroup.
 -- Fix segfault when slurmd reports less than configured gres with links after
    a slurmctld restart.
 -- Fix TRES counts after node is deleted using scontrol.
 -- sched/backfill - properly handle multi-reservation HetJobs.
 -- sched/backfill - don't try to start HetJobs after system state change.
 -- openapi/v0.0.38 - add submission of job->prefer value.
 -- slurmdbd - become SlurmUser at the same point in logic as slurmctld to match
    plugins initialization behavior. This avoids a fatal error when starting
    slurmdbd as root and root cannot start the auth or accounting_storage
    plugins (for example, if root cannot read the jwt key).
 -- Fix memory leak when attempting to update a job's features with invalid
    features.
 -- Fix occasional slurmctld crash or hang in backfill due to invalid pointers.
 -- Fix segfault on Cray machines if cgroup cpuset is used in cgroup/v1.
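
As a rough illustration of the node-ordering entry above: names such as node15 and node052 sort differently depending on whether the numeric suffix is compared as text or as a number. The sketch below is only an assumption about what "natural order" means here; the natural_key helper is hypothetical and is not Slurm's own comparison code.

```python
import re


def natural_key(name):
    """Hypothetical helper: split a node name into text and integer chunks
    so that numeric parts compare by value rather than character by character."""
    parts = re.split(r"(\d+)", name)
    return [int(p) if p.isdigit() else p for p in parts]


nodes = ["node15", "node052"]

# Plain string sort compares '0' with '1' at the fifth character.
lexicographic = sorted(nodes)

# Numeric-aware sort compares 15 with 52.
numeric_aware = sorted(nodes, key=natural_key)

print("lexicographic:", lexicographic)
print("numeric-aware:", numeric_aware)
```

The two orderings disagree for this pair of names, which is the kind of inconsistency the changelog entry describes. Zero-padding node numbers to a consistent width avoids the mismatch.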