We are pleased to announce the availability of Slurm version 20.11.5.
This includes a number of moderate severity bug fixes, alongside a new
job_container/tmpfs plugin developed by NERSC that can be used to create
per-job filesystem namespaces.
Initial documentation for this plugin is available at:
https://slurm.schedmd.com/job_container.conf.html
Slurm can be downloaded from https://www.schedmd.com/downloads.php .
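
As a rough sketch of how the job_container/tmpfs plugin might be enabled
(the plugin documentation remains the authoritative reference for the
available options; the BasePath value below is a placeholder assumption,
not a recommended location):

```
# slurm.conf (excerpt)
JobContainerType=job_container/tmpfs
PrologFlags=contain

# job_container.conf
AutoBasePath=true
# Hypothetical path; choose a node-local filesystem suited to your site.
BasePath=/var/spool/slurm/job_tmp
```

With settings along these lines, each job would be expected to see its own
private /tmp mounted under BasePath rather than the node's shared /tmp.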
- Tim
--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
* Changes in Slurm 20.11.5
==========================
-- Fix main scheduler bug where bf_hetjob_prio truncates
SchedulerParameters.
-- Fix sacct not displaying UserCPU, SystemCPU and TotalCPU for large times.
-- scrontab - fix to return the correct index for a bad #SCRON option.
-- scrontab - fix memory leak when invalid option found in #SCRON line.
-- Add errno for when a user requests multiple partitions and they are using
partition based associations.
-- Fix issue where a job could run in a wrong partition when using
EnforcePartLimits=any and partition based associations.
-- Remove possible deadlock when adding associations/wckeys in multiple
threads.
-- When using PrologFlags=alloc make sure the correct Slurm version is set
in the credential.
-- When sending a job a warning signal make sure we always send SIGCONT
beforehand.
-- Fix issue where a batch job would continue running if a prolog failed on a
node that wasn't the batch host and requeuing was disabled.
-- Fix issue where sometimes salloc/srun wouldn't get a message about a prolog
failure in the job's stdout.
-- Requeue or kill job on a prolog failure when PrologFlags is not set.
-- Fix race condition causing node reboots to get requeued before
ResumeTimeout expires.
-- Preserve node boot_req_time on reconfigure.
-- Preserve node power_save_req_time on reconfigure.
-- Fix node reboots being queued and issued multiple times and preventing the
reboot from timing out.
-- Fix debug message related to GrpTRESRunMin (AssocGrpCPURunMinutesLimit).
-- Fix run_command to exit correctly if track_script kills the calling thread.
-- Only requeue a job when the PrologSlurmctld returns nonzero.
-- When a job is signaled with SIGKILL make sure we flush all
prologs/setup scripts.
-- Handle burst buffer scripts if the job is canceled while stage_in is
happening.
-- When shutting down the slurmctld make note to ignore error message when
we have to kill a prolog/setup script we are tracking.
-- scrontab - add support for the --open-mode option.
-- acct_gather_profile/influxdb - avoid segfault on plugin shutdown if setup
has not completed successfully.
-- Reduce delay in starting salloc allocations when running with prologs.
-- Fix issue passing open fd's with [send|recv]msg.
-- Alter AllocNodes check to work if the allocating node's domain doesn't
match the slurmctld's. This restores the pre-20.11 behavior.
-- Fix slurmctld segfault if jobs from a prior version had the now-removed
INVALID_DEPEND state flag set and were allowed to run in 20.11.
-- Add job_container/tmpfs plugin to give a method to provide a private /tmp
per job.
-- Set the correct core affinity when using AutoDetect.
-- Start relying on the conf again in xcpuinfo_mac_to_abs().
-- Fix global_last_rollup assignment on job resizing.
-- slurmrestd - hand over connection context on _on_message_complete().
-- slurmrestd - mark "environment" as required for job submissions in schema.
-- slurmrestd - Disable credential reuse on the same TCP connection. Pipelined
HTTP connections will have to provide authentication with every request.
-- Avoid data conversion error on NULL strings in data_get_string_converted().
-- Handle situation where slurmctld is too slow processing
REQUEST_COMPLETE_BATCH_SCRIPT and it gets resent from the slurmstepd.