[slurm-users] Slurm version 17.11.3 available

Tim Wickberg Tue, 06 Feb 2018 15:16:52 -0800

We are pleased to announce the availability of Slurm version 17.11.3.

This includes over 44 fixes made since 17.11.2 was released last month, including one issue that can result in stray processes when a job is canceled during a long-running prolog script.


Slurm can be downloaded from https://www.schedmd.com/downloads.php

- Tim

* Changes in Slurm 17.11.3
==========================
 -- Send SIG_UME correctly to a step.
 -- Sort sreport's reservation report by cluster, time_start, resv_name instead
    of cluster, resv_name, time_start.
 -- Avoid setting node in COMPLETING state indefinitely if the job initiating
    the node reboot is cancelled while the reboot is in progress.
 -- Scheduling fix for changing node features without any NodeFeatures plugins.
 -- Improve logic when summarizing job arrays mail notifications.
 -- Add scontrol -F/--future option to display nodes in FUTURE state.
 -- Fix REASONABLE_BUF_SIZE to actually be 3/4 of MAX_BUF_SIZE.
 -- When a job array is preempting make it so tasks in the array don't wait
    to preempt other possible jobs.
 -- Change free_buffer to FREE_NULL_BUFFER to prevent possible double free
    in slurmstepd.
 -- node_feature/knl_cray - Fix memory leaks that occur when slurmctld is
    reconfigured.
 -- node_feature/knl_cray - Fix memory leak that can occur during normal
    operation.
 -- Fix srun environment variables for --prolog script.
 -- Fix job array dependency with "aftercorr" option when some array tasks in
    the first job fail. This fix lets all task array elements that can run
    proceed rather than stopping all subsequent task array elements.
 -- Fix potential deadlock in the slurmctld when using list_for_each.
 -- Fix for possible memory corruption in srun when running heterogeneous job
    steps.
 -- Fix output file containing "%t" (task ID) for heterogeneous job step to
    be based upon global task ID rather than task ID for that component of the
    heterogeneous job step.
 -- MYSQL - Fix potential abort when attempting to make an account a parent of
    itself.
 -- Fix potentially uninitialized variable in slurmctld.
 -- MYSQL - Fix issue for multi-dimensional machines when using sacct to
    find jobs that ran on specific nodes.
 -- Reject --acctg-freq at submit if invalid.
 -- Added info string on sh5util when deleting an empty file.
 -- Correct dragonfly topology support when job allocation specifies desired
    switch count.
 -- Fix minor memory leak on an sbcast error path.
 -- Fix issues when starting the backup slurmdbd.
 -- Revert uid check when requesting a jobid from a pid.
 -- task/cgroup - add support to detect OOM_KILL cgroup events.
 -- Fix whole node allocation cpu counts when --hint=nomultithread.
 -- Allow execution of task prolog/epilog when uid has access
    rights by a secondary group id.
 -- Validate command existence on the srun *[pro|epi]log options
    if LaunchParameter test_exec is set.
 -- Fix potential memory leak if clean starting and the TRES didn't change
    from when last started.
 -- Fix for association MaxWall enforcement when none is given at submission.
 -- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld.
 -- burst_buffer/cray: Attempts by job to create persistent burst buffer when
    one already exists owned by a different user will be logged and the job
    held.
 -- CRAY - Remove race in core_spec when adding the slurmstepd to the job
    container, where canceling the step would erroneously cancel the stepd
    as well.
 -- Make sure the slurmstepd blocks signals like SIGTERM correctly.
 -- SPANK - When slurm_spank_init_post_opt() fails return error correctly.
 -- When revoking a sibling job in the federation we want to send a start
    message before purging the job record to get the uid of the revoked job.
 -- Make JobAcctGatherParams options case-insensitive. Previously, UsePss
    was the only correct capitalization; UsePSS or usepss were silently
    ignored.
 -- Prevent pthread_atfork handlers from being added unnecessarily after
    'scontrol reconfigure', which can eventually lead to a crash if too
    many handlers have been registered.
 -- Better debug messages when MaxSubmitJobs is hit.
 -- Docs - update squeue man page to describe all possible job states.
 -- Preserve node features when slurmctld daemons are reconfigured, including
    active and available KNL features.
 -- Prevent orphaned step_extern steps when a job is cancelled while the
    prolog is still running.
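
The JobAcctGatherParams entry above comes down to comparing option names without regard to case. A minimal sketch of that idea in Python follows. It is not Slurm's implementation (Slurm is written in C), the helper name param_is_set is hypothetical, and the option strings are only illustrative inputs.

```python
def param_is_set(params: str, name: str) -> bool:
    """Return True if the comma-separated list `params` contains `name`,
    ignoring case.

    Hypothetical helper for illustration only; not part of Slurm.
    """
    wanted = name.casefold()
    # Split on commas, trim surrounding whitespace, and compare each
    # token to the wanted option name case-insensitively.
    return any(tok.strip().casefold() == wanted for tok in params.split(","))


if __name__ == "__main__":
    # With a case-sensitive comparison only the first spelling would match;
    # with a case-insensitive one all three are accepted.
    for value in ("UsePss", "UsePSS", "usepss"):
        print(value, param_is_set(value, "UsePss"))
```

A whole-token comparison is used here rather than a substring search, so an option name that merely appears inside a longer token is not treated as a match.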
