[slurm-users] Re: Mailing list upgrade - slurm-users list paused

2024-01-30 Thread Tim Wickberg via slurm-users

Welcome to the updated list. Posting is re-enabled now.

- Tim

On 1/30/24 11:56, Tim Wickberg wrote:

Hey folks -

The mailing list will be offline for about an hour as we upgrade the 
host, upgrade the mailing list software, and change the mail 
configuration around.


As part of these changes, the "From: " field will no longer show the 
original sender, but will instead use the mailing list ID itself. This is 
to comply with DMARC sending requirements, and allows us to start 
DKIM-signing messages to ensure deliverability once Google and Yahoo 
impose new policy changes in February.
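
As a generic illustration (not taken from the announcement itself), the 
DMARC and DKIM records a receiving server checks can be inspected with 
standard DNS tooling; the DKIM selector name below is purely hypothetical:

  # Published DMARC policy for the list domain:
  dig +short TXT _dmarc.lists.schedmd.com

  # A DKIM public key, assuming a hypothetical selector named "mailman":
  dig +short TXT mailman._domainkey.lists.schedmd.com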


This is the last post on the current (mailman2) list. I'll send a 
welcome message on the upgraded (mailman3) list once finished, and when 
the list is open to new traffic again.


- Tim



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm releases move to a six-month cycle

2024-03-26 Thread Tim Wickberg via slurm-users
Slurm major releases are moving to a six-month release cycle. This 
change starts with the upcoming Slurm 24.05 release this May. Slurm 
24.11 will follow in November 2024. Major releases then continue every 
May and November in 2025 and beyond.


There are two main goals of this change:

- Faster delivery of newer features and functionality for customers.
- "Predictable" release timing, especially for those sites that would 
prefer to upgrade during an annual system maintenance window.


SchedMD will be adjusting our handling of backwards-compatibility within 
Slurm itself, and how SchedMD's support services will handle older releases.


For the 24.05 release, Slurm will still only support upgrading from (and 
mixed-version operations with) the prior two releases (23.11, 23.02). 
Starting with 24.11, Slurm will start supporting upgrades from the prior 
three releases (24.05, 23.11, 23.02).
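
As a minimal sketch of checking what is actually running before a 
mixed-version rolling upgrade (the hostnames below are placeholders, and 
the full upgrade procedure is described in the Slurm documentation):

  # Controller and node daemon versions:
  ssh ctld-host 'slurmctld -V'
  ssh node001 'slurmd -V'

  # Client commands and the controller's reported version:
  sinfo --version
  scontrol show config | grep -i SLURM_VERSION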


SchedMD's Slurm Support has been built around an 18-month cycle. This 
18-month cycle has traditionally covered the current stable release, 
plus one prior major release. With the increase in release frequency, 
this support window will now cover the current stable release, plus 
two prior major releases.


The blog post version of this announcement includes a table that 
outlines the updated support lifecycle:

https://www.schedmd.com/slurm-releases-move-to-a-six-month-cycle/

- Tim

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Convergence of Kube and Slurm?

2024-05-06 Thread Tim Wickberg via slurm-users
Note: I’m aware that I can run Kube on a single node, but we need more 
resources. So ultimately we need a way to have Slurm and Kube exist in 
the same cluster, both sharing the full amount of resources and both 
being fully aware of resource usage.


This is something that we (SchedMD) are working on, although it's a bit 
earlier than I was planning to publicly announce anything...


This is a very high-level view, and I have to apologize for stalling a 
bit, but: we've hired a team to build out a collection of tools that 
we're calling "Slinky" [1]. These provide for canonical ways of running 
Slurm within Kubernetes, ways of maintaining and managing the cluster 
state, and scheduling integration to allow for compute nodes to be 
available to both Kubernetes and Slurm environments while coordinating 
their status.


We'll be talking about it in more detail at the Slurm User Group 
Meeting in Oslo [3], then KubeCon North America in Salt Lake, and SC'24 
in Atlanta. We'll have the (open-source, Apache 2.0 licensed) code for 
our first development phase available by SC'24 if not sooner.


There's a placeholder documentation page [4] that points to some of the 
presentations I've given previously about approaches to tackling this 
converged-computing model, but I'll caution that they're a bit dated, and 
the Slinky-specific presentations we've been working on internally aren't 
publicly available yet.


If there are SchedMD support customers that have specific use cases, 
please feel free to ping your account managers if you'd like to chat at 
some point in the next few months.


- Tim

[1] Slinky is not an acronym (neither is Slurm [2]), but loosely stands 
for "Slurm in Kubernetes".


[2] https://slurm.schedmd.com/faq.html#acronym

[3] https://www.schedmd.com/about-schedmd/events/

[4] https://slurm.schedmd.com/slinky.html

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm version 24.05.1 is now available

2024-06-27 Thread Tim Wickberg via slurm-users

We are pleased to announce the availability of Slurm version 24.05.1.

This release addresses a number of minor-to-moderate issues since the 
24.05 release was first announced a month ago.


Slurm can be downloaded from https://www.schedmd.com/downloads.php .

- Tim



* Changes in Slurm 24.05.1
==
 -- Fix slurmctld and slurmdbd potentially stopping instead of performing a
logrotate when receiving SIGUSR2 when using auth/slurm.
 -- switch/hpe_slingshot - Fix slurmctld crash when upgrading from 23.02.
 -- Fix "Could not find group" errors from validate_group() when using
AllowGroups with large /etc/group files.
 -- Prevent an assertion in debugging builds when triggering log rotation
in a backup slurmctld.
 -- Add AccountingStoreFlags=no_stdio which, when set, avoids recording the
stdio paths of the job.
 -- slurmrestd - Prevent a slurmrestd segfault when parsing the crontab field,
which was never usable. Now it explicitly ignores the value and emits a
warning if it is used for the following endpoints:
  'POST /slurm/v0.0.39/job/{job_id}'
  'POST /slurm/v0.0.39/job/submit'
  'POST /slurm/v0.0.40/job/{job_id}'
  'POST /slurm/v0.0.40/job/submit'
  'POST /slurm/v0.0.41/job/{job_id}'
  'POST /slurm/v0.0.41/job/submit'
  'POST /slurm/v0.0.41/job/allocate'
 -- mpi/pmi2 - Fix communication issue leading to task launch failure with
"invalid kvs seq from node".
 -- Fix getting user environment when using sbatch with "--get-user-env" or
"--export=" when there is a user profile script that reads /proc.
 -- Prevent slurmd from crashing if acct_gather_energy/gpu is configured but
GresTypes is not configured.
 -- Do not log the following errors when AcctGatherEnergyType plugins are used
but a node does not have or cannot find sensors:
"error: _get_joules_task: can't get info from slurmd"
"error: slurm_get_node_energy: Zero Bytes were transmitted or received"
However, the following error will continue to be logged:
"error: Can't get energy data. No power sensors are available. Try later"
 -- sbatch, srun - Set SLURM_NETWORK environment variable if --network is set.
 -- Fix cloud nodes not being able to forward to nodes that restarted with new
IP addresses.
 -- Fix cwd not being set correctly when running a SPANK plugin with a
spank_user_init() hook and the new "contain_spank" option set.
 -- slurmctld - Avoid deadlock during shutdown when auth/slurm is active.
 -- Fix segfault in slurmctld with topology/block.
 -- sacct - Fix printing of job group for job steps.
 -- scrun - Log when an invalid environment variable causes the job submission
to be rejected.
 -- accounting_storage/mysql - Fix problem where listing or modifying an
association when specifying a qos list could hang or take a very long time.
 -- gpu/nvml - Fix gpuutil/gpumem only tracking last GPU in step. Now,
gpuutil/gpumem will record sums of all GPUs in the step.
 -- Fix error in scrontab jobs when using slurm.conf:PropagatePrioProcess=1.
 -- Fix slurmctld crash on a batch job submission with "--nodes 0,...".
 -- Fix dynamic IP address fanout forwarding when using auth/slurm.
 -- Restrict listening sockets in the mpi/pmix plugin and sattach to the
SrunPortRange.
 -- slurmrestd - Limit mime types returned from query to 'GET /openapi/v3' to
only return one mime type per serializer plugin to fix issues with OpenAPI
client generators that are unable to handle multiple mime type aliases.
 -- Fix many commands possibly reporting an "Unexpected Message Received" when
in reality the connection timed out.
 -- Prevent slurmctld from starting if there is not a json serializer present
and the extra_constraints feature is enabled.
 -- Fix heterogeneous job components not being signaled with scancel --ctld and
'DELETE slurm/v0.0.40/jobs' if the job ids are not explicitly given,
the heterogeneous job components match the given filters, and the
heterogeneous job leader does not match the given filters.
 -- Fix regression from 23.02 impeding job licenses from being cleared.
 -- Demote the _get_joules_task error to a log_flag, so it is no longer logged
to the user when too many RPCs were queued in slurmd for gathering energy.
 -- For scancel --ctld and the associated rest api endpoints:
  'DELETE /slurm/v0.0.40/jobs'
  'DELETE /slurm/v0.0.41/jobs'
Fix canceling the final array task in a job array when the task is pending
and all array tasks have been split into separate job records. Previously
this task was not canceled.
 -- Fix power_save operation after recovering from a failed reconfigure.
 -- slurmctld - Skip removing the pidfile when running under systemd. In that
situation it is never created in the first place.
 -- Fix issue where altering the flags on a Slurm account (UsersAreCoords)
would cause several limits on the account's association to be set to 0 in
Slurm's internal cache.
 -- Fi

[slurm-users] Slurm versions 24.05.2, 23.11.9, and 23.02.8 are now available (security fix for switch plugins)

2024-07-31 Thread Tim Wickberg via slurm-users
Slurm versions 24.05.2, 23.11.9, and 23.02.8 are now available and 
include a fix for a recently discovered security issue with the switch 
plugins.


SchedMD customers were informed on July 17th and provided a patch on 
request; this process is documented in our security policy. [1]


For the switch/hpe_slingshot and switch/nvidia_imex plugins, a user 
could override the isolation between Slingshot VNIs or IMEX channels.


If you do not have one of these switch plugins configured, then you are 
not impacted by this issue.


It is unclear what information, if any, could be accessed through an 
unauthorized channel. This disclosure is being made out of an 
abundance of caution.


If you do have one of these plugins enabled, the slurmctld must be 
restarted before the slurmd daemons to avoid disruption.
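
A minimal sketch of that restart order, assuming systemd units and the 
"clush" parallel shell are available at the site (substitute your own 
host group and tooling):

  # On the controller host first:
  systemctl restart slurmctld

  # Then on the compute nodes:
  clush -g compute 'systemctl restart slurmd'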


Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security-policy/

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support


* Changes in Slurm 24.05.2
==
 -- Fix energy gathering RPC counter underflow in _rpc_acct_gather_energy when
more than 10 threads try to get energy at the same time. This prevented any
step from getting energy readings from slurmd until slurmd was restarted,
losing energy accounting metrics on the node.
 -- accounting_storage/mysql - Fix issue where new user with wckey did not
have a default wckey sent to the slurmctld.
 -- slurmrestd - Prevent slurmrestd segfault when handling the following
endpoints when none of the optional parameters are specified:
  'DELETE /slurm/v0.0.40/jobs'
  'DELETE /slurm/v0.0.41/jobs'
  'GET /slurm/v0.0.40/shares'
  'GET /slurm/v0.0.41/shares'
  'GET /slurmdb/v0.0.40/instance'
  'GET /slurmdb/v0.0.41/instance'
  'GET /slurmdb/v0.0.40/instances'
  'GET /slurmdb/v0.0.41/instances'
  'POST /slurm/v0.0.40/job/{job_id}'
  'POST /slurm/v0.0.41/job/{job_id}'
 -- Fix IPMI energy gathering when no IPMIPowerSensors are specified in
acct_gather.conf. This situation resulted in an accounted energy of 0
for job steps.
 -- Fix a minor memory leak in slurmctld when updating a job dependency.
 -- scontrol,squeue - Fix regression that caused incorrect values for
multisocket nodes at '.jobs[].job_resources.nodes.allocation' for
'scontrol show jobs --(json|yaml)' and 'squeue --(json|yaml)'.
 -- slurmrestd - Fix regression that caused incorrect values for
multisocket nodes at '.jobs[].job_resources.nodes.allocation' to be dumped
with endpoints:
  'GET /slurm/v0.0.41/job/{job_id}'
  'GET /slurm/v0.0.41/jobs'
 -- jobcomp/filetxt - Fix truncation of job record lines > 1024 characters.
 -- Fixed regression that prevented compilation on FreeBSD hosts.
 -- switch/hpe_slingshot - Drain node on failure to delete CXI services.
 -- Fix a performance regression from 23.11.0 in cpu frequency handling when no
CpuFreqDef is defined.
 -- Fix one-task-per-sharing not working across multiple nodes.
 -- Fix inconsistent number of cpus when creating a reservation using the
TRESPerNode option.
 -- data_parser/v0.0.40+ - Fix job state parsing which could break filtering.
 -- Prevent cpus-per-task from being modified in jobs where a -c value has been
explicitly specified and the requested memory constraints implicitly
increase the number of CPUs to allocate.
 -- slurmrestd - Fix regression where args '-s v0.0.39,dbv0.0.39' and
'-d v0.0.39' would result in 'GET /openapi/v3' not registering as a valid
possible query resulting in 404 errors.
 -- slurmrestd - Fix memory leak for dbv0.0.39 jobs query which occurred if the
query parameters specified account, association, cluster, constraints,
format, groups, job_name, partition, qos, reason, reservation, state, users,
or wckey. This affects the following endpoints:
  'GET /slurmdb/v0.0.39/jobs'
 -- slurmrestd - In the case the slurmdbd does not respond to a persistent
connection init message, prevent the closed fd from being used, and instead
emit an error or warning depending on if the connection was required.
 -- Fix 24.05.0 regression that caused the slurmdbd not to send back an error
message if there is an error initializing a persistent connection.
 -- Reduce latency of forwarded x11 packets.
 -- Add "curr_dependency" (representing the current dependency of the job)
and "orig_dependency" (representing the original requested dependency of
the job) fields to the job record in job_submit.lua (for job update) and
jobcomp.lua.
 -- Fix potential segfault of slurmctld configured with
SlurmctldParameters=enable_rpc_queue from happening on reconfigure.
 -- Fix potential segfault of slurmctld on its shutdown when rate limiting
is enabled.
 -- slurmrestd - Fix missing job environment for SLURM_JOB_NAME,
SLURM_OPEN_MODE, SLURM_JOB_DEPENDENCY,

[slurm-users] Slurm version 24.05.4 is now available (CVE-2024-48936)

2024-10-23 Thread Tim Wickberg via slurm-users
Slurm version 24.05.4 is now available and includes a fix for a recently 
discovered security issue with the new stepmgr subsystem.


SchedMD customers were informed on October 9th and provided a patch on
request; this process is documented in our security policy. [1]

A mistake in authentication handling in stepmgr could permit an attacker 
to execute processes under other users' jobs. This is limited to jobs 
explicitly running with --stepmgr, or on systems that have globally 
enabled stepmgr through "SlurmctldParameters=enable_stepmgr" in their 
configuration. CVE-2024-48936.
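
As a quick way to check whether a cluster is affected (a sketch; sites 
without "enable_stepmgr" set and without jobs run via --stepmgr are not 
impacted):

  scontrol show config | grep -i SlurmctldParameters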


Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security-policy/

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support


* Changes in Slurm 24.05.4
==
 -- Fix generic int sort functions.
 -- Fix user lookup using a possibly unrealized uid in the dbd.
 -- Fix FreeBSD compile issue with tls/none plugin.
 -- slurmrestd - Fix regressions that allowed slurmrestd to be run as SlurmUser
when SlurmUser was not root.
 -- mpi/pmix - Fix race conditions with het jobs at step start/end which could
make srun hang.
 -- Fix not showing some SelectTypeParameters in scontrol show config.
 -- Avoid assert when dumping certain removed fields in JSON/YAML.
 -- Improve how shards are scheduled with affinity in mind.
 -- Fix MaxJobsAccruePU not being respected when MaxJobsAccruePA is set
in the same QOS.
 -- Prevent backfill from planning jobs that use overlapping resources for the
same time slot if the job's time limit is less than bf_resolution.
 -- Fix memory leak when requesting typed gres and --[cpus|mem]-per-gpu.
 -- Prevent backfill from breaking out due to "system state changed" every 30
seconds if reservations use REPLACE or REPLACE_DOWN flags.
 -- slurmrestd - Make sure that scheduler_unset parameter defaults to true even
when the following flags are also set: show_duplicates, skip_steps,
disable_truncate_usage_time, run_away_jobs, whole_hetjob,
disable_whole_hetjob, disable_wait_for_result, usage_time_as_submit_time,
show_batch_script, and/or show_job_environment. Additionally, always make
sure show_duplicates and disable_truncate_usage_time default to true when
the following flags are also set: scheduler_unset, scheduled_on_submit,
scheduled_by_main, scheduled_by_backfill, and/or job_started. This affects
the following endpoints:
  'GET /slurmdb/v0.0.40/jobs'
  'GET /slurmdb/v0.0.41/jobs'
 -- Ignore --json and --yaml options for scontrol show config to prevent mixing
output types.
 -- Fix not considering nodes in reservations with Maintenance or Overlap flags
when creating new reservations with nodecnt or when they replace down nodes.
 -- Fix suspending/resuming steps running under a 23.02 slurmstepd process.
 -- Fix options like sprio --me and squeue --me for users with a uid greater
than 2147483647.
 -- fatal() if BlockSizes=0. This value is invalid and would otherwise cause the
slurmctld to crash.
 -- sacctmgr - Fix issue where clearing out a preemption list using
preempt='' would cause the given qos to no longer be preempt-able until set
again.
 -- Fix stepmgr creating job steps concurrently.
 -- data_parser/v0.0.40 - Avoid dumping "Infinity" for NO_VAL tagged "number"
fields.
 -- data_parser/v0.0.41 - Avoid dumping "Infinity" for NO_VAL tagged "number"
fields.
 -- slurmctld - Fix a potential leak while updating a reservation.
 -- slurmctld - Fix state save with reservation flags when an update fails.
 -- Fix reservation update issues with parameters Accounts and Users, when
using +/- signs.
 -- slurmrestd - Don't dump warning on empty wckeys in:
  'GET /slurmdb/v0.0.40/config'
  'GET /slurmdb/v0.0.41/config'
 -- Fix slurmd possibly leaving zombie processes on start up in configless when
the initial attempt to fetch the config fails.
 -- Fix crash when trying to drain a non-existing node (possibly deleted
before).
 -- slurmctld - Fix segfault when calculating limit decay for jobs with an
invalid association.
 -- Fix IPMI energy gathering with multiple sensors.
 -- data_parser/v0.0.39 - Remove xassert requiring errors and warnings to have a
source string.
 -- slurmrestd - Prevent potential segfault when there is an error parsing an
array field which could lead to a double xfree. This applies to several
endpoints in data_parser v0.0.39, v0.0.40 and v0.0.41.
 -- scancel - Fix a regression from 23.11.6 where using both the --ctld and
--sibling options would cancel the federated job on all clusters instead of
only the cluster(s) specified by --sibling.
 -- accounting_storage/mysql - Fix bug when removing an association
specified with an empty partition.
 -- Fix setting multiple partition state restore on a job correctly.
 -- Fix difference in behavior when s

[slurm-users] Slurm version 24.11 is now available

2024-11-29 Thread Tim Wickberg via slurm-users

We are pleased to announce the availability of the Slurm 24.11 release.

To highlight some new features in 24.11:

- New gpu/nvidia plugin. This does not rely on any NVIDIA libraries, and will
  build by default on all systems. It supports basic GPU detection and
  management, but cannot currently identify GPU-to-GPU links, or provide
  usage data, as these are not exposed by the kernel driver.
- Add autodetected GPUs to the output from "slurmd -C".
- Added new QOS-based reports to "sreport".
- Revamped network I/O with the "conmgr" thread-pool model.
- Added new "hostlist function" syntax for management commands and
  configuration files.
- switch/hpe_slingshot - Added support for hardware collectives setup through
  the fabric manager. (Requires SlurmctldParameters=enable_stepmgr)
- Added SchedulerParameters=bf_allow_magnetic_slot configuration option to
  allow backfill planning for magnetic reservations.
- Added new "scontrol listjobs" and "liststeps" commands to complement
  "listpids", and provide --json/--yaml output for all three subcommands.
  (A brief usage sketch follows this list.)
- Allow jobs to be submitted against multiple QOSes.
- Added new experimental "oracle" backfill scheduling support, which permits
  jobs to be delayed if the oracle function determines the reduced
  fragmentation of the network topology is sufficiently advantageous.
- Improved responsiveness of the controller when jobs are requeued by
  replacing the "db_index" identifier with a slurmctld-generated unique
  identifier. ("SLUID")
- New options to job_container/tmpfs to permit site-specific scripts to
  modify the namespace before user steps are launched, and to ensure all
  steps are completely captured within that namespace.
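
A brief usage sketch of the new scontrol subcommands mentioned above (the 
exact output formats aren't shown in this announcement):

  scontrol listjobs
  scontrol liststeps --json
  scontrol listpids --yaml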

The Slurm documentation has also been updated to the 24.11 release.
(Older versions can be found in the archive, linked from the main
documentation page.)

Slurm can be downloaded from https://www.schedmd.com/download-slurm/ .

- Tim

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Slurm version 24.11.1 is now available

2025-01-24 Thread Tim Wickberg via slurm-users

https://github.com/SchedMD/slurm/blob/slurm-24.11/NEWS is current.

We've changed how the release branches are managed, which means that the 
changes for each maintenance release aren't reflected in the master 
branch version of that file. The release-branch-specific NEWS is being 
updated for the existing stable releases as each new maintenance release 
is tagged. (It's now generated from the Changelog: commit trailers, 
instead of being changed directly as commits are pushed.)
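
As a rough sketch of how such trailers can be listed with stock git (the 
"Changelog:" key comes from the message above; the remote and branch names 
are whatever your local clone uses):

  git log --no-merges \
      --format='%h %(trailers:key=Changelog,valueonly)' \
      origin/slurm-24.11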


There will likely be further changes to NEWS and RELEASE_NOTES for 25.05 
when released this spring, but we haven't settled on exactly what that 
will look like yet.


- Tim

On 1/24/25 01:01, Ole Holm Nielsen via slurm-users wrote:

Hi Marshall,

Could you update the NEWS file?
https://github.com/SchedMD/slurm/blob/master/NEWS

Thanks,
Ole


--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Unable to receive password reminder

2025-01-14 Thread Tim Wickberg via slurm-users
Apologies for the confusion - we've fixed some internal mail routing for 
the list admin accounts and shouldn't miss requests like the one you sent 
again. You've been removed.


For anyone in a similar situation, the footer on all list messages mentions:

"To unsubscribe send an email to slurm-users-le...@lists.schedmd.com"

You do need to send that request from the address you're receiving 
mailing list traffic on, and will need to confirm the request by 
replying to an auto-generated response once. (This is to prevent someone 
forging your address and silently unsubscribing you.)


If you want to directly manage your subscriptions, you can create an 
account on https://lists.schedmd.com/ matching your subscribed address, 
and subscribe or unsubscribe from there. Unfortunately this is a bit 
more involved, but was an unavoidable change when we needed to migrate 
to Mailman 3.


- Tim

On 1/14/25 06:17, Loris Bennett via slurm-users wrote:

Hi,

Over a week ago I sent the message below to the address I found for the
list owner, but have not received a response.

Does anyone know how to proceed in this case?

Cheers,

Loris

 Start of forwarded message 
From: Loris Bennett 
To: 
Subject: Unable to receive password reminder
Date: Mon, 6 Jan 2025 08:35:42 +0100

Dear list owner,

I have recently switched from reading the list via mail to using the
mail to news gateway at news.gmane.io.  Therefore I would like to change my
mailman settings in order to stop delivery of postings via mail.

As I have forgotten my list password, I requested a reminder.  However I
get the reply that no user with the given email address was found in the
user database.  The addresses I tried were

   loris.benn...@fu-berlin.de
   lo...@zedat.fu-berlin.de

the former being an alias for the latter.  This is the email account
to which emails from the list are sent, so I am somewhat confused 
as to why neither of the addresses is recognised.

Could you please help me to resolve this issue?

Regards

Loris Bennett



--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm versions 24.11.5, 24.05.8, and 23.11.11 are now available (CVE-2025-43904)

2025-05-07 Thread Tim Wickberg via slurm-users
Slurm versions 24.11.5, 24.05.8, and 23.11.11 are now available and 
include a fix for a recently discovered security issue.


SchedMD customers were informed on April 23rd and provided a patch on
request; this process is documented in our security policy. [1]

A mistake with permission handling for Coordinators within Slurm's 
accounting system can allow a Coordinator to promote a user to 
Administrator. (CVE-2025-43904)
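
For sites that want to audit accounts after applying the fix, a rough 
sketch using sacctmgr (check the sacctmgr man page for the exact field 
and option names on your version):

  # Who currently has elevated admin levels:
  sacctmgr show user format=User,Admin

  # Which users are coordinators of which accounts:
  sacctmgr show user withcoord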


Thank you to Sekou Diakite (HPE) for reporting this.

Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security-policy/

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support


* Changes in Slurm 24.11.5
==
 -- Return error to scontrol reboot on bad nodelists.
 -- slurmrestd - Report an error when QOS resolution fails for v0.0.40
endpoints.
 -- slurmrestd - Report an error when QOS resolution fails for v0.0.41
endpoints.
 -- slurmrestd - Report an error when QOS resolution fails for v0.0.42
endpoints.
 -- data_parser/v0.0.42 - Added +inline_enums flag which modifies the
output when generating OpenAPI specification. It causes enum arrays to not
be defined in their own schema with references ($ref) to them. Instead they
will be dumped inline.
 -- Fix binding error with tres-bind map/mask on partial node allocations.
 -- Fix stepmgr enabled steps being able to request features.
 -- Reject step creation if requested feature is not available in job.
 -- slurmd - Restrict listening for new incoming RPC requests further into
startup.
 -- slurmd - Avoid auth/slurm related hangs of CLI commands during startup
and shutdown.
 -- slurmctld - Restrict processing new incoming RPC requests further into
startup. Stop processing requests sooner during shutdown.
 -- slurmctld - Avoid auth/slurm related hangs of CLI commands during
startup and shutdown.
 -- slurmctld - Avoid race condition during shutdown or reconfigure that
could result in a crash due to delayed processing of a connection while
plugins are unloaded.
 -- Fix small memleak when getting the job list from the database.
 -- Fix incorrect printing of % escape characters when printing stdio
fields for jobs.
 -- Fix padding parsing when printing stdio fields for jobs.
 -- Fix printing %A array job id when expanding patterns.
 -- Fix reservations causing jobs to be held for Bad Constraints.
 -- switch/hpe_slingshot - Prevent potential segfault on failed curl
request to the fabric manager.
 -- Fix printing incorrect array job id when expanding stdio file names.
The %A will now be substituted by the correct value.
 -- switch/hpe_slingshot - Fix vni range not updating on slurmctld restart
or reconfigure.
 -- Fix steps not being created when using certain combinations of -c and
-n inferior to the job's requested resources, when using stepmgr and nodes
are configured with CPUs == Sockets*CoresPerSocket.
 -- Permit configuring the number of retry attempts to destroy CXI service
via the new destroy_retries SwitchParameter.
 -- Do not reset memory.high and memory.swap.max during slurmd startup or
reconfigure, as slurmd never actually modifies these values.
 -- Fix reconfigure failure of slurmd when it has been started manually and
the CoreSpecLimits have been removed from slurm.conf.
 -- Set or reset CoreSpec limits when slurmd is reconfigured and it was
started with systemd.
 -- switch/hpe_slingshot - Make sure the slurmctld can free step VNIs after
the controller restarts or reconfigures while the job is running.
 -- Fix backup slurmctld failure on 2nd takeover.
 -- Testsuite - fix python test 130_2.
 -- Fix security issue where a coordinator could add a user with elevated
privileges. CVE-2025-43904.



* Changes in Slurm 24.05.8
==
 -- Testsuite - fix python test 130_2.
 -- Fix security issue where a coordinator could add a user with elevated
privileges. CVE-2025-43904.



* Changes in Slurm 23.11.11
===
 -- Fixed a job requeuing issue that merged job entries into the same SLUID
when all nodes in a job failed simultaneously.
 -- Add ABORT_ON_FATAL environment variable to capture a backtrace from any
fatal() message.
 -- Testsuite - fix python test 130_2.
 -- Fix security issue where a coordinator could add a user with elevated
privileges. CVE-2025-43904.


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Slurm version 25.05 is now available

2025-05-29 Thread Tim Wickberg via slurm-users

We are pleased to announce the availability of Slurm 25.05.

The release notes summarizing the new features, and including links to 
the corresponding documentation, can be found at:

https://slurm.schedmd.com/release_notes.html

A more extensive list of changes is available in the CHANGELOG:
https://github.com/SchedMD/slurm/blob/slurm-25.05/CHANGELOG/slurm-25.05.md

The Slurm documentation has also been updated to the 25.05 release:
https://slurm.schedmd.com

Slurm can be downloaded from:
https://www.schedmd.com/download-slurm/

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com