[slurm-users] Re: Mailing list upgrade - slurm-users list paused
Welcome to the updated list. Posting is re-enabled now.

- Tim

On 1/30/24 11:56, Tim Wickberg wrote:

Hey folks -

The mailing list will be offline for about an hour as we upgrade the host, upgrade the mailing list software, and change the mail configuration around.

As part of these changes, the "From: " field will no longer be the original sender, but instead use the mailing list ID itself. This is to comply with DMARC sending options, and allow us to start DKIM signing messages to ensure deliverability once Google and Yahoo impose new policy changes in February.

This is the last post on the current (mailman2) list. I'll send a welcome message on the upgraded (mailman3) list once finished, and when the list is open to new traffic again.

- Tim

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Slurm releases move to a six-month cycle
Slurm major releases are moving to a six-month release cycle. This change starts with the upcoming Slurm 24.05 release this May. Slurm 24.11 will follow in November 2024. Major releases then continue every May and November in 2025 and beyond.

There are two main goals of this change:

- Faster delivery of newer features and functionality for customers.
- "Predictable" release timing, especially for those sites that would prefer to upgrade during an annual system maintenance window.

SchedMD will be adjusting our handling of backwards compatibility within Slurm itself, and how SchedMD's support services will handle older releases.

For the 24.05 release, Slurm will still only support upgrading from (and mixed-version operations with) the prior two releases (23.11, 23.02). Starting with 24.11, Slurm will start supporting upgrades from the prior three releases (24.05, 23.11, 23.02).

SchedMD's Slurm Support has been built around an 18-month cycle. This 18-month cycle has traditionally covered the current stable release, plus one prior major release. With the increase in release frequency, this support window will now cover the current stable release, plus two prior major releases.

The blog post version of this announcement includes a table that outlines the updated support lifecycle:
https://www.schedmd.com/slurm-releases-move-to-a-six-month-cycle/

- Tim

-- 
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: Convergence of Kube and Slurm?
> Note: I’m aware that I can run Kube on a single node, but we need more
> resources. So ultimately we need a way to have Slurm and Kube exist in the
> same cluster, both sharing the full amount of resources and both being
> fully aware of resource usage.

This is something that we (SchedMD) are working on, although it's a bit earlier than I was planning to publicly announce anything...

This is a very high-level view, and I have to apologize for stalling a bit, but: we've hired a team to build out a collection of tools that we're calling "Slinky" [1]. These provide canonical ways of running Slurm within Kubernetes, ways of maintaining and managing the cluster state, and scheduling integration to allow compute nodes to be available to both Kubernetes and Slurm environments while coordinating their status.

We'll be talking about it in more detail at the Slurm User Group Meeting in Oslo [3], then KubeCon North America in Salt Lake, and SC'24 in Atlanta. We'll have the (open-source, Apache 2.0 licensed) code for our first development phase available by SC'24 if not sooner.

There's a placeholder documentation page [4] that points to some of the presentations I've given before talking about approaches to tackling this converged-computing model, but I'll caution they're a bit dated, and the Slinky-specific presentations we've been working on internally aren't publicly available yet.

If there are SchedMD support customers that have specific use cases, please feel free to ping your account managers if you'd like to chat at some point in the next few months.

- Tim

[1] Slinky is not an acronym (neither is Slurm [2]), but loosely stands for "Slurm in Kubernetes".
[2] https://slurm.schedmd.com/faq.html#acronym
[3] https://www.schedmd.com/about-schedmd/events/
[4] https://slurm.schedmd.com/slinky.html

-- 
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Slurm version 24.05.1 is now available
We are pleased to announce the availability of Slurm version 24.05.1.

This release addresses a number of minor-to-moderate issues since the 24.05 release was first announced a month ago.

Slurm can be downloaded from https://www.schedmd.com/downloads.php .

- Tim

* Changes in Slurm 24.05.1
==========================
 -- Fix slurmctld and slurmdbd potentially stopping instead of performing a logrotate when receiving SIGUSR2 when using auth/slurm.
 -- switch/hpe_slingshot - Fix slurmctld crash when upgrading from 23.02.
 -- Fix "Could not find group" errors from validate_group() when using AllowGroups with large /etc/group files.
 -- Prevent an assertion in debugging builds when triggering log rotation in a backup slurmctld.
 -- Add AccountingStoreFlags=no_stdio, which when set prevents recording the stdio paths of the job. (A minimal slurm.conf sketch follows after this changelog.)
 -- slurmrestd - Prevent a slurmrestd segfault when parsing the crontab field, which was never usable. Now it explicitly ignores the value and emits a warning if it is used for the following endpoints:
    'POST /slurm/v0.0.39/job/{job_id}'
    'POST /slurm/v0.0.39/job/submit'
    'POST /slurm/v0.0.40/job/{job_id}'
    'POST /slurm/v0.0.40/job/submit'
    'POST /slurm/v0.0.41/job/{job_id}'
    'POST /slurm/v0.0.41/job/submit'
    'POST /slurm/v0.0.41/job/allocate'
 -- mpi/pmi2 - Fix communication issue leading to task launch failure with "invalid kvs seq from node".
 -- Fix getting user environment when using sbatch with "--get-user-env" or "--export=" when there is a user profile script that reads /proc.
 -- Prevent slurmd from crashing if acct_gather_energy/gpu is configured but GresTypes is not configured.
 -- Do not log the following errors when AcctGatherEnergyType plugins are used but a node does not have or cannot find sensors:
    "error: _get_joules_task: can't get info from slurmd"
    "error: slurm_get_node_energy: Zero Bytes were transmitted or received"
    However, the following error will continue to be logged:
    "error: Can't get energy data. No power sensors are available. Try later"
 -- sbatch, srun - Set SLURM_NETWORK environment variable if --network is set.
 -- Fix cloud nodes not being able to forward to nodes that restarted with new IP addresses.
 -- Fix cwd not being set correctly when running a SPANK plugin with a spank_user_init() hook and the new "contain_spank" option set.
 -- slurmctld - Avoid deadlock during shutdown when auth/slurm is active.
 -- Fix segfault in slurmctld with topology/block.
 -- sacct - Fix printing of job group for job steps.
 -- scrun - Log when an invalid environment variable causes the job submission to be rejected.
 -- accounting_storage/mysql - Fix problem where listing or modifying an association when specifying a qos list could hang or take a very long time.
 -- gpu/nvml - Fix gpuutil/gpumem only tracking last GPU in step. Now, gpuutil/gpumem will record sums of all GPUs in the step.
 -- Fix error in scrontab jobs when using slurm.conf:PropagatePrioProcess=1.
 -- Fix slurmctld crash on a batch job submission with "--nodes 0,...".
 -- Fix dynamic IP address fanout forwarding when using auth/slurm.
 -- Restrict listening sockets in the mpi/pmix plugin and sattach to the SrunPortRange.
 -- slurmrestd - Limit mime types returned from query to 'GET /openapi/v3' to only return one mime type per serializer plugin to fix issues with OpenAPI client generators that are unable to handle multiple mime type aliases.
 -- Fix many commands possibly reporting an "Unexpected Message Received" when in reality the connection timed out.
 -- Prevent slurmctld from starting if there is not a json serializer present and the extra_constraints feature is enabled.
 -- Fix heterogeneous job components not being signaled with scancel --ctld and 'DELETE /slurm/v0.0.40/jobs' if the job ids are not explicitly given, the heterogeneous job components match the given filters, and the heterogeneous job leader does not match the given filters.
 -- Fix regression from 23.02 impeding job licenses from being cleared.
 -- Move the _get_joules_task error to a log_flag so it is no longer logged to the user when too many RPCs are queued in slurmd for gathering energy.
 -- For scancel --ctld and the associated rest api endpoints:
    'DELETE /slurm/v0.0.40/jobs'
    'DELETE /slurm/v0.0.41/jobs'
    Fix canceling the final array task in a job array when the task is pending and all array tasks have been split into separate job records. Previously this task was not canceled.
 -- Fix power_save operation after recovering from a failed reconfigure.
 -- slurmctld - Skip removing the pidfile when running under systemd. In that situation it is never created in the first place.
 -- Fix issue where altering the flags on a Slurm account (UsersAreCoords) caused several limits on the account's association to be set to 0 in Slurm's internal cache.
 -- Fi
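For the AccountingStoreFlags=no_stdio entry above, a minimal, hedged slurm.conf sketch (assuming job accounting is already configured via slurmdbd; the flag list is typically comma-separated, so append no_stdio to any flags you already set, and check the slurm.conf man page for your version):

    # slurm.conf (illustrative fragment, not a complete configuration)
    # Do not record each job's stdin/stdout/stderr paths in accounting:
    AccountingStoreFlags=no_stdio

As with most slurm.conf changes, an 'scontrol reconfigure' or a slurmctld restart is generally needed for the new setting to take effect.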
[slurm-users] Slurm versions 24.05.2, 23.11.9, and 23.02.8 are now available (security fix for switch plugins)
Slurm versions 24.05.2, 23.11.9, and 23.02.8 are now available and include a fix for a recently discovered security issue with the switch plugins.

SchedMD customers were informed on July 17th and provided a patch on request; this process is documented in our security policy. [1]

For the switch/hpe_slingshot and switch/nvidia_imex plugins, a user could override the isolation between Slingshot VNIs or IMEX channels. If you do not have one of these switch plugins configured, then you are not impacted by this issue.

It is unclear what, if any, information could be accessed with access to an unauthorized channel. This disclosure is being made out of an abundance of caution.

If you do have one of these plugins enabled, the slurmctld must be restarted before the slurmd daemons to avoid disruption. (A minimal restart-order sketch follows after the changelog below.)

Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security-policy/

-- 
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

* Changes in Slurm 24.05.2
==========================
 -- Fix energy gathering rpc counter underflow in _rpc_acct_gather_energy when more than 10 threads try to get energy at the same time. This prevented any step from getting energy readings from slurmd until slurmd was restarted, losing energy accounting metrics on the node.
 -- accounting_storage/mysql - Fix issue where new user with wckey did not have a default wckey sent to the slurmctld.
 -- slurmrestd - Prevent slurmrestd segfault when handling the following endpoints when none of the optional parameters are specified:
    'DELETE /slurm/v0.0.40/jobs'
    'DELETE /slurm/v0.0.41/jobs'
    'GET /slurm/v0.0.40/shares'
    'GET /slurm/v0.0.41/shares'
    'GET /slurmdb/v0.0.40/instance'
    'GET /slurmdb/v0.0.41/instance'
    'GET /slurmdb/v0.0.40/instances'
    'GET /slurmdb/v0.0.41/instances'
    'POST /slurm/v0.0.40/job/{job_id}'
    'POST /slurm/v0.0.41/job/{job_id}'
 -- Fix IPMI energy gathering when no IPMIPowerSensors are specified in acct_gather.conf. This situation resulted in an accounted energy of 0 for job steps.
 -- Fix a minor memory leak in slurmctld when updating a job dependency.
 -- scontrol,squeue - Fix regression that caused incorrect values for multisocket nodes at '.jobs[].job_resources.nodes.allocation' for 'scontrol show jobs --(json|yaml)' and 'squeue --(json|yaml)'.
 -- slurmrestd - Fix regression that caused incorrect values for multisocket nodes at '.jobs[].job_resources.nodes.allocation' to be dumped with endpoints:
    'GET /slurm/v0.0.41/job/{job_id}'
    'GET /slurm/v0.0.41/jobs'
 -- jobcomp/filetxt - Fix truncation of job record lines > 1024 characters.
 -- Fixed regression that prevented compilation on FreeBSD hosts.
 -- switch/hpe_slingshot - Drain node on failure to delete CXI services.
 -- Fix a performance regression from 23.11.0 in cpu frequency handling when no CpuFreqDef is defined.
 -- Fix one-task-per-sharing not working across multiple nodes.
 -- Fix inconsistent number of cpus when creating a reservation using the TRESPerNode option.
 -- data_parser/v0.0.40+ - Fix job state parsing which could break filtering.
 -- Prevent cpus-per-task from being modified in jobs where a -c value has been explicitly specified and the requested memory constraints implicitly increase the number of CPUs to allocate.
 -- slurmrestd - Fix regression where args '-s v0.0.39,dbv0.0.39' and '-d v0.0.39' would result in 'GET /openapi/v3' not registering as a valid possible query resulting in 404 errors.
 -- slurmrestd - Fix memory leak for dbv0.0.39 jobs query which occurred if the query parameters specified account, association, cluster, constraints, format, groups, job_name, partition, qos, reason, reservation, state, users, or wckey. This affects the following endpoints:
    'GET /slurmdb/v0.0.39/jobs'
 -- slurmrestd - In the case where the slurmdbd does not respond to a persistent connection init message, prevent the closed fd from being used, and instead emit an error or warning depending on whether the connection was required.
 -- Fix 24.05.0 regression that caused the slurmdbd not to send back an error message if there is an error initializing a persistent connection.
 -- Reduce latency of forwarded x11 packets.
 -- Add "curr_dependency" (representing the current dependency of the job) and "orig_dependency" (representing the original requested dependency of the job) fields to the job record in job_submit.lua (for job update) and jobcomp.lua.
 -- Fix potential segfault of slurmctld configured with SlurmctldParameters=enable_rpc_queue from happening on reconfigure.
 -- Fix potential segfault of slurmctld on its shutdown when rate limiting is enabled.
 -- slurmrestd - Fix missing job environment for SLURM_JOB_NAME, SLURM_OPEN_MODE, SLURM_JOB_DEPENDENCY,
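Regarding the restart-order note above: on a site where the daemons are managed by systemd, a minimal sketch of applying the fix could look like the following. The host name and the systemd assumption are illustrative only; adapt to however your site deploys Slurm.

    # Controller first (hypothetical host name "ctld-host"):
    ssh ctld-host 'systemctl restart slurmctld'

    # Then the compute nodes; this loop derives the node list from sinfo:
    for node in $(sinfo -h -N -o '%N' | sort -u); do
        ssh "$node" 'systemctl restart slurmd'
    done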
[slurm-users] Slurm version 24.05.4 is now available (CVE-2024-48936)
Slurm version 24.05.4 is now available and includes a fix for a recently discovered security issue with the new stepmgr subsystem.

SchedMD customers were informed on October 9th and provided a patch on request; this process is documented in our security policy. [1]

A mistake in authentication handling in stepmgr could permit an attacker to execute processes under other users' jobs. This is limited to jobs explicitly running with --stepmgr, or on systems that have globally enabled stepmgr through "SlurmctldParameters=enable_stepmgr" in their configuration. CVE-2024-48936. (A quick way to check the global setting is sketched after the changelog below.)

Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security-policy/

-- 
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

* Changes in Slurm 24.05.4
==========================
 -- Fix generic int sort functions.
 -- Fix user look up using possible unrealized uid in the dbd.
 -- Fix FreeBSD compile issue with tls/none plugin.
 -- slurmrestd - Fix regressions that allowed slurmrestd to be run as SlurmUser when SlurmUser was not root.
 -- mpi/pmix - Fix race conditions with het jobs at step start/end which could make srun hang.
 -- Fix not showing some SelectTypeParameters in scontrol show config.
 -- Avoid assert when dumping certain removed fields in JSON/YAML.
 -- Improve how shards are scheduled with affinity in mind.
 -- Fix MaxJobsAccruePU not being respected when MaxJobsAccruePA is set in the same QOS.
 -- Prevent backfill from planning jobs that use overlapping resources for the same time slot if the job's time limit is less than bf_resolution.
 -- Fix memory leak when requesting typed gres and --[cpus|mem]-per-gpu.
 -- Prevent backfill from breaking out due to "system state changed" every 30 seconds if reservations use REPLACE or REPLACE_DOWN flags.
 -- slurmrestd - Make sure that scheduler_unset parameter defaults to true even when the following flags are also set: show_duplicates, skip_steps, disable_truncate_usage_time, run_away_jobs, whole_hetjob, disable_whole_hetjob, disable_wait_for_result, usage_time_as_submit_time, show_batch_script, and/or show_job_environment. Additionally, always make sure show_duplicates and disable_truncate_usage_time default to true when the following flags are also set: scheduler_unset, scheduled_on_submit, scheduled_by_main, scheduled_by_backfill, and/or job_started. This affects the following endpoints:
    'GET /slurmdb/v0.0.40/jobs'
    'GET /slurmdb/v0.0.41/jobs'
 -- Ignore --json and --yaml options for scontrol show config to prevent mixing output types.
 -- Fix not considering nodes in reservations with Maintenance or Overlap flags when creating new reservations with nodecnt or when they replace down nodes.
 -- Fix suspending/resuming steps running under a 23.02 slurmstepd process.
 -- Fix options like sprio --me and squeue --me for users with a uid greater than 2147483647.
 -- fatal() if BlockSizes=0. This value is invalid and would otherwise cause the slurmctld to crash.
 -- sacctmgr - Fix issue where clearing out a preemption list using preempt='' would cause the given qos to no longer be preempt-able until set again.
 -- Fix stepmgr creating job steps concurrently.
 -- data_parser/v0.0.40 - Avoid dumping "Infinity" for NO_VAL tagged "number" fields.
 -- data_parser/v0.0.41 - Avoid dumping "Infinity" for NO_VAL tagged "number" fields.
 -- slurmctld - Fix a potential leak while updating a reservation.
 -- slurmctld - Fix state save with reservation flags when an update fails.
 -- Fix reservation update issues with parameters Accounts and Users, when using +/- signs.
 -- slurmrestd - Don't dump warning on empty wckeys in:
    'GET /slurmdb/v0.0.40/config'
    'GET /slurmdb/v0.0.41/config'
 -- Fix slurmd possibly leaving zombie processes on start up in configless when the initial attempt to fetch the config fails.
 -- Fix crash when trying to drain a non-existing node (possibly deleted before).
 -- slurmctld - Fix segfault when calculating limit decay for jobs with an invalid association.
 -- Fix IPMI energy gathering with multiple sensors.
 -- data_parser/v0.0.39 - Remove xassert requiring errors and warnings to have a source string.
 -- slurmrestd - Prevent potential segfault when there is an error parsing an array field which could lead to a double xfree. This applies to several endpoints in data_parser v0.0.39, v0.0.40 and v0.0.41.
 -- scancel - Fix a regression from 23.11.6 where using both the --ctld and --sibling options would cancel the federated job on all clusters instead of only the cluster(s) specified by --sibling.
 -- accounting_storage/mysql - Fix bug when removing an association specified with an empty partition.
 -- Fix setting multiple partition state restore on a job correctly.
 -- Fix difference in behavior when s
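As a quick, hedged check for the exposure described above (assuming a standard installation where scontrol is in PATH), you can inspect the controller configuration; jobs that individually request --stepmgr would still need to be audited separately:

    # If "enable_stepmgr" appears in the SlurmctldParameters line,
    # stepmgr is enabled globally on this cluster:
    scontrol show config | grep -i 'SlurmctldParameters'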
[slurm-users] Slurm version 24.11 is now available
We are pleased to announce the availability of the Slurm 24.11 release.

To highlight some new features in 24.11:

- New gpu/nvidia plugin. This does not rely on any NVIDIA libraries, and will build by default on all systems. It supports basic GPU detection and management, but cannot currently identify GPU-to-GPU links, or provide usage data as these are not exposed by the kernel driver.
- Add autodetected GPUs to the output from "slurmd -C".
- Added new QOS-based reports to "sreport".
- Revamped network I/O with the "conmgr" thread-pool model.
- Added new "hostlist function" syntax for management commands and configuration files.
- switch/hpe_slingshot - Added support for hardware collectives setup through the fabric manager. (Requires SlurmctldParameters=enable_stepmgr)
- Added SchedulerParameters=bf_allow_magnetic_slot configuration option to allow backfill planning for magnetic reservations.
- Added new "scontrol listjobs" and "liststeps" commands to complement "listpids", and provide --json/--yaml output for all three subcommands. (A brief usage sketch follows below.)
- Allow jobs to be submitted against multiple QOSes.
- Added new experimental "oracle" backfill scheduling support, which permits jobs to be delayed if the oracle function determines the reduced fragmentation of the network topology is sufficiently advantageous.
- Improved responsiveness of the controller when jobs are requeued by replacing the "db_index" identifier with a slurmctld-generated unique identifier. ("SLUID")
- New options to job_container/tmpfs to permit site-specific scripts to modify the namespace before user steps are launched, and to ensure all steps are completely captured within that namespace.

The Slurm documentation has also been updated to the 24.11 release. (Older versions can be found in the archive, linked from the main documentation page.)

Slurm can be downloaded from https://www.schedmd.com/download-slurm/ .

- Tim

-- 
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
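For the new listjobs/liststeps subcommands highlighted above, a minimal usage sketch (assuming a 24.11 slurmctld; exact output fields may vary by site and version):

    # List jobs known to the controller, as plain text or structured output:
    scontrol listjobs
    scontrol listjobs --json

    # The same --json/--yaml output is available for steps and the existing listpids:
    scontrol liststeps --yaml
    scontrol listpids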
[slurm-users] Re: Slurm version 24.11.1 is now available
https://github.com/SchedMD/slurm/blob/slurm-24.11/NEWS is current.

We've changed how the release branches are managed, which means that the changes for each maintenance release aren't reflected in the master branch version of that file. The release-branch-specific NEWS is being updated for the existing stable releases as each new maintenance release is tagged. (It's now generated from the Changelog: commit trailers, instead of directly changed as commits are pushed.)

There will likely be further changes to NEWS and RELEASE_NOTES for 25.05 when released this spring, but we haven't settled on exactly what that will look like yet.

- Tim

On 1/24/25 01:01, Ole Holm Nielsen via slurm-users wrote:

Hi Marshall,

Could you update the NEWS file?
https://github.com/SchedMD/slurm/blob/master/NEWS

Thanks,
Ole

-- 
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Re: Unable to receive password reminder
Apologies for the confusion - we've fixed some internal mail routing for the list admin accounts and shouldn't miss requests like what you'd sent again. You've been removed.

For anyone in a similar situation, the footer on all list messages mentions:

"To unsubscribe send an email to slurm-users-le...@lists.schedmd.com"

You do need to send that request from the address you're receiving mailing list traffic on, and will need to confirm the request by replying to an auto-generated response once. (This is to prevent someone forging your address and silently unsubscribing you.)

If you want to directly manage your subscriptions, you can create an account on https://lists.schedmd.com/ matching your subscribed address, and subscribe or unsubscribe from there. Unfortunately this is a bit more involved, but was an unavoidable change when we needed to migrate to Mailman 3.

- Tim

On 1/14/25 06:17, Loris Bennett via slurm-users wrote:

Hi,

Over a week ago I sent the message below to the address I found for the list owner, but have not received a response. Does anyone know how to proceed in this case?

Cheers,
Loris

Start of forwarded message
From: Loris Bennett
To:
Subject: Unable to receive password reminder
Date: Mon, 6 Jan 2025 08:35:42 +0100

Dear list owner,

I have recently switched from reading the list via mail to using the mail-to-news gateway at news.gmane.io. Therefore I would like to change my mailman settings in order to stop delivery of postings via mail.

As I have forgotten my list password, I requested a reminder. However, I get the reply that no user with the given email address was found in the user database. The addresses I tried were

loris.benn...@fu-berlin.de
lo...@zedat.fu-berlin.de

the former being an alias for the latter. This is the email account to which emails from the list are sent, so I am somewhat confused as to why neither of the addresses is recognised.

Could you please help me to resolve this issue?

Regards
Loris Bennett

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Slurm versions 24.11.5, 24.05.8, and 23.11.11 are now available (CVE-2025-43904)
Slurm versions 24.11.5, 24.05.8, and 23.11.11 are now available and include a fix for a recently discovered security issue.

SchedMD customers were informed on April 23rd and provided a patch on request; this process is documented in our security policy. [1]

A mistake with permission handling for Coordinators within Slurm's accounting system can allow a Coordinator to promote a user to Administrator. (CVE-2025-43904)

Thank you to Sekou Diakite (HPE) for reporting this.

Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security-policy/

-- 
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

* Changes in Slurm 24.11.5
==========================
 -- Return error to scontrol reboot on bad nodelists.
 -- slurmrestd - Report an error when QOS resolution fails for v0.0.40 endpoints.
 -- slurmrestd - Report an error when QOS resolution fails for v0.0.41 endpoints.
 -- slurmrestd - Report an error when QOS resolution fails for v0.0.42 endpoints.
 -- data_parser/v0.0.42 - Added +inline_enums flag which modifies the output when generating OpenAPI specification. It causes enum arrays to not be defined in their own schema with references ($ref) to them. Instead they will be dumped inline.
 -- Fix binding error with tres-bind map/mask on partial node allocations.
 -- Fix stepmgr enabled steps being able to request features.
 -- Reject step creation if requested feature is not available in job.
 -- slurmd - Restrict listening for new incoming RPC requests further into startup.
 -- slurmd - Avoid auth/slurm related hangs of CLI commands during startup and shutdown.
 -- slurmctld - Restrict processing new incoming RPC requests further into startup. Stop processing requests sooner during shutdown.
 -- slurmctld - Avoid auth/slurm related hangs of CLI commands during startup and shutdown.
 -- slurmctld - Avoid race condition during shutdown or reconfigure that could result in a crash due to delayed processing of a connection while plugins are unloaded.
 -- Fix small memleak when getting the job list from the database.
 -- Fix incorrect printing of % escape characters when printing stdio fields for jobs.
 -- Fix padding parsing when printing stdio fields for jobs.
 -- Fix printing %A array job id when expanding patterns.
 -- Fix reservations causing jobs to be held for Bad Constraints.
 -- switch/hpe_slingshot - Prevent potential segfault on failed curl request to the fabric manager.
 -- Fix printing incorrect array job id when expanding stdio file names. The %A will now be substituted by the correct value.
 -- switch/hpe_slingshot - Fix vni range not updating on slurmctld restart or reconfigure.
 -- Fix steps not being created when using certain combinations of -c and -n lower than the job's requested resources, when using stepmgr and nodes are configured with CPUs == Sockets*CoresPerSocket.
 -- Permit configuring the number of retry attempts to destroy CXI service via the new destroy_retries SwitchParameter.
 -- Do not reset memory.high and memory.swap.max in slurmd startup or reconfigure, as slurmd never actually touches these.
 -- Fix reconfigure failure of slurmd when it has been started manually and the CoreSpecLimits have been removed from slurm.conf.
 -- Set or reset CoreSpec limits when slurmd is reconfigured and it was started with systemd.
 -- switch/hpe_slingshot - Make sure the slurmctld can free step VNIs after the controller restarts or reconfigures while the job is running.
 -- Fix backup slurmctld failure on 2nd takeover.
 -- Testsuite - Fix python test 130_2.
 -- Fix security issue where a coordinator could add a user with elevated privileges. CVE-2025-43904.

* Changes in Slurm 24.05.8
==========================
 -- Testsuite - Fix python test 130_2.
 -- Fix security issue where a coordinator could add a user with elevated privileges. CVE-2025-43904.

* Changes in Slurm 23.11.11
===========================
 -- Fixed a job requeuing issue that merged job entries into the same SLUID when all nodes in a job failed simultaneously.
 -- Add ABORT_ON_FATAL environment variable to capture a backtrace from any fatal() message.
 -- Testsuite - Fix python test 130_2.
 -- Fix security issue where a coordinator could add a user with elevated privileges. CVE-2025-43904.

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
[slurm-users] Slurm version 25.05 is now available
We are pleased to announce the availability of Slurm 25.05.

The release notes summarizing the new features, and including links to the corresponding documentation, can be found at:
https://slurm.schedmd.com/release_notes.html

A more extensive list of changes is available in the CHANGELOG:
https://github.com/SchedMD/slurm/blob/slurm-25.05/CHANGELOG/slurm-25.05.md

The Slurm documentation has also been updated to the 25.05 release:
https://slurm.schedmd.com

Slurm can be downloaded from:
https://www.schedmd.com/download-slurm/

-- 
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com