We run at about 50 PF and 1.5k nodes with about 100,000 jobs per day, and
we use 25.05.4, though we tend to upgrade to the latest available, so we
will be upgrading to 25.11.* soon (when the .1 release comes out). If you
are interested, I'm happy to share our slurm.conf.
At least in my experience the latest releases have been stable, though
you want to avoid .0 releases unless you want to be on the bleeding edge
or need a specific feature. Most of the kinks are worked out by .1, and
definitely by .2, of any major release. There may still be weird edge
cases, but in general it is stable.
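One general note since we roll forward often: the documented upgrade
order is slurmdbd first, then slurmctld, then slurmd, and the daemons
tolerate peers up to two major releases behind, so a rolling upgrade
looks roughly like this (sketch only; package names and service setup
depend on your distro):

    # 1. database daemon first
    systemctl stop slurmdbd
    # ... install the new Slurm packages here ...
    systemctl start slurmdbd
    # 2. then the controller(s)
    systemctl restart slurmctld
    # 3. finally slurmd on the compute nodes, rack by rack
    systemctl restart slurmd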
-Paul Edmon-
On 11/16/25 10:33 AM, John Hearns via slurm-users wrote:
I would take a step back and ask how you intend to install and manage
this cluster.
CPU only or GPUs?
OS?
Interconnect fabric?
Storage?
Power per rack? Cooling?
Monitoring?
On Sun, Nov 16, 2025, 2:39 PM KK via slurm-users
<[email protected]> wrote:
We are currently planning to deploy a new HPC system with a total
compute capacity exceeding 100 PF. As part of our preparation, we
would like to understand which Slurm versions are considered
stable and widely used at this scale.
Could you please share your recommendations or experience regarding:
1. Which Slurm version is currently running reliably on very
large-scale clusters (>100 PF or >10k nodes)?
2. Whether there are any versions we should avoid due to known
issues at large scale.
3. Any best practices or configuration considerations for Slurm
deployments of this size.
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]