[slurm-users] Slurm release candidate version 25.05.0rc1 is available for testing

2025-05-13 Thread Marshall Garey via slurm-users
We are pleased to announce the availability of Slurm release candidate 25.05.0rc1. To highlight some new features coming in 25.05:
- Support for defining multiple topology configurations, and varying them by partition.
- Support for tracking and allocating hierarchical resources.
- Dynamic no

[slurm-users] Re: Performance Issues after Update to 24.11.5

2025-05-13 Thread Tilman Hoffbauer via slurm-users
Actually, you were right. By setting LLMNR=no in /etc/systemd/resolved.conf on g-vm03, which turns off link-local multicast name resolution, we were able to speed up getent hosts ougaXX significantly, which solves the issue. Thanks! On 5/13/25 15:50, John Hearns via slurm-users wrote: I think
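A minimal sketch of that fix, assuming a stock systemd-resolved setup such as Rocky 8 (g-vm03 is the control node from this thread; the file path and restart command are standard systemd, not quoted from the message):

    # /etc/systemd/resolved.conf (on g-vm03)
    [Resolve]
    LLMNR=no

    # apply the change
    sudo systemctl restart systemd-resolved

With LLMNR enabled, systemd-resolved can fall back to link-local multicast queries that wait out their timeout before a lookup returns, which is consistent with getent hosts being slow while plain DNS queries were fast.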

[slurm-users] Re: Performance Issues after Update to 24.11.5

2025-05-13 Thread John Hearns via slurm-users
I think that looks OK. Forget my response. On Tue, 13 May 2025 at 14:09, Tilman Hoffbauer via slurm-users <slurm-users@lists.schedmd.com> wrote: Thank you for your response. nslookup on e.g. ouga20 is instant, getent hosts ouga20 takes about 1.6 seconds from g-vm03. It is about the same s

[slurm-users] Re: Performance Issues after Update to 24.11.5

2025-05-13 Thread Tilman Hoffbauer via slurm-users
Thank you for your response. nslookup on e.g. ouga20 is instant, getent hosts ouga20 takes about 1.6 seconds from g-vm03. It is about the same speed for ouga20 looking up g-vm03. Is this too slow? On 5/13/25 15:01, John Hearns wrote: Stupid response from me. A long time ago I had issues with
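For anyone reproducing the comparison above: nslookup talks to the DNS server directly, while getent hosts goes through the full NSS stack (and hence systemd-resolved), so timing both separates DNS itself from the local resolver path. Host names are the ones from this thread:

    time nslookup ouga20      # direct DNS query
    time getent hosts ouga20  # full NSS path, including systemd-resolved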

[slurm-users] Re: Performance Issues after Update to 24.11.5

2025-05-13 Thread John Hearns via slurm-users
Stupid response from me. A long time ago I had issues with slow response on PBS. The cause was name resolution. On your setup, is name resolution OK? Can you look up host names without delays? On Tue, 13 May 2025 at 13:50, Tilman Hoffbauer via slurm-users <slurm-users@lists.schedmd.com> wrote:

[slurm-users] Performance Issues after Update to 24.11.5

2025-05-13 Thread Tilman Hoffbauer via slurm-users
Hello, we are running a SLURM-managed cluster with one control node (g-vm03) and 26 worker nodes (ouga[03-28]) on Rocky 8. We recently updated from 20.11.9 through 23.02.8 to 24.11.0 and then 24.11.5. Since then, we are experiencing performance issues - squeue and scontrol ping are slow to re
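A few generic client-side checks that help narrow down this kind of slowdown (standard Slurm commands, not specific to this cluster):

    time scontrol ping   # round-trip to slurmctld on the control node
    time squeue          # full job-list RPC
    sdiag                # scheduler stats, including per-RPC counts and latencies

If scontrol ping alone is already slow, the problem usually sits in front of slurmctld (name resolution, network, munge) rather than in the scheduler itself.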

[slurm-users] Re: Do I have to hold back RAM for worker nodes?

2025-05-13 Thread Patrick Begou via slurm-users
Hi all, another point you may notice is the growing size of /dev/shm when MPI jobs do not exit properly. This also has a cost in system memory on small configurations, even though this storage is limited in size. I'm using a cron job to clean it periodically. I'm not the author, see https://docs.hpc
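The link above is truncated, so here is a generic sketch of the same idea rather than the referenced script: remove /dev/shm entries whose owners no longer have any process on the node. The ownership heuristic and the skip-root rule are assumptions; adapt before use:

    #!/bin/bash
    # Sketch: clean /dev/shm entries left behind by crashed MPI jobs.
    # Keeps anything owned by root or by a user who still has processes.
    for f in /dev/shm/*; do
        [ -e "$f" ] || continue                    # empty glob: nothing to do
        owner=$(stat -c %U "$f") || continue
        [ "$owner" = "root" ] && continue          # leave system-owned entries alone
        pgrep -u "$owner" >/dev/null && continue   # owner still active: keep
        rm -rf -- "$f"
    done

Run from root's crontab (e.g. hourly) on each worker node.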

[slurm-users] Re: Do I have to hold back RAM for worker nodes?

2025-05-13 Thread Xaver Stiensmeier via slurm-users
Thank you very much for the many experiences shared - especially for pointing out how RAM requirements may grow over time! Our instances can vary wildly from 2 GB (rather unreasonable for Slurm) to multiple TB of RAM, and given that we only provide resources and tools but do not manage the running
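On the hold-back-RAM question itself: the standard Slurm knob is the per-node MemSpecLimit in slurm.conf, which reserves memory for the OS and slurmd so jobs cannot allocate it. A sketch with illustrative node names and sizes (not from this thread); note that memory must be a consumed resource (e.g. SelectTypeParameters=CR_Core_Memory) for the limit to matter:

    # slurm.conf (names and sizes are made up)
    # RealMemory  = total MB on the node as seen by Slurm
    # MemSpecLimit = MB of that held back for the OS/slurmd
    NodeName=small01 RealMemory=2048    MemSpecLimit=512
    NodeName=big01   RealMemory=2097152 MemSpecLimit=16384

Jobs on a node can then be allocated at most RealMemory minus MemSpecLimit.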