[slurm-users] Re: Implementing a "soft" wall clock limit

2025-06-12 Thread Davide DelVento via slurm-users
Sounds good, thanks for confirming it. Let me sleep on the "too many" QOS concern, or think about whether I should ditch this idea. If I implement it, I'll post the details of how I did it in this conversation. Cheers. On Thu, Jun 12, 2025 at 6:59 AM Ansgar Esztermann-Kirchner <aesz...@mpinat.mpg.de> wrote: > …
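A minimal sketch of what the per-tier QOS setup could look like with sacctmgr; the tier names and times below are invented for illustration, and PreemptExemptTime is a documented QOS option:

    # Hypothetical soft-limit tiers -- names and times are placeholders
    sacctmgr add qos soft2h
    sacctmgr modify qos soft2h set PreemptExemptTime=02:00:00
    sacctmgr add qos soft8h
    sacctmgr modify qos soft8h set PreemptExemptTime=08:00:00

Users (or a submit plugin, see the Lua sketch elsewhere in this thread) would then pick the tier whose exempt time matches the job's intended soft limit.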

[slurm-users] MIG H100 with xeon Intel

2025-06-12 Thread Richard Lefebvre via slurm-users
I'm having problems with Autodetect=nvml in gres.conf. I get the following in the controller log: error: _check_core_range_matches_sock: gres/gpu GRES autodetected core affinity 16-31 on node node001 doesn't match socket boundaries. (Socket 0 is cores 0-31). Consider setting SlurmdParameters=l3ca…
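The error text is cut off above; the parameter it points at is presumably SlurmdParameters=l3cache_as_socket, which tells slurmd to treat each L3 cache as a socket so that autodetected GPU core affinity can line up with "socket" boundaries. A sketch of the relevant slurm.conf line, assuming that reading is right:

    # slurm.conf on the affected node(s) -- assuming the truncated hint is l3cache_as_socket
    SlurmdParameters=l3cache_as_socket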

[slurm-users] Re: Implementing a "soft" wall clock limit

2025-06-12 Thread Ansgar Esztermann-Kirchner via slurm-users
On Thu, Jun 12, 2025 at 04:52:24AM -0600, Davide DelVento wrote:
> Hi Ansgar,
>
> This is indeed what I was looking for: I was not aware of PreemptExemptTime.
>
> From my cursory glance at the documentation, it seems that
> PreemptExemptTime is QOS-based and not job-based though. Is that correct? …
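To see which QOSes already carry an exempt time, a plain sacctmgr query works; the format fields below are standard sacctmgr QOS fields:

    # List QOSes with their preemption settings and exempt times
    sacctmgr show qos format=Name,Preempt,PreemptMode,PreemptExemptTime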

[slurm-users] How are the results produced by 'seff'?

2025-06-12 Thread Loris Bennett via slurm-users
Hi, with Slurm 24.11.5, for some jobs I am seeing differences between the memory usage reported by 'seff' and that shown by Prometheus as 'cgroup_memory_rss_bytes' (and ultimately reported by 'jobstats' [1]). Certainly at the University of Delft they seem to feel that the memory usage reported by '…
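'seff' reads its numbers from the job's accounting records rather than from live cgroup counters, so a raw sacct query over the same job is a useful cross-check; the format fields below are standard sacct field names:

    # Compare accounting-side memory figures with Prometheus' cgroup_memory_rss_bytes
    sacct -j <jobid> --units=M \
          --format=JobID,Elapsed,TotalCPU,AllocCPUS,ReqMem,MaxRSS,TRESUsageInTot%60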

[slurm-users] Re: Implementing a "soft" wall clock limit

2025-06-12 Thread Davide DelVento via slurm-users
Hi Ansgar, This is indeed what I was looking for: I was not aware of PreemptExemptTime. From my cursory glance at the documentation, it seems that PreemptExemptTime is QOS-based and not job-based though. Is that correct? Or could it be set per-job, perhaps in a prolog/submit Lua script? I'm thinking …
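As far as the documentation goes, PreemptExemptTime is not a per-job field, but a job_submit Lua plugin can route each job to one of a small set of QOSes carrying different exempt times. A hypothetical sketch, assuming QOS names like those in the sacctmgr example elsewhere in this thread already exist (job_desc.time_limit is in minutes):

    -- job_submit.lua -- hypothetical routing of jobs to soft-limit QOS tiers
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.qos == nil then
            -- short jobs get the short soft limit; everything else the long one
            if job_desc.time_limit ~= slurm.NO_VAL and job_desc.time_limit <= 120 then
                job_desc.qos = "soft2h"
            else
                job_desc.qos = "soft8h"
            end
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end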

[slurm-users] Re: Implementing a "soft" wall clock limit

2025-06-12 Thread Ansgar Esztermann-Kirchner via slurm-users
Hi Davide, I think it should be possible to emulate this via preemption: if you set PreemptMode to CANCEL, a preempted job will behave just as if it had reached the end of its wall time. Then, you can use PreemptExemptTime as your soft wall time limit -- the job will not be preempted before PreemptExemptTime …
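A minimal slurm.conf sketch of this emulation, assuming QOS-based preemption is the trigger; the values are placeholders:

    # slurm.conf -- soft wall clock via preemption (illustrative values)
    PreemptType=preempt/qos      # preemption eligibility decided by QOS priority
    PreemptMode=CANCEL           # preempted jobs are cancelled, as if wall time expired
    PreemptExemptTime=04:00:00   # global minimum run time before preemption: the "soft" limit

PreemptExemptTime can also be set per partition or per QOS, which is what the per-tier discussion elsewhere in this thread is about.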