Sounds good, thanks for confirming it.
Let me sleep on the "too many QOSs" concern, or decide whether I should
ditch the idea.
If I implement it, I'll post the details of how I did it in this
conversation.
Cheers
On Thu, Jun 12, 2025 at 6:59 AM Ansgar Esztermann-Kirchner <aesz...@mpinat.mpg.de> wrote:
>
I'm having problems with AutoDetect=nvml in gres.conf.
In the controller log I get the following:
error: _check_core_range_matches_sock: gres/gpu GRES autodetected core
affinity 16-31 on node node001 doesn't match socket boundaries. (Socket 0
is cores 0-31). Consider setting SlurmdParameters=l3cache_as_socket
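For reference, here is roughly what I understand the two relevant pieces to be: the autodetect line in gres.conf and the SlurmdParameters workaround the error seems to be suggesting. Node name, device path and core range below are only illustrative, not copied from my actual config.

# gres.conf on the compute node: let slurmd discover the GPUs via NVML
AutoDetect=nvml

# slurm.conf: treat each L3 cache as a "socket" so the autodetected core
# affinity (16-31 in the error above) lines up with a socket boundary
SlurmdParameters=l3cache_as_socket

# Alternative: turn off autodetection for this node in gres.conf and pin
# the cores by hand instead (placeholder values):
# NodeName=node001 AutoDetect=off Name=gpu File=/dev/nvidia0 Cores=16-31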
On Thu, Jun 12, 2025 at 04:52:24AM -0600, Davide DelVento wrote:
> Hi Ansgar,
>
> This is indeed what I was looking for: I was not aware of PreemptExemptTime.
>
> From my cursory glance at the documentation, it seems
> that PreemptExemptTime is QOS-based and not job based though. Is that
> correct? Or could it be set per-job, perhaps on a prolog/submit lua script?
Hi,
With Slurm 24.11.5, for some jobs I am seeing differences between the
memory usage reported by 'seff' and that shown by Prometheus as
'cgroup_memory_rss_bytes' (and ultimately reported by 'jobstats' [1]).
Certainly at the University of Delft they seem to feel that the memory
usage reported by '
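For concreteness, this is the kind of comparison I am making. The job ID is just a placeholder, and my (possibly wrong) understanding is that seff derives its memory figure from the MaxRSS stored in the accounting database, while cgroup_memory_rss_bytes is scraped live from the job's cgroup.

# what seff reports for the job
seff 1234567

# the accounting-side fields seff works from
sacct -j 1234567 --format=JobID,State,Elapsed,ReqMem,MaxRSS,AveRSS

# versus the peak of the exporter metric (label names depend on the
# exporter; this PromQL is only an example):
# max_over_time(cgroup_memory_rss_bytes{jobid="1234567"}[7d])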
Hi Ansgar,
This is indeed what I was looking for: I was not aware of PreemptExemptTime.
From my cursory glance at the documentation, it seems
that PreemptExemptTime is QOS-based and not job based though. Is that
correct? Or could it be set per-job, perhaps on a prolog/submit lua script?
I'm thin
Hi Davide,
I think it should be possible to emulate this via preemption: if you
set PreemptMode to CANCEL, a preempted job will behave just as if it
reached the end of its wall time. Then, you can use PreemptExemptTime
as your soft wall time limit -- the job will not be preempted before
PreemptExemptTime has passed.
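In configuration terms, roughly something like the sketch below. This is untested, and the QOS name, times and job script are only examples, so please check the slurm.conf and sacctmgr man pages for your version.

# slurm.conf: QOS-driven preemption, cancelling rather than requeueing
PreemptType=preempt/qos
PreemptMode=CANCEL

# a QOS whose jobs become preemptable once the soft limit has passed
sacctmgr add qos name=soft4h
sacctmgr modify qos where name=soft4h set PreemptExemptTime=04:00:00

# allow the default QOS to preempt it
sacctmgr modify qos where name=normal set Preempt=soft4h

# per job: soft limit of 4 h via the QOS, hard limit of 24 h via --time
sbatch --qos=soft4h --time=24:00:00 job.sh

The sbatch line is also where the "per-job" aspect would come in: the soft limit is selected per job through its QOS, which could equally be assigned by a job_submit plugin.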