On 7/15/24 10:43, William VINCENT via slurm-users wrote:
I am writing to report an issue with the Slurmctld process on our RHEL 9
(Rocky Linux) system.
Twice in the past 5 days, the Slurmctld process has encountered an error
that resulted in the service stopping. The error message displayed was
"double free or corruption (out)". This error has caused significant
disruption to our jobs, and we are concerned about its recurrence.
We have tried troubleshooting the issue, but we have not been able to
identify the root cause of the problem. We would appreciate any assistance
or guidance you can provide to help us resolve this issue.
Please let us know if you need any additional information or if there are
any specific steps we should take to diagnose the problem further.
You're running Slurm 22.05.9 on RockyLinux 9 (is that Rocky 9.4 or what?).
Such an old Slurm version probably hasn't been tested much on EL9 systems.
For security reasons you ought to upgrade to a recent Slurm version, just
search for "CVE" in https://github.com/SchedMD/slurm/blob/master/NEWS to
find out about security holes in older versions.
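If you would rather scan a downloaded copy of the NEWS file than search it in the browser, a minimal Python sketch could look like the following. The function name, the sample text and the CVE identifier in it are hypothetical placeholders, not real NEWS entries; the only assumption about the file is that security fixes are mentioned with the usual "CVE-YYYY-NNNN" style identifier.

```python
import re

# Matches the usual CVE identifier form, e.g. "CVE-2023-12345".
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d+")


def find_cve_entries(news_text):
    """Return (line_number, stripped_line) pairs for lines mentioning a CVE."""
    hits = []
    for number, line in enumerate(news_text.splitlines(), start=1):
        if CVE_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical sample text, NOT copied from the real NEWS file.
SAMPLE_NEWS = """\
* Changes in Slurm X.Y.Z
 -- Fix a hypothetical scheduler bug.
 -- CVE-0000-00000 - hypothetical placeholder entry.
"""

if __name__ == "__main__":
    # For a real run, read a local copy of NEWS instead of SAMPLE_NEWS.
    for number, line in find_cve_entries(SAMPLE_NEWS):
        print(f"{number}: {line}")
```

Reading the changelog entries around each hit shows which release the fix landed in.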
You can upgrade by 2 major releases in a single step, so you can go to
23.11.8. Upgrading Slurm is fairly easy, and I've collected various
pieces of advice in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm
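To make the "2 major releases in a single step" rule concrete, here is a small Python sketch that checks whether a direct upgrade stays within that limit. The release list is hard-coded from memory and is an assumption you should verify against the official release notes; the helper names are hypothetical.

```python
# Assumed ordering of Slurm major releases; verify before relying on it.
MAJOR_RELEASES = ["20.11", "21.08", "22.05", "23.02", "23.11", "24.05"]


def major_of(version):
    """Reduce a full version such as '22.05.9' to its major release '22.05'."""
    return ".".join(version.split(".")[:2])


def can_upgrade_directly(current, target, max_jump=2):
    """True if target is at most max_jump major releases newer than current.

    Raises ValueError if either version is not in MAJOR_RELEASES.
    """
    start = MAJOR_RELEASES.index(major_of(current))
    end = MAJOR_RELEASES.index(major_of(target))
    return 0 <= end - start <= max_jump


if __name__ == "__main__":
    print(can_upgrade_directly("22.05.9", "23.11.8"))  # within two releases
    print(can_upgrade_directly("22.05.9", "24.05.0"))  # three releases ahead
```

With this list, going from 22.05.9 to 23.11.8 is a two-release jump (23.02, then 23.11), which is why it can be done in one step.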
Hopefully a newer Slurm version is going to solve your issue.
I hope this helps,
Ole