On 7/29/22 11:59, mshubham wrote:
Dear All,
I am facing an issue in Slurm (20.11.8): sreport cluster utilization
reports 100%, and when I run sreport cluster userutilizationbyaccount,
some users' utilisation is greater than 100%. Three users, including
root, show utilisation over 250%, making the overall utilisation 500%
(though these users have not submitted any job in the past week).
It was showing some runaway jobs, which we cleared. The same runaway
jobs then showed up again, and we cleared them again (both manually and
through the command).
Is oversubscription enabled?
https://slurm.schedmd.com/sreport.html#SECTION_REPORT-TYPES
Do you get similar results with sacct?
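If it helps with that comparison, here is a rough sketch of how the
sacct numbers could be totalled per user and turned into a percentage
of cluster capacity. It assumes pipe-separated input such as
`sacct -a -X -n -P -T -S <start> -E <end> --format=User,CPUTimeRAW`
would produce; the sample rows, user names, CPU count and window length
below are made up for illustration and are not from your cluster.

```python
from collections import defaultdict


def cpu_seconds_by_user(sacct_output):
    """Sum CPUTimeRAW (CPU-seconds) per user from 'User|CPUTimeRAW' lines."""
    totals = defaultdict(int)
    for line in sacct_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, cpu_raw = line.split("|")
        # An empty user field can appear for records without a user name.
        totals[user or "(unknown)"] += int(cpu_raw)
    return dict(totals)


def percent_of_capacity(cpu_seconds, total_cpus, window_seconds):
    """Share of the cluster's CPU-seconds in the window used, in percent."""
    return 100.0 * cpu_seconds / (total_cpus * window_seconds)


# Hypothetical sample: a 4-CPU cluster observed over a one-hour window.
sample = """alice|3600
bob|7200
alice|1800
"""

totals = cpu_seconds_by_user(sample)
for user, seconds in sorted(totals.items()):
    pct = percent_of_capacity(seconds, total_cpus=4, window_seconds=3600)
    print(f"{user}: {seconds} CPU-seconds, {pct:.1f}% of capacity")
```

If any single user comes out above 100% here as well, the extra time is
probably coming from job records in the database (for example jobs with
no end time) rather than from sreport itself.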
Before that, we had encountered another issue. In our cluster, which
has a primary and a backup Slurm controller, we kept the
"StateSaveLocation" (/var/share/slurm/ctld) on a common mount point. We
then observed a strange behaviour: if the mount point is present and
the service is restarted on the primary controller, it replaces all
the StateSaveLocation files.
This resulted in the cancellation of all jobs (running and pending) and
reservations, and JobIDs for newly submitted jobs started again from 1.
If the StateSaveLocation is kept on a local file system instead of the
shared mount point, everything works fine even after restarting the
slurmctld service.
Since that issue, the reported utilisation has been higher than
expected, though it has not impacted any real job utilisation.
Also, we have removed those users' accounts in Slurm, yet it is still
showing their utilisation.
The database should keep previous utilization records.
Please help in resolving this issue.
Thanks and Regards,
Shubham Mehta
HPC Technology
CDAC Pune