Hi Jean-Christophe,

Jean-Christophe HAESSIG <haess...@igbmc.fr> writes:

> Hi,
>
> I recently noticed impossible usage values returned by sreport, my
> cluster was reportedly used at 100%.
>
> Upon further investigation, I found about 6000 jobs launched on
> 2020-08-31 that were 'COMPLETED' but had their CPUTime still increasing,
> amounting to about 500 days. The root cause for this seems to be a
> failure of compute nodes that were decommissioned afterwards.
>
> To troubleshoot, I connected to the accounting database and found that
> the time_end column of the cluster_job_table table was 0 for these jobs.
> I replaced it by a meaningful value, which fixed things for sacct but
> does only have an impact on sreport queries for recent dates.
>
> It seems that sreport takes its data from *_assoc_usage_*_table and I do
> not know how it relates to the jobs table. Is there a way to fix the data?

Run

  sacctmgr show runawayjobs

If any are found you should be offered the option of fixing them.

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de
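
P.S. To illustrate why a time_end of 0 makes the CPU time keep growing, here is a minimal Python sketch. It is not Slurm code: the JobRow class and both helper functions are hypothetical, and it assumes that an end time of 0 is treated as "still running", so the elapsed time is measured up to the current time. Only the time_end column name comes from the description above; everything else is made up for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobRow:
    """Hypothetical, simplified stand-in for one row of the job table."""
    job_id: int
    cpus: int
    time_start: int  # Unix seconds; 0 means the job never started
    time_end: int    # Unix seconds; 0 means no end time was recorded


def apparent_cpu_seconds(job: JobRow, now: int) -> int:
    """CPU seconds, assuming a missing end time is read as 'still running'."""
    end = job.time_end if job.time_end > 0 else now
    return max(0, end - job.time_start) * job.cpus


def find_suspects(jobs: List[JobRow]) -> List[JobRow]:
    """Return jobs that started but have no recorded end time."""
    return [j for j in jobs if j.time_start > 0 and j.time_end == 0]


if __name__ == "__main__":
    now = 1_000 + 86_400  # one day after the jobs started
    jobs = [
        JobRow(job_id=1, cpus=2, time_start=1_000, time_end=4_600),  # ended normally
        JobRow(job_id=2, cpus=4, time_start=1_000, time_end=0),      # no end time
    ]
    for job in find_suspects(jobs):
        print(job.job_id, apparent_cpu_seconds(job, now))
```

Under that assumption, the second job's CPU time grows with every second that passes, which would match the ever-increasing CPUTime you saw for jobs that had in fact finished long ago.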