Re: [slurm-users] sreport outputs invalid values due to corrupted data

Loris Bennett Wed, 09 Mar 2022 05:48:59 -0800

Hi Jean-Christophe,

Jean-Christophe HAESSIG <haess...@igbmc.fr> writes:


> Hi,
>
> I recently noticed impossible usage values returned by sreport: my
> cluster was reportedly used at 100%.
>
> Upon further investigation, I found about 6000 jobs launched on 
> 2020-08-31 that were 'COMPLETED' but had their CPUTime still increasing, 
> amounting to about 500 days. The root cause for this seems to be a 
> failure of compute nodes that were decommissioned afterwards.
>
> To troubleshoot, I connected to the accounting database and found that 
> the time_end column of the cluster_job_table table was 0 for these jobs. 
> I replaced it with a meaningful value, which fixed things for sacct but
> only affects sreport queries for recent dates.
>
> It seems that sreport takes its data from *_assoc_usage_*_table and I do
> not know how it relates to the jobs table. Is there a way to fix the data?

Run

  sacctmgr show runawayjobs

If any are found you should be offered the option of fixing them.
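
As a rough illustration of why such jobs keep accumulating CPU time, here is a minimal Python sketch. It assumes that a time_end of 0 is treated as "no end recorded" and that the current time is used in its place; the function name and the sample one-CPU job are hypothetical and not taken from Slurm itself.

```python
from datetime import datetime, timezone

DAY = 86400  # seconds per day


def cpu_time_seconds(time_start, time_end, cpus, now):
    """CPU time for one job record, in seconds.

    Assumption: time_end == 0 means no end time was recorded,
    so the current time is used instead.
    """
    end = time_end if time_end else now
    return max(0, end - time_start) * cpus


# Hypothetical one-CPU job started on 2020-08-31 that really ran for one hour.
start = int(datetime(2020, 8, 31, tzinfo=timezone.utc).timestamp())

# With a proper end time the result is fixed, whenever it is queried.
fixed = cpu_time_seconds(start, start + 3600, 1, now=start + 500 * DAY)

# With time_end == 0 the result depends on when it is queried.
runaway_then = cpu_time_seconds(start, 0, 1, now=start + 500 * DAY)
runaway_later = cpu_time_seconds(start, 0, 1, now=start + 501 * DAY)
```

Under these assumptions the record with a missing end time gains a full day of CPU time for every day that passes, which matches the behaviour you describe.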

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
