Hi Jürgen,
On 4/13/21 6:29 PM, Juergen Salk wrote:
* Heckes, Frank <hec...@mps.mpg.de> [210413 12:04]:
This results from a management question: how long do jobs have to wait (in s, min, h,
days) before they get executed, and
how many jobs are waiting (queued) in each partition in a certain time
interval?
The first one is easy to find with sacct: take the submit and start times, compute the
difference, and average.
Hi Frank,
depending on the definition of "waiting time", the "reserved" field
from sacct may be more appropriate than "start" minus "submit". For
example, for dependency jobs (aka chain jobs) the latter also
counts the time a job had to wait for another job to finish,
whereas "reserved" will only start counting when a job becomes
eligible.
The slurmacct tool
(https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct)
calculates the waiting time as you recommend:
wait = start - eligible
I have seen jobs with eligible == "Unknown", in which case I use the submit
time as the best guess.
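
A minimal Python sketch of that calculation, for illustration only. It is not taken from slurmacct; the function names are hypothetical, and it assumes sacct prints timestamps in its default YYYY-MM-DDTHH:MM:SS format and uses a placeholder such as "Unknown" when a time is not set.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: sacct's default timestamp layout, e.g. 2021-04-13T18:29:00.
SACCT_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_sacct_time(value: str) -> Optional[datetime]:
    """Return a datetime, or None for placeholders such as 'Unknown'."""
    try:
        return datetime.strptime(value.strip(), SACCT_TIME_FORMAT)
    except ValueError:
        return None


def wait_time(submit: str, eligible: str, start: str) -> Optional[timedelta]:
    """Compute wait = start - eligible, using submit if eligible is unusable."""
    started = parse_sacct_time(start)
    if started is None:
        return None  # the job has not started, so there is no wait time yet
    # Prefer the eligible time; fall back to the submit time as a best guess.
    reference = parse_sacct_time(eligible) or parse_sacct_time(submit)
    if reference is None:
        return None
    return started - reference
```

The three strings would come from the Submit, Eligible and Start fields of a sacct record, for example wait_time("2021-04-13T10:00:00", "Unknown", "2021-04-13T10:20:00").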
However, the "eligible" and "reserved" fields in sacct will be
set or increased also if a job has hit a resource throttling limit,
which may be something you want to factor out of the job waiting time
as well.
Unfortunately, I haven't found any metric in sacct that counts only
(or allows one to derive) the time a job had to wait just for
sufficient resources to become available. Maybe someone else?
Good point! I don't have an answer...
The second is a bit cumbersome, so I wonder whether a 'solution' is
already around. The easiest way is to monitor from the beginning and
store the squeue output for later evaluation. Unfortunately I didn’t
do that.
Not sure if this is a solution for you but I think you can at
least resample this retrospectively from sacct by using something like
sacct -a -X -S 2021-04-01T00:00:00 -s PD -o JobID,User,Partition
This will return job records for all jobs that were in pending state
That's a nice trick! According to the sacct man-page, when you specify
the state (-s PD) and the starttime with -S, the DEFAULT TIME WINDOW in
this case sets endtime=starttime. Thus you get a snapshot of the Pending
jobs at the instant given by -S. This could definitely be used to make
graphs of Pending jobs in each partition as a function of time.
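
A rough Python sketch of how such a time series could be collected, assuming sacct is in the PATH and accounting data is available for the period. The function names are hypothetical. The -n (no header) and -P (parsable, "|"-separated) options are added to the command quoted above only to make the output easier to parse.

```python
import subprocess
from collections import Counter
from datetime import datetime, timedelta


def pending_snapshot_command(instant: datetime) -> list:
    """Build the sacct call that lists the jobs pending at one instant."""
    return ["sacct", "-a", "-X", "-n", "-P",
            "-S", instant.strftime("%Y-%m-%dT%H:%M:%S"),
            "-s", "PD", "-o", "JobID,User,Partition"]


def count_pending_by_partition(sacct_output: str) -> Counter:
    """Count job records per partition in 'JobID|User|Partition' lines."""
    counts = Counter()
    for line in sacct_output.splitlines():
        fields = line.strip().split("|")
        if len(fields) == 3 and fields[2]:
            counts[fields[2]] += 1
    return counts


def sample_instants(first: datetime, last: datetime, step: timedelta):
    """Yield evenly spaced instants from first to last (inclusive)."""
    current = first
    while current <= last:
        yield current
        current += step


def pending_time_series(first: datetime, last: datetime, step: timedelta) -> dict:
    """Run one sacct snapshot per instant; returns {instant: Counter}."""
    series = {}
    for instant in sample_instants(first, last, step):
        result = subprocess.run(pending_snapshot_command(instant),
                                capture_output=True, text=True, check=True)
        series[instant] = count_pending_by_partition(result.stdout)
    return series
```

For example, pending_time_series(datetime(2021, 4, 1), datetime(2021, 4, 13), timedelta(hours=1)) would issue one sacct query per hour, so a coarser step may be kinder to the accounting database.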
/Ole