Hi Diego,
On 7/23/21 12:36 PM, Diego Zuccato wrote:
I believe that slurmd reports only the 15-minute CPU load average to
slurmctld, so you already have this information.
Yup. It's just unexpected: if you don't know this, you run pestat and see
an idle node with a very high load :)
My users would think someone is breaking the rules...
Well, Slurm reports the 15-minute load average. I guess users will have
to learn that, because we can't print help information every time.
If you run "pestat -F" it will show you (in red) the nodes where the CPU
load is outside the range expected from the number of allocated cores.
That covers your situation, where 0 CPUs are allocated.
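To illustrate the idea of comparing the load with the allocated cores, here is a minimal Python sketch. It is not pestat's actual code: the function name and both thresholds (low_factor, high_margin) are assumptions made up for this example.

```python
def load_is_suspicious(cpu_load, allocated_cores,
                       low_factor=0.5, high_margin=1.0):
    """Return True if the load is far from what the allocation suggests.

    low_factor and high_margin are hypothetical thresholds, chosen only
    for illustration; they are not taken from pestat.
    """
    # Load clearly above the number of allocated cores, e.g. an "idle"
    # node (0 allocated cores) that still shows a high load average.
    too_high = cpu_load > allocated_cores + high_margin
    # Load clearly below the allocation, e.g. a job that is not using
    # the cores it was given.
    too_low = cpu_load < allocated_cores * low_factor
    return too_high or too_low


# An idle node with a high load average would be flagged:
print(load_is_suspicious(12.0, 0))
# A fully allocated 16-core node running at full load would not:
print(load_is_suspicious(15.8, 16))
```

Keep in mind that the load average reported by Slurm lags behind, so a node can still be flagged for a while after a job has finished.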
That's how I noticed it.
Yes, pestat can be quite helpful :-)
I'm wondering what information you get from slurmtop, which you're
missing from pestat? Maybe an opportunity for improvement :-)
Well, it shows the CPU allocations for the various jobs semi-graphically,
so users can tell at a glance if there are usable nodes for their job.
For finding idle nodes, there are better tools:
* sinfo -t idle
* showpartitions (download from
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/partitions)
I added a little code to pestat now that calculates the length of the
longest hostname (minimum 8, capped at 20 chars) and uses it as the column
width. This is done by querying Slurm with "sinfo -N -O NodeList". Can you
try out this new version on your cluster?
Download: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
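The width calculation can be sketched roughly as follows. This is an illustration in Python, not the code in pestat itself; the function names and the sample hostnames are hypothetical, and the input is assumed to be a list of node names such as the one "sinfo -N -O NodeList" returns.

```python
def hostname_column_width(hostnames, minimum=8, maximum=20):
    """Column width for hostnames: at least `minimum`, at most `maximum`."""
    longest = max((len(name) for name in hostnames), default=0)
    return max(minimum, min(longest, maximum))


def format_hostname(name, width):
    """Truncate the name to `width` and pad it so columns stay aligned."""
    return f"{name[:width]:<{width}}"


# Hypothetical node names, for illustration only.
nodes = ["n001", "gpu-node-17", "a-very-long-hostname-for-a-node"]
width = hostname_column_width(nodes)
for node in nodes:
    print(format_hostname(node, width) + " |")
```

With the sample names above the longest hostname exceeds 20 characters, so the width is capped at 20 and that name is truncated in the output.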
...
Once fixed, it seems to work OK and columns are aligned. Not the first
time long names give us problems :( (users are even worse...).
Oops, I fixed this bug in the master branch now, thanks!
/Ole