Hi Diego,

On 7/23/21 12:36 PM, Diego Zuccato wrote:
>> I believe that slurmd reports only the 15-minute CPU load average to slurmctld, so you already have this information.
> Yup. It's just unexpected: if you don't know, you run pestat and see that an idle node has a very high load :)
> My users would think someone is breaking the rules...

Well, Slurm reports the 15-minute load average. I guess users will have to learn that, because we can't print help information every time.

>> If you run "pestat -F" it will show you (in red color) the nodes where the CPU load is outside the expected range, as given by the number of allocated cores.  That covers your situation when 0 CPUs are allocated.
> That's how I noticed it.

Yes, pestat can be quite helpful :-)

>> I'm wondering what information you get from slurmtop, which you're missing from pestat?  Maybe an opportunity for improvement :-)
> Well, it shows semi-graphically the CPU allocations for the various jobs, so users can tell at a glance if there are usable nodes for their job.

For finding idle nodes, there are better tools:

* sinfo -t idle

* showpartitions (download from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/partitions)

>> I added a little code to pestat now that calculates the longest hostname (minimum 8, truncated to 20 chars).  This is done by querying Slurm with "sinfo -N -O NodeList".  Can you try out this new version on your cluster?
>> Download: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
> ...
> Once fixed, it seems to work OK and columns are aligned. Not the first time long names give us problems :( (users are even worse...).
Oops, I fixed this bug in the master branch now, thanks!
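
For readers following along, the hostname-width idea mentioned above can be sketched roughly as below. This is only an illustration, not the actual pestat code: the function name hostname_width is made up here, and the -h (no header) flag on sinfo is my addition so that the header line is not counted as a hostname.

```shell
#!/usr/bin/env bash
# Sketch only: not the actual pestat code.
# hostname_width (hypothetical name) reads one hostname per line on stdin
# and prints a column width: the longest name, clamped to the range 8..20.
hostname_width() {
    awk -v min=8 -v max=20 '
        { n = length($1); if (n > w) w = n }   # $1 ignores trailing padding
        END {
            if (w < min) w = min               # never narrower than 8
            if (w > max) w = max               # never wider than 20
            print w
        }'
}

# On a Slurm cluster the node names could be fed in from sinfo
# (assumption: -h is added to suppress the header line):
#   width=$(sinfo -h -N -O NodeList | hostname_width)
# and the width could then be used to pad or truncate a column:
#   printf "%-${width}.${width}s %s\n" "$node" "$state"
```

Clamping the width in one place keeps every row aligned to the same column, even when a few node names are much longer than the rest.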

/Ole
