Thanks for the explanation, Brian.  It seems turning on IOMMU helped, along with 
adding these sharing settings to slurm.conf:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU

Now all the CPUs are being used on all the compute nodes, so things are working 
as expected.
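
(For anyone else hitting this: a quick way to confirm the allocation is to look 
at the per-node CPU counts; the node name below is just a placeholder.)

sinfo -N -o "%N %C"                        # CPUS as allocated/idle/other/total per node
scontrol show node node01 | grep -i cpu    # CPUAlloc/CPUTot for a single node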

Thanks also to everyone else on the list who helped, Andy, Ole, Chris, 
appreciate it!  Looking forward to helping out where I can as well.

Brian Andrus wrote on 1/28/21 15:50:

Yep, looks like you are on the right track.

If the CPU count does not make sense to Slurm, it will drain the node, and jobs 
will not be able to start on it.
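
(For reference, something like the following shows why a node was drained and 
resumes it once the config matches the hardware; the node name is a placeholder.)

sinfo -R                                        # drained/down nodes with the reason
scontrol show node node01 | grep -i reason      # reason recorded for one node
scontrol update nodename=node01 state=resume    # undrain after fixing slurm.conf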

There does seem to be more to it, though. Detailed info about a job and a node 
would help.
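
(Something along these lines would do it; the job ID and node name are placeholders.)

scontrol show job 12345     # requested vs. allocated CPUs, memory, nodes
scontrol show node node01   # CPUTot, RealMemory, CPUAlloc, AllocMem, State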

You can ignore the jobs pending with 'Priority' as the reason; those aren't 
starting because another job is supposed to go first. The job that is supposed 
to go first is the one with 'Resources' as the reason.
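
(The reason column is easy to see with something like the following; the format 
string is just one example.)

squeue --states=PD -o "%.10i %.9P %.12j %.8u %.6D %R"    # %R shows the pending reason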

'Resources' means the scheduler has allocated the resources on the node such 
that there aren't any left to be used.
My bet here is that you aren't specifying memory. If you don't specify it, 
Slurm assigns all of the node's memory to the job. So even if you are only 
using 1 CPU, all the memory is allocated, leaving none for any other job to 
run on the unallocated CPUs.
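
(If that's the case, asking for memory explicitly in the job script, or setting 
a default in slurm.conf, avoids one job grabbing the whole node. The values 
below are only examples.)

#SBATCH --mem-per-cpu=4G    # per-CPU request in the batch script (or --mem=8G for a per-job total)

DefMemPerCPU=4096           # or a cluster-wide default in slurm.conf, in MB per allocated CPU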

Brian Andrus

On 1/28/2021 2:15 PM, Chandler wrote:

Brian Andrus wrote on 1/28/21 13:59:
What are the specific requests for resources from a job?
Nodes, Cores, Memory, threads, etc?

Well, the jobs are only asking for 16 CPUs each.  The 255-thread count is weird 
though; it seems to be related to this:
https://askubuntu.com/questions/1182818/dual-amd-epyc-7742-cpus-show-only-255-threads

The vendor recommended turning on IOMMU in the BIOS, so I will try that and see 
if it helps...
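
(After the BIOS change, a quick sanity check on one of the nodes, outside of 
Slurm, would be something like the following; on dual 64-core EPYCs with SMT it 
should report 256 logical CPUs rather than 255.)

lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'   # sockets, cores, threads, total logical CPUs
nproc                                              # logical CPUs visible to the OS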


