We had a similar problem on the cluster I look after. I have a job_submit script which adds certain nodes to the job's excluded nodes list based on each node's number of CPUs per GPU. This basically solved the fragmentation problem entirely. The problem is that cons_tres seems to think (for example) that an 8-core job needing one GPU would be a good fit for an 8-core machine with four GPUs, leaving three GPUs unused; the node would then appear as "alloc". In such a case, you'd want to exclude that node, since it has only 2 cores per GPU. This will push the job onto a node with more cores per GPU.
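To make the filtering idea concrete, here is a rough sketch in Python. It is not the actual job_submit script (that one is written for the Lua plugin), only the decision logic. The function name, the node inventory dict, and the node names are made up for illustration; the node shapes are the ones mentioned in this thread, and the 1.2 factor is the roughly 20% allowance used in our test.

```python
def nodes_to_exclude(job_cpus, job_gpus, nodes, slack=1.2):
    """Return the names of nodes with too few cores per GPU for this job.

    nodes is a hypothetical inventory: node name -> (cpus, gpus).
    slack is the tolerance factor (1.2 allows roughly 20%).
    """
    if job_gpus <= 0:
        return []  # CPU-only job: nothing to filter on
    job_ratio = job_cpus / job_gpus
    excluded = []
    for name, (cpus, gpus) in sorted(nodes.items()):
        if gpus <= 0:
            continue  # node without GPUs: not considered by this check
        # Exclude the node if the job wants noticeably more cores per GPU
        # than the node can offer.
        if job_ratio > (cpus / gpus) * slack:
            excluded.append(name)
    return excluded


# Illustrative node shapes: 8 cores/4 GPUs, 20 cores/2 GPUs, 64 cores/4 GPUs.
nodes = {"small": (8, 4), "mid": (20, 2), "big": (64, 4)}

# An 8-core, 1-GPU job is kept off the 8-core, 4-GPU node.
print(",".join(nodes_to_exclude(8, 1, nodes)))  # prints "small"
```

In a real plugin the resulting list would be merged into whatever the user already passed as excluded nodes rather than replacing it.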
Our test is something like:

    (job cpus / job gpus) > (node cpus / node gpus) * 1.2

which allows 20% or so of leeway, since there will also be a certain percentage of jobs which need several GPUs but only a couple of cores. It's fairly simple to implement with the Lua job_submit plugin.

For newer versions of Slurm I believe it is necessary to check both tres per task and tres per node. Fortunately only one should be set. I'm not sure about the --gpus flag; we're still using --gres.

Cheers,
Aaron

On 8 February 2021 at 11:36 GMT, Ansgar Esztermann-Kirchner wrote:
> Hello List,
>
> we're running a heterogeneous cluster (just x86_64, but a lot of
> different node types from 8 to 64 HW threads, 1 to 4 GPUs).
> Our processing power (for our main application, at least) is
> exclusively provided by the GPUs, so cons_tres looks quite promising:
> depending on the size of the job, request an appropriate number of
> GPUs. Of course, you have to request some CPUs as well -- ideally,
> evenly distributed among the GPUs (e.g. 10 per GPU on a 20-core, 2-GPU
> node; 16 on a 64-core, 4-GPU node).
> Of course, one could use different partitions for different nodes, and
> then submit individual jobs with CPU requests tailored to one such
> partition, but I'd prefer a more flexible approach where a given job
> could run on any large enough node.
>
> Is there anyone with a similar setup? Any config options I've missed,
> or do you have a work-around?
>
> Thanks,
>
> A.

--
Research Fellow
School of Computer Science
University of Nottingham