This is related to this other thread:
https://groups.google.com/g/slurm-users/c/88pZ400whu0/m/9FYFqKh6AQAJ

AFAIK, the only (rudimentary) solution is the MaxCPUsPerNode partition
flag combined with separate gpu and cpu partitions, but having something
like "CpusReservedPerGpu" would be nice.
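For illustration, here is roughly what that looks like in slurm.conf; a
minimal sketch, assuming a hypothetical 16-core node with four GPUs (the
node name and all numbers are made up):

    # Hypothetical 16-core node with 4 GPUs, shared by both partitions
    NodeName=gpunode01 CPUs=16 RealMemory=128000 Gres=gpu:4

    # CPU-only jobs may use at most 8 cores per node, so 8 cores
    # (2 per GPU) always remain free for jobs in the gpu partition
    PartitionName=cpu Nodes=gpunode01 MaxCPUsPerNode=8
    PartitionName=gpu Nodes=gpunode01

The catch, and why it is only rudimentary: it caps CPU-only jobs, but a
job in the gpu partition can still take one GPU plus all 16 cores and
starve the other three GPUs anyway.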
@Aaron would you be willing to share such a script?

On Wed, Oct 21, 2020 at 00:01, Relu Patrascu (<r...@cs.toronto.edu>) wrote:
>
> I thought of doing this, but I'm guessing you don't have preemption
> enabled. With preemption enabled this becomes more complicated, and
> error prone, but I'll think some more about it. It'd be nice to
> leverage Slurm's scheduling engine and just add this constraint.
>
> Relu
>
> On 2020-10-20 16:20, Aaron Jackson wrote:
> > I look after a very heterogeneous GPU Slurm setup and some nodes have
> > rather few cores. We use a job_submit lua script which calculates the
> > number of requested cpu cores per gpu. This is then used to scan
> > through a table of 'weak nodes' based on a 'max cores per gpu'
> > property. The node names are appended to the job_desc exc_nodes
> > property.
> >
> > It's not particularly elegant but it does work quite well for us.
> >
> > Aaron
> >
> > On 20 October 2020 at 18:17 BST, Relu Patrascu wrote:
> >
> >> Hi all,
> >>
> >> We have a GPU cluster and have run into this issue occasionally.
> >> Assume four GPUs per node; when a user requests a GPU on such a node,
> >> and all the cores, or all the RAM, the other three GPUs will be
> >> wasted for the duration of the job, as Slurm has no cores or RAM left
> >> to allocate to subsequent jobs that would use those GPUs.
> >>
> >> We have a "soft" solution to this, but it's not ideal: we assigned
> >> large TresBillingWeights to CPU consumption, discouraging users from
> >> allocating many CPUs.
> >>
> >> Ideal for us would be the ability to define, for each GPU, a number
> >> of CPUs that always stays available on the node. A similar feature
> >> for an amount of RAM would also help.
> >>
> >> Take for example a node that has:
> >>
> >> * four GPUs
> >> * 16 CPUs
> >>
> >> Let's assume that most jobs would work just fine with a minimum of
> >> 2 CPUs per GPU. Then we could set in the node definition a variable
> >> such as
> >>
> >> CpusReservedPerGpu = 2
> >>
> >> The first job to run on this node could get between 2 and 10 CPUs,
> >> leaving 6 CPUs for potential incoming jobs (2 per GPU).
> >>
> >> We couldn't find a way to do this; are we missing something? We'd
> >> rather not modify the source code again :/
> >>
> >> Regards,
> >>
> >> Relu

-- 
Stephan Schott Verdugo
Biochemist

Heinrich-Heine-Universitaet Duesseldorf
Institut fuer Pharm. und Med. Chemie
Universitaetsstr. 1
40225 Duesseldorf
Germany
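Not Aaron's actual script, but a minimal, untested sketch of the
approach he describes, in case it helps as a starting point. The node
names and per-GPU core limits in the table are made up, and the exact
job_desc fields (tres_per_node here vs. gres on older releases) vary
between Slurm versions:

    -- job_submit.lua: steer CPU-heavy GPU jobs away from "weak" nodes
    -- (node names and limits below are hypothetical)
    local weak_nodes = {
       ["gpunode01"] = 2,   -- at most 2 cores per GPU on this node
       ["gpunode02"] = 4,
    }

    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- GPUs requested per node, e.g. "--gres=gpu:2"; the field name
       -- and format depend on the Slurm version (simplistic parsing)
       local gres = job_desc.tres_per_node or job_desc.gres or ""
       local ngpus = tonumber(string.match(gres, "gpu[^%d]*(%d+)") or 0)
       if ngpus == 0 then
          return slurm.SUCCESS   -- not a GPU job, nothing to do
       end

       local ncpus = job_desc.min_cpus or 1
       local cpus_per_gpu = ncpus / ngpus

       -- exclude every weak node whose per-GPU core budget is exceeded
       local excluded = {}
       for node, max_per_gpu in pairs(weak_nodes) do
          if cpus_per_gpu > max_per_gpu then
             table.insert(excluded, node)
          end
       end

       if #excluded > 0 then
          local list = table.concat(excluded, ",")
          if job_desc.exc_nodes and job_desc.exc_nodes ~= "" then
             job_desc.exc_nodes = job_desc.exc_nodes .. "," .. list
          else
             job_desc.exc_nodes = list
          end
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end

As Relu points out, this only steers jobs at submit time; with
preemption enabled the allocation picture changes after scheduling,
which a submit filter cannot see.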
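For completeness, the billing-weight workaround Relu mentions is a
per-partition slurm.conf setting (the weights here are made up, and it
takes effect with PriorityType=priority/multifactor):

    # Make CPUs expensive relative to GPUs in TRES billing, so
    # grabbing many cores per GPU costs users fairshare/priority
    PartitionName=gpu Nodes=gpunode01 TresBillingWeights="CPU=2.0,Mem=0.25G,GRES/gpu=1.0"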