Hello,

On 05.02.19 16:46, Ansgar Esztermann-Kirchner wrote:
> [...] -- we'd like to have two "half nodes", where
> jobs will be able to use one of the two GPUs, plus (at most) half of
> the CPUs. With SGE, we've put two queues on the nodes, but this
> effectively prevents certain maintenance jobs from running.
> How would I configure these nodes in Slurm?
Why don't you use an additional "maintenance" queue/partition containing the whole nodes? Both SGE and SLURM support that.

> From the docs I gathered that MaxTRESPerJob would be a solution, but
> this is coupled to associations, which I do not fully understand.
> Is this the best/only way to achieve such a partitioning?
> If so, do I need to define an association for every user, or can I
> define a default/skeleton association that new users automatically
> inherit?
> Are there other/better ways to go?

Let's agree on "other" ;) -- use the OS to partition the resources on the host: VMs, systemd-nspawn, ... .

Because we have to run VMs and services in parallel to SLURM, I tested partitioning our (small number of) hosts via Ganeti/KVM. Side effect: I was able to live-migrate a (virtual) node with running jobs during maintenance. Performance was very close to bare metal, and while we are currently not running our GPU jobs this way, even GPU pass-through should be possible with a negligible performance penalty:

Walters et al., "GPU Passthrough Performance: A Comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL Applications"
https://ieeexplore.ieee.org/document/6973796?tp=&arnumber=6973796

A disadvantage of OS-level partitioning is the additional setup effort it requires. But honestly, I have even thought about stretching this further, for two reasons:

1. to gain more flexibility than the (poor?) elastic features of SLURM offer, by defining purely virtual nodes of different sizes and starting whatever selection fits on a case-by-case basis -- then again, I wouldn't want to do that for 900 hosts without a proper helper program.

2. to separate a (conservative) host OS from a (modern, but stable) node OS, easing the differing constraints we had back then.

Regards,
Benjamin

--
FSU Jena | JULIELab.de/Staff/Redling
☎ +49 3641 9 44323
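P.S. For the record, a minimal sketch of the overlapping-partition idea in slurm.conf -- all node and partition names here are hypothetical, and the per-job GPU cap would still come from a QOS via MaxTRESPerJob, as you suspected:

```ini
# Hypothetical slurm.conf fragment. The same nodes appear in two
# partitions: "half" for regular jobs, "maint" for maintenance jobs
# that need the whole node.
NodeName=gpu[01-04] CPUs=32 Gres=gpu:2 State=UNKNOWN

# Regular jobs: at most half the CPUs of a node per job.
# A QOS with MaxTRESPerJob=gres/gpu=1 would additionally cap GPUs.
PartitionName=half Nodes=gpu[01-04] MaxCPUsPerNode=16 Default=YES State=UP

# Maintenance jobs: full access to the same nodes, restricted to admins.
PartitionName=maint Nodes=gpu[01-04] AllowGroups=admin State=UP
```

Since both partitions share the nodes, you may also want to give "maint" a higher PriorityTier so maintenance jobs are scheduled ahead of regular work.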