Hi Loris,
mainly, we didn't want to have too many partitions, so we were after a
way to avoid separating the GPU nodes out.
Partly it is because we wanted to be able to easily use 'idle' CPUs on
GPU nodes - although I currently only allow that on some of them (I
simply tag those with 'cpu' as well). Having them in a separate
partition would mean users would have to change where they submit to,
or I would have to mess with that in the verifier...
Also - for various reasons we'd end up with a lot of partitions
(something like 10 or 12), which seemed excessive. We liked it better
having the GPU nodes not separated out and teaching users to specify
their resources properly (the GPUs are a very mixed bunch, as well).
We did think about having 'hidden' GPU partitions instead of wrangling
it with features, but we couldn't see any benefit to that.
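Under a feature-tagging scheme like that, user submissions might look something like the following (illustrative only - the feature names, resource sizes, and script names here are made up, not taken from Tina's actual setup):

```shell
# Plain CPU job: no constraint needed; the submit filter would add
# --constraint=cpu automatically (hypothetical setup).
sbatch --cpus-per-task=4 cpu_job.sh

# GPU job: request the gres explicitly; on a mixed GPU fleet a feature
# constraint can additionally narrow it to a particular model.
sbatch --gres=gpu:1 --constraint=gpu gpu_job.sh
```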
Tina
On 02/07/2021 06:48, Loris Bennett wrote:
Hi Tina,
Tina Friedrich <tina.friedr...@it.ox.ac.uk> writes:
Hi Brian,
sometimes it would be nice if Slurm had what Grid Engine calls a 'forced
complex' (i.e. a feature that you *have* to request in order to land on a
node that has it), wouldn't it?
I do something like that for all of my 'special' nodes (GPU, KNL, ...) - I
want to stop jobs that don't request that resource (or that shouldn't run on
that architecture) from landing on them. I 'tag' all nodes with a relevant
feature (cpu, gpu, knl, ...), and have a Lua submit verifier that checks for
a 'relevant' feature (or a --gres=gpu or something) and, if there isn't one,
adds the 'cpu' feature to the request.
Works for us!
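A minimal sketch of what such a job_submit.lua filter could look like - the feature names ('cpu'), the gres matching, and the job_desc field names are assumptions here (field names like job_desc.gres vary between Slurm versions), not Tina's actual code:

```lua
-- Sketch of a job_submit.lua verifier that tags feature-less,
-- non-GPU jobs with a default 'cpu' feature. Assumptions: nodes are
-- tagged with 'cpu'/'gpu' features, and GPU requests appear in a
-- gres string containing "gpu".

-- Pure helper: does this job need the default 'cpu' feature added?
function needs_cpu_tag(features, gres)
  -- Job already requested a node feature: leave it alone.
  if features ~= nil and features ~= "" then
    return false
  end
  -- Job requested a GPU gres: leave it alone.
  if gres ~= nil and string.find(gres, "gpu", 1, true) then
    return false
  end
  return true
end

-- Called by slurmctld for every submission.
function slurm_job_submit(job_desc, part_list, submit_uid)
  if needs_cpu_tag(job_desc.features, job_desc.gres) then
    job_desc.features = "cpu"
  end
  return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, submit_uid)
  return slurm.SUCCESS
end
```

The effect is that a job which requests neither a feature nor a GPU can only land on 'cpu'-tagged nodes, without users having to change their scripts.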
We just have the GPU nodes in a separate partition 'gpu' which users
have to specify if they want a GPU. How does that approach differ from
yours in terms of functionality for you (or the users)?
The main problem with our approach is that the CPUs on the GPU nodes can
remain idle while there is a queue for the regular CPU nodes. What I
would like is to allow short CPU-only jobs to run on the GPU nodes, while
only letting GPU jobs run there for longer - which I guess I could
probably do within the submit plugin.
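That time-based policy could be sketched in a job_submit.lua along these lines - the 30-minute cutoff, the 'cpu'/'gpu' feature names, and the job_desc field names are all hypothetical placeholders, not an actual implementation (note that constraint OR syntax like "cpu|gpu" requires the nodes to carry those features):

```lua
-- Sketch: short CPU-only jobs may spill onto idle GPU nodes,
-- long CPU-only jobs stay on plain CPU nodes. Assumes nodes are
-- tagged with 'cpu'/'gpu' features. Cutoff is a made-up value.
SHORT_LIMIT_MINUTES = 30

-- Pure helper: which feature constraint should a CPU-only job get?
-- (job_desc.time_limit is expressed in minutes.)
function cpu_job_feature(time_limit_minutes)
  if time_limit_minutes ~= nil
     and time_limit_minutes <= SHORT_LIMIT_MINUTES then
    -- Short job: either plain CPU nodes or idle GPU nodes will do.
    return "cpu|gpu"
  end
  -- Long (or unlimited) job: plain CPU nodes only.
  return "cpu"
end

function slurm_job_submit(job_desc, part_list, submit_uid)
  local wants_gpu = job_desc.gres ~= nil
                    and string.find(job_desc.gres, "gpu", 1, true)
  if not wants_gpu
     and (job_desc.features == nil or job_desc.features == "") then
    job_desc.features = cpu_job_feature(job_desc.time_limit)
  end
  return slurm.SUCCESS
end
```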
Cheers,
Loris
Tina
On 01/07/2021 15:08, Brian Andrus wrote:
All,
I have a partition where one of the nodes has a node-locked license.
That license is not used by everyone that uses the partition.
They are cloud nodes, so weights do not work (there is an open bug about
that).
I need to have jobs 'avoid' that node by default. I am thinking I can use a
feature constraint, but that seems to only apply to those that want the
feature. Since we have so many other users, it isn't feasible to have them
modify their scripts, so having it avoid by default would work.
Any ideas how to do that? A Lua submit plugin, perhaps?
Brian Andrus
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk