:) That was the first thing we tried - however, that only works if your cluster isn't habitually 100% busy with jobs waiting. So that didn't work very well - even with the weighting set up so that the GPU nodes were the 'last resort' (after all the special high-memory nodes), they were always running CPU jobs.

(And I did read as many of the 'how can we reserve X amount of cores for GPU work' threads as I could find, but none of them seemed very straightforward - and hey, given that the GPUs are also always fully used, I don't think we're wasting many resources in this setup.)

Tina

On 02/07/2021 15:44, Jeffrey R. Lang wrote:
How about using node weights? Weight the non-GPU nodes so that they are scheduled first. The GPU nodes could have a very high weight so that the scheduler would consider them last for allocation. This would allow the non-GPU nodes to be filled first and, only when they are full, the GPU nodes. Users needing a GPU could just include a feature request, which should allocate the GPU nodes as necessary.
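
In slurm.conf that could look something like the following sketch (node names, weights, and the GPU count are made-up placeholders; lower weight means the node is preferred):

    # Sketch only - names and numbers are illustrative.
    # The scheduler fills low-weight nodes first.
    NodeName=cpu[001-064]  Weight=1   Feature=cpu
    NodeName=bigmem[01-04] Weight=10  Feature=cpu,highmem
    NodeName=gpu[01-08]    Weight=100 Feature=gpu Gres=gpu:4

A user who actually needs a GPU would then ask for one explicitly, e.g. with 'sbatch --gres=gpu:1' or 'sbatch --constraint=gpu'.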

Jeff


-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Loris 
Bennett
Sent: Friday, July 2, 2021 12:48 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] How to avoid a feature?

Hi Tina,

Tina Friedrich <tina.friedr...@it.ox.ac.uk> writes:

Hi Brian,

Sometimes it would be nice if SLURM had what Grid Engine calls a 'forced complex' (i.e. a feature that you *have* to request to land on a node that has it), wouldn't it?

I do something like that for all of my 'special' nodes (GPU, KNL, ...) - I want to stop jobs that don't request that resource or architecture from landing on them. I 'tag' all nodes with a relevant feature (cpu, gpu, knl, ...) and have a LUA submit verifier that checks for a 'relevant' feature (or a --gres=gpu or something); if there isn't one, I add the 'cpu' feature to the request.

Works for us!
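
(For illustration, a minimal sketch of what such a submit verifier could look like - the feature names follow the description above, and the gres field name varies between Slurm versions, so treat this as an assumption rather than the actual plugin:)

    -- job_submit.lua - minimal sketch, not the real plugin.
    -- Jobs that request neither a 'special' feature nor a GPU
    -- gres get the 'cpu' feature added, so they can only land
    -- on nodes tagged with Feature=cpu.
    function slurm_job_submit(job_desc, part_list, submit_uid)
       local features = job_desc.features or ""
       -- Field name for generic resources differs by version.
       local gres = job_desc.tres_per_node or job_desc.gres or ""

       if features:match("gpu") or features:match("knl")
          or gres:match("gpu") then
          return slurm.SUCCESS
       end

       if features == "" then
          job_desc.features = "cpu"
       else
          job_desc.features = features .. "&cpu"
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end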

We just have the GPU nodes in a separate partition 'gpu', which users have to specify if they want a GPU.  How does that approach differ from yours in terms of functionality for you (or the users)?

The main problem with our approach is that the CPUs on the GPU nodes can remain idle while there is a queue for the regular CPU nodes.  What I would like is to allow short CPU-only jobs to run on the GPU nodes, but only allow GPU jobs to run for longer, which I guess I could probably do within the submit plugin.
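
(A rough sketch of what that might look like in job_submit.lua - the partition names 'main' and 'gpu' and the four-hour cut-off are assumptions:)

    -- Sketch only: allow short CPU-only jobs onto the GPU
    -- partition.  time_limit is in minutes; when the user sets
    -- no limit it holds a very large NO_VAL sentinel, which the
    -- comparison below also filters out.
    local SHORT_MINUTES = 4 * 60

    function slurm_job_submit(job_desc, part_list, submit_uid)
       local gres = job_desc.tres_per_node or job_desc.gres or ""

       if not gres:match("gpu")
          and (job_desc.partition == nil or job_desc.partition == "main")
          and job_desc.time_limit ~= nil
          and job_desc.time_limit <= SHORT_MINUTES then
          -- Short CPU-only jobs may run in either partition.
          job_desc.partition = "main,gpu"
       end
       return slurm.SUCCESS
    end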

Cheers,

Loris


Tina

On 01/07/2021 15:08, Brian Andrus wrote:
All,

I have a partition where one of the nodes has a node-locked license.
That license is not used by everyone who uses the partition.
They are cloud nodes, so weights do not work (there is an open bug about
that).

I need to have jobs 'avoid' that node by default. I am thinking I can use a
feature constraint, but that seems to apply only to jobs that want the
feature. Since we have so many other users, it isn't feasible to have them
all modify their scripts, so having jobs avoid the node by default would work.

Any ideas how to do that? Submit LUA perhaps?

Brian Andrus


--
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de


--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
