We have had a similar problem: even with separate partitions for CPU and GPU nodes, people still submitted what we suspected were CPU-only jobs to the GPU nodes. It doesn't help to look for a missing --gres=gpu:x, because a user can ask for GPUs and simply not use them. We considered checking actual GPU usage, but that isn't ideal either, partly because getting real GPU utilisation is messy (we did it for a while via NVIDIA's API), and partly because there are legitimate jobs that need a GPU but not intensively (e.g. some reinforcement learning experiments).

The main currency on our cluster is the fairshare score. We do not treat shares as credit points, but rather as a resource that is eroded by consumption. We assigned TRES billing weights on the GPU nodes such that allocating one GPU on a four-GPU node automatically charges you max(N/4, M/4, G/4), where N, M, and G are the node's cores, memory, and number of GPUs. To make this work we also set PriorityFlags=MAX_TRES in slurm.conf.
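As a concrete illustration (the node sizes here are hypothetical: a four-GPU node with 40 cores and 384 GB of RAM), the weights can be normalised so that a full node bills as 4 units and a single GPU as 1 unit:

    # slurm.conf -- sketch with made-up node sizes (40 cores, 384 GB RAM, 4 GPUs)
    PriorityFlags=MAX_TRES
    # 40 cores * 0.1 = 4, 384 GB * ~0.0104 = ~4, 4 GPUs * 1.0 = 4,
    # i.e. each TRES tops out at 4 billing units for a whole node.
    PartitionName=gpu Nodes=gpu[01-04] TRESBillingWeights="CPU=0.1,Mem=0.0104G,GRES/gpu=1.0"

With MAX_TRES, the billable amount is the largest of the weighted TRES, so a job that grabs one GPU plus most of the RAM gets billed at the memory fraction, i.e. nearly the whole node.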

Now we don't have to worry about someone taking all the RAM and just 1 CPU and 1 GPU on a node: they "pay" for whichever resource they consume the most of. We used to have a problem where someone would allocate just 1 GPU, a few CPU cores, and almost all of the RAM, effectively rendering the node useless to others. Now they pay for almost the entire node if they do that, which is the fairest charge, because nobody else can use the node.

This also works for us because we use preemption across the cluster (with a 1-hour exemption time) and jobs get preempted based on job priority. The more resources anyone consumes, the lower their fairshare score drops, and with it their job priorities.
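For reference, the relevant slurm.conf knobs for that kind of setup look roughly like this (a sketch only; the PreemptType/PreemptMode values below are placeholders and not necessarily what a priority-based setup would use, but PreemptExemptTime is where the 1-hour grace period goes):

    # slurm.conf -- preemption sketch; type/mode values are assumptions
    PreemptType=preempt/qos
    PreemptMode=REQUEUE
    PreemptExemptTime=01:00:00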

Relu



On 2021-07-01 13:21, Tina Friedrich wrote:
Hi Brian,

sometimes it would be nice if SLURM had what Grid Engine calls a 'forced complex' (i.e. a feature that you *have* to request to land on a node that has it), wouldn't it?

I do something like that for all of my 'special' nodes (GPU nodes, KNL nodes, ...) - I want to stop jobs that don't request that resource, or don't allow that architecture, from landing on them. I 'tag' all nodes with a relevant feature (cpu, gpu, knl, ...) and have a LUA submit verifier that checks for a 'relevant' feature (or a --gres=gpu or something); if there isn't one, I add the 'cpu' feature to the request.
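A minimal job_submit.lua along those lines might look like this (a sketch only; the exact job_desc field names, e.g. gres vs. tres_per_node, depend on the Slurm version):

    -- job_submit.lua (sketch): steer jobs onto 'cpu'-tagged nodes unless they
    -- request a GPU or already set a constraint themselves.
    function slurm_job_submit(job_desc, part_list, submit_uid)
        -- field name depends on Slurm version: older releases use gres,
        -- newer ones tres_per_node
        local gres = job_desc.tres_per_node or job_desc.gres or ""
        if (job_desc.features == nil or job_desc.features == "")
           and not string.find(gres, "gpu") then
            job_desc.features = "cpu"
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end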

Works for us!

Tina

On 01/07/2021 15:08, Brian Andrus wrote:
All,

I have a partition where one of the nodes has a node-locked license.
That license is not used by everyone that uses the partition.
They are cloud nodes, so weights do not work (there is an open bug about that).

I need to have jobs 'avoid' that node by default. I am thinking I can use a feature constraint, but that seems to apply only to jobs that request the feature. Since we have so many other users, it isn't feasible to have them all modify their scripts, so having jobs avoid the node by default would work.

Any ideas how to do that? Submit LUA perhaps?

Brian Andrus



