Re: [slurm-users] Requesting total GPUs or memory, not per node.

Nadav Toledo Wed, 21 Feb 2018 21:15:11 -0800

Hey Rob,
Would something along the lines of srun --ntasks=2 --gres=gpu:4 nvidia-smi help?
This should run two tasks, each with 4 GPUs, and execute nvidia-smi in each;
the output should be similar to running nvidia-smi on a single 8-GPU server.

On 22/02/2018 01:26, Rob Middleton wrote:

Hello,

I'm relatively new to administering Slurm, so my apologies if I've missed something obvious.

We have nodes with 4 GPUs and nodes with 8 GPUs. I would like users to be able to request the total number of GPUs they require. The MPI software is not fussed how many nodes it spans.

I had hoped requests such as these would work:

#SBATCH --gres=gpu:8
#SBATCH --exclusive
#SBATCH --nodes=1-2

However, as both "gres" and the alternative workaround "mem" are per-node resources rather than per-job resources, this doesn't work: a pair of 4-GPU boxes can never be chosen.

So -- is there a way to do this right, or to fake it? Such jobs should run on whatever appropriate hardware configuration is first available. The submitted job script will then slightly reconfigure our software configuration depending on the hardware type it lands on, before launching via srun.
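The "reconfigure depending on the hardware type" step described above could look roughly like the sketch below. It assumes Slurm exports the allocated devices in CUDA_VISIBLE_DEVICES as a comma-separated list (which depends on how gres/gpu is configured on the cluster). The helper names count_gpus and pick_config, the configuration names "small" and "big", and the commented-out launch line are all hypothetical placeholders, not part of the original setup.

```shell
#!/bin/sh
# Hypothetical helper: count the entries in a comma-separated device list,
# e.g. the value of CUDA_VISIBLE_DEVICES (assumed to be set by Slurm).
count_gpus() {
    list="$1"
    if [ -z "$list" ]; then
        echo 0
        return
    fi
    # Turn commas into newlines and count the resulting lines.
    printf '%s\n' "$list" | tr ',' '\n' | wc -l | tr -d ' '
}

# Hypothetical helper: map a per-node GPU count to a configuration name.
pick_config() {
    case "$1" in
        4) echo "small" ;;
        8) echo "big" ;;
        *) echo "unknown" ;;
    esac
}

gpus_on_node=$(count_gpus "${CUDA_VISIBLE_DEVICES:-}")
config=$(pick_config "$gpus_on_node")
echo "GPUs visible on this node: $gpus_on_node (config: $config)"

# Placeholder launch line; the application name and flag are assumptions.
# srun ./my_mpi_app --config "$config"
```

Keeping the detection in small functions makes it easy to check the logic outside of a job, for example by calling count_gpus "0,1,2,3" by hand.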

As an alternative -- I note the "heterogeneous jobs" feature. This allows jobs which require resources of "hardware config A" AND "hardware config B". Is there any way to request one hardware configuration OR another?

I can almost fake it for a single use case with "constraints"; however, this syntax doesn't seem to be understood by the parser code:

--constraints=[grp1|grp2|grp3|grp4]&[gpuA*1&gpuB*1]
--nodes=1-2
--exclusive

With example node configuration:

NodeName=small1 Gres=gpu:4 Feature=gpuA,grp1
NodeName=small2 Gres=gpu:4 Feature=gpuB,grp1
NodeName=small3 Gres=gpu:4 Feature=gpuB,grp2
NodeName=small4 Gres=gpu:4 Feature=gpuB,grp2
NodeName=big1 Gres=gpu:8 Feature=gpuA,gpuB,grp3
NodeName=big2 Gres=gpu:8 Feature=gpuA,gpuB,grp4
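To illustrate why a per-node request cannot select a pair of small nodes, here is a small sketch that filters the example node list by a per-node GPU count. It only models the arithmetic described in this message and does not call Slurm; the function name eligible_nodes and the two-column node table are assumptions made for the illustration.

```shell
#!/bin/sh
# Node names and GPU counts taken from the example configuration above.
nodes='small1 4
small2 4
small3 4
small4 4
big1 8
big2 8'

# Hypothetical helper: list nodes that have at least the requested number
# of GPUs, mimicking a per-node --gres=gpu:N request.
eligible_nodes() {
    per_node="$1"
    printf '%s\n' "$nodes" \
        | awk -v want="$per_node" '$2 >= want { print $1 }' \
        | tr '\n' ' ' \
        | sed 's/ $//'
}

echo "gpu:8 per node -> $(eligible_nodes 8)"
echo "gpu:4 per node -> $(eligible_nodes 4)"
```

With a per-node request of 8, only big1 and big2 qualify, so two 4-GPU nodes are never considered even though together they hold 8 GPUs.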

All ideas are appreciated.

Thanks,

Rob Middleton.
