Rather than specifying the processor types as GRES, I would recommend defining them as features of the nodes and letting users specify those features as constraints on their jobs. Since the newer processors are backwards compatible with the older processors, list the older processors as features of the newer nodes, too.

For example, say you have some nodes that support AVX512, and some that only support AVX2. node01 is older and supports only AVX2. node02 is newer and supports AVX512, but is backwards compatible and also supports AVX2. I would have something like this in my slurm.conf file:

NodeName=node01 Feature=avx2 ...
NodeName=node02 Feature=avx512,avx2 ...
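
With the node definitions above, users would request a feature through the --constraint (-C) option of sbatch or srun. A minimal sketch, assuming the feature names avx2 and avx512 from the example; job.sh is only a placeholder script name:

```shell
# Run on any node that advertises AVX2 (matches both node01 and node02)
sbatch --constraint=avx2 job.sh

# Run only on nodes that advertise AVX512 (matches node02 only)
sbatch --constraint=avx512 job.sh

# The same constraint works for an interactive job
srun --constraint=avx512 --pty /bin/bash

# List the features each node advertises (%N = node names, %f = features)
sinfo -o "%N %f"
```

Users who don't care which processor they get simply omit --constraint.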

I have a very heterogeneous cluster with several different generations of AMD and Intel processors, and we use this method quite effectively.

If you want to continue down the road you've already started on, can you provide more information, like the partition definitions and the gres definitions? In general, Slurm should support submitting to multiple partitions.

Prentice

On 3/8/21 11:29 AM, Bas van der Vlies wrote:
Hi,

On this cluster I have version 20.02.6 installed. We have different partitions for CPU types and GPU types. We want to make it easy for users who do not care where their job runs, while experienced users can specify the gres type: cpu_type or gpu.

I have defined 2 cpu partitions:
 * cpu_e5_2650_v1
 * cpu_e5_2650_v2

and 2 gres cpu_type:
 * e5_2650_v1
 * e5_2650_v2


When no partitions are specified it will submit to both partitions:
 * srun --exclusive  --gres=cpu_type:e5_2650_v1  --pty /bin/bash --> r16n18, which has this gres defined and is in partition cpu_e5_2650_v1

While that job is still running, I submit another job:
 * srun --exclusive  --gres=cpu_type:e5_2650_v1  --pty /bin/bash

This fails with: `srun: error: Unable to allocate resources: Requested node configuration is not available`

I would expect it to be queued in the partition `cpu_e5_2650_v1`.


When I specify the partition on the command line:
 * srun  --exclusive -p cpu_e5_2650_v1_shared --gres=cpu_type:e5_2650_v1 --pty /bin/bash

srun: job 1856 queued and waiting for resources


So the question is: can Slurm handle submitting to multiple partitions when we specify gres attributes?

Regards


