On 3/9/21 3:16 AM, Ward Poelmans wrote:

Hi Prentice,

On 8/03/2021 22:02, Prentice Bisbal wrote:

I have a very heterogeneous cluster with several different generations of
AMD and Intel processors, and we use this method quite effectively.
Could you elaborate a bit more on how you manage that? Do you force your
users to pick a feature? What if a user submits a multi-node job, can
you make sure it will not start over a mix of avx512 and avx2 nodes?

I don't force the users to pick a feature, and to make matters worse, I think our login nodes are newer than some of the compute nodes, so it's entirely possible that if someone really optimizes their code for one of the login nodes, their job could get assigned to a node that doesn't understand the instruction set, resulting in the dreaded "Illegal Instruction" error. Surprisingly, this has only happened a few times in the 5 years I've been at this job.

I assume most users would want to use the newest and fastest processors if given the choice, so I set the priority weighting of the nodes so that the newest nodes are highest priority, and the oldest nodes the lowest priority.
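
Concretely, that just means setting Weight on the node definitions in slurm.conf; Slurm allocates the lowest-weight nodes first, so the newest hardware gets the smallest numbers. A rough sketch (node names, counts, and feature strings here are made up, not our actual config):

    # slurm.conf fragment - lower Weight = allocated first
    NodeName=epyc[001-032]    Weight=10  Feature=7281,avx2   # newest, preferred
    NodeName=opteron[001-048] Weight=100 Feature=6376        # oldest, used last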

The only way to make sure a job sticks to a certain instruction set is if the user specifies the processor model, rather than the instruction-set family. For example:

-C 7281 will get you only AMD EPYC 7281 processors

and

-C 6376 will get you only AMD Opteron 6376 processors
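
Those model numbers are just feature strings attached to the nodes in slurm.conf (as in the sketch above), and sbatch constraints can be combined as well, e.g. (job.sh here is just a placeholder):

    sbatch -C 7281 job.sh          # only EPYC 7281 nodes
    sbatch -C "7281|6376" job.sh   # either model is acceptable
    sbatch -C avx2 job.sh          # any node tagged with the avx2 feature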

Using your example, if you never want to mix AVX2 and AVX512 processors in the same job, you can "lie" to Slurm in your topology file and come up with a topology where the two subsets of nodes can't talk to each other. That way, Slurm will not mix nodes of the different instruction sets. The problem with this is that it's a "permanent" solution - it's not flexible. I would imagine there are times when you would want to use both your AVX2 and AVX512 processors in a single job.
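
For example, a topology.conf along these lines (switch and node names invented for illustration) keeps the two groups apart, because no switch connects them:

    # topology.conf sketch - with topology/tree, a job can't be placed
    # across switches that have no common parent
    SwitchName=sw_avx2   Nodes=avx2node[001-064]
    SwitchName=sw_avx512 Nodes=avx512node[001-032]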

I do something like this because we have 10 nodes set aside for serial jobs that are connected only by 1 GbE. We obviously don't want internode jobs running there, so in my topology file, each of those nodes has its own switch that's not connected to any other switch.
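
In topology.conf that looks roughly like this (names again made up):

    # each serial node sits behind its own fake switch, so Slurm will
    # never place a multi-node job on them
    SwitchName=serialsw01 Nodes=serialnode01
    SwitchName=serialsw02 Nodes=serialnode02
    # ... one line per serial node, through serialsw10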


If you want to continue down the road you've already started on, can you
provide more information, like the partition definitions and the gres
definitions? In general, Slurm should support submitting to multiple
partitions.
As far as I understood it, you can give a comma-separated list of
partitions to sbatch, but it's not possible to do this by default?


Incorrect. Giving a comma-separated list is possible and is the default behavior for Slurm. From the sbatch documentation (emphasis added to the relevant sentence):

*-p*, *--partition*=</partition_names/>
    Request a specific partition for the resource allocation. If not
    specified, the default behavior is to allow the slurm controller
    to select the default partition as designated by the system
    administrator. *If the job can use more than one partition,
    specify their names in a comma separate list and the one offering
    earliest initiation will be used with no regard given to the
    partition name ordering (although higher priority partitions will
    be considered first).* When the job is initiated, the name of the
    partition used will be placed first in the job record partition
    string.
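
In practice it's just something like this (partition names are made up):

    sbatch -p general,serial,debug job.sh
    # the job starts in whichever listed partition can run it earliest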

You can't have a job *span* multiple partitions, but I don't think that was ever your goal.


Prentice
