Re: [slurm-users] Configuration recommendations for heterogeneous cluster

Prentice Bisbal Wed, 23 Jan 2019 07:04:24 -0800

Cyrus,

Thanks for the input. Yes, I have considered features/constraints as part of this, and I'm already using them for users to request IB. They are definitely a key part of my strategy. I will look into Spank and PriorityTiers. One of my goals is to reduce the amount of scripting/customization I need to do, so if using Spank plugins requires a lot of development on my part, that may be counterproductive for me.

There are several ways to approach this and I imagine you really wish
the users to be able to "just submit" with a minimum of effort and
information on their part while your life is also manageable for changes
or updates.

Not exactly. I wouldn't say I want them to 'just submit' with minimal effort. I think that's a recipe for disaster: they don't specify the right time limits or the correct resources, which then causes their job to stay queued or prevents backfill scheduling from working, or they use a node with 512 GB to run a single-core job that only uses 4 GB of RAM.

What I want is for my users to think about the *resources* they need for their job, and not what partition they submit to. Right now, they just think about what partition they want their job to run on, and submit their job to that partition. Often, they will always use the same queue for every job, regardless of the differing resource requirements. While there is some logic as to why my cluster is divided into the different partitions, I find most users ignore this information, and just always submit to the same queue, job after job, day after day, year after year. I want my users to stop thinking in terms of partition names, and start thinking in terms of what resources their job *really* needs. This will ultimately improve cluster utilization, and reduce time spent in the queue. Some users will submit a job, and as soon as it goes into the pending state, they scancel it, change the partition name to a less utilized partition, and resubmit it in the hopes it will start running immediately.

Yes, there needs to be a lot of user training, and there's a lot I can do to improve the environment for my users, but making the scheduler more flexible needs to be one of the first steps in my vision to improve things here.


Prentice

On 1/22/19 6:50 PM, Cyrus Proctor wrote:

Hi Prentice,

Have you considered Slurm features and constraints at all? You provide
features (arbitrary strings in your slurm.conf) of what your hardware
can provide ("amd", "ib", "FAST", "whatever"). A user then will list
constraints using typical and/or/regex notation ( --constraint=amd&ib ).
You may override or autofill constraint defaults yourself in your
job_submit.lua.
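
As a rough sketch of the above (the node names, counts, and feature strings
here are hypothetical and not taken from any real cluster), features might
be declared in slurm.conf like this:

```
# slurm.conf fragment; all node names and hardware values are made up
NodeName=k[001-032] CPUs=32 RealMemory=64000  Feature=amd,ib
NodeName=e[001-016] CPUs=16 RealMemory=64000  Feature=amd,gbe
NodeName=m[001-008] CPUs=32 RealMemory=128000 Feature=intel,gbe
```

A job would then request matching hardware with something like
sbatch --constraint="amd&ib" job.sh (quoting the expression keeps the shell
from interpreting the ampersand).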

Another level: you may also create your own Slurm arguments to sbatch or
srun using SPANK plugins. These could be used to simplify a constraint
list in whatever way you might see fit (e.g. sbatch --fast equates to
--constraint=amd&ib&FAST ).

So, as a possibility, keep all nodes in one partition, supply the
features in slurm.conf, have job_submit.lua give a default set of
constraints (and/or force the user to provide a minimum set), create
another partition that includes all the nodes as well but is
preemptable/VIP/whatever (PriorityTiers work nice here too).
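
One way this two-partition layout could look in slurm.conf, as a sketch
only (the partition names and tier values are invented for illustration):

```
# Global preemption: partitions with a higher PriorityTier may preempt lower ones
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Normal partition covering every node; its jobs are never preempted
PartitionName=general   Nodes=ALL Default=YES PriorityTier=10 PreemptMode=OFF

# Same nodes, lower tier; jobs here are requeued when 'general' needs the nodes
PartitionName=scavenger Nodes=ALL PriorityTier=1  PreemptMode=REQUEUE
```

Requeue-based preemption assumes the preempted jobs are requeueable (for
example, submitted with --requeue or with JobRequeue=1 set).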

There are several ways to approach this and I imagine you really wish
the users to be able to "just submit" with a minimum of effort and
information on their part while your life is also manageable for changes
or updates. I find the logic of the feature/constraint system to be
quite elegant for meeting complex needs of heterogeneous systems.

Best,

Cyrus

On 1/22/19 2:49 PM, Prentice Bisbal wrote:

I left out a *very* critical detail: One of the reasons I'm looking
at revamping my Slurm configuration is that my users have requested
the capability to submit long-running, low-priority interruptible jobs
that can be killed and requeued when shorter-running, higher-priority
jobs need to use the resources.

Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov

On 1/22/19 3:38 PM, Prentice Bisbal wrote:

Slurm Users,

I would like your input on the best way to configure Slurm for a
heterogeneous cluster I am responsible for. This e-mail will probably
be a bit long to include all the necessary details of my environment
so thanks in advance to those of you who read all of it!

The cluster I support is a very heterogeneous cluster with several
different network technologies and generations of processors.
Although some people here refer to this cluster as numerous
different clusters, in reality it is one cluster, since all the nodes
have their work assigned to them from a single Slurm Controller, all
the nodes use the same executables installed on a shared drive, and
all nodes are diskless and use the same NFSroot OS image, so they are
all configured 100% alike.

The cluster has been built piece-meal over a number of years, which
explains the variety of hardware/networking in use. In Slurm, each of
the different "clusters" is a separate partition intended to serve
different purposes:

Partition "E" - AMD Opteron 6320 processors, 64 GB RAM/node, 1 GbE,
meant for serial, and low task count parallel jobs that only use a
few cores and stay within a single node. Limited to 16 tasks or less
in QOS

Partition "D" - AMD Opteron 6136, 6274, and 6376 processors, 32 GB or
64 GB RAM per node, 10 GbE, meant for general-purpose parallel jobs
spanning multiple nodes. Min. Task count of 32 tasks to prevent
smaller jobs that should be run on Partition E from running here.

Partition "K"  - AMD Opteron 6274 and 6376 processors, 64 GB RAM per
node, DDR IB network, meant for tightly-coupled parallel jobs

Partition "G1" - AMD Opteron 6274, 6276, 6376, and Intel Xeon E5-2698
v3 &  E5-2630 v3 processors, RAM ranging from 128 GB - 512 GB per
node, 1 GbE Network, meant for "large memory" jobs - some nodes are
in different racks attached to different switches, so not really
optimal for multi-node jobs.

Partition "J" -  AMD Opteron 6136 Processors, 280 GB RAM per node,
DDR IB, was originally meant for a specific project, I now need to
allow general access to it.

Partition "G2" - AMD Opteron 6136, 6274, and 6320 processors, 32 GB,
96 GB, and 128 GB RAM per node, IB network, access is restricted to
specific users/projects.

Partition "M" - Intel Xeon E5-2698 v3 and E5-2697A v4 processors, 128
GB RAM per node, 1 GbE network, reserved for running 1 specific
simulation application.

To make all this work so far, I have created a job_submit.lua script
with numerous checks and conditionals that has become quite unwieldy.
As a result, changes that should be simple take a considerable amount
of time for me to rewrite and test the script. On top of that, almost
all of the logic in that script is logic that Slurm can already
perform in a more easily manageable way. I've essentially re-invented
wheels that Slurm already provides.

Further, each partition has its own QOS, so my job_submit.lua
assigns each job to a specific partition and QOS depending on its
resource requirements. This means that a job assigned to D, which
could also run on K if K is idle, will never be able to run on
K. This means cluster nodes could go unutilized, reducing cluster
utilization stats (which management looks at), and increasing job
queue time (which users are obsessed with).

I would like to simplify this configuration as much as possible to
reduce the labor it takes me to maintain my job_submit.lua script,
and therefore make me more responsive to meeting my users' needs, and
increase cluster utilization. Since I have numerous different
networks, I was thinking that I could use the topology.conf file to
keep jobs on a single network, and prevent multi-node jobs from running
on partition E.  The partitions reserved for specific
projects/departments would still need to be requested explicitly.
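
For the multi-node restriction on E specifically, a partition-level limit
may be enough without topology.conf. A minimal sketch, assuming a
hypothetical node range for partition E:

```
# Jobs in partition E may not span more than one node
PartitionName=E Nodes=e[001-016] MaxNodes=1
```

This only caps the node count per job in that partition; it does not by
itself keep jobs in other partitions on a single network.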

At first, I was going to take this approach:

1. Create a single partition with all the general access nodes

2. Create a topology.conf file to make sure jobs stay within a single
network.

3. Assign weights to the different partitions so that Slurm will try
to assign jobs to them in a specific order of preference

4. Assign weights to the different nodes, so that the nodes with the
fastest processors are preferred.

After getting responses to my questions about the topology.conf file,
it seems this approach may not be viable, or at least may not be the
best procedure.

I am now considering this:

0. Restrict access to the non-general access partitions (this is
already done for the most part, hence step 0).

1. Assign each Partition its own QOS in the slurm.conf file.

2. Assign a weight to the partitions so Slurm attempts to assign jobs
to them in a specific order.

3. Assign weights to the nodes so the nodes are assigned in a
specific order (faster processors first)

4. Set job_submit plugin to all_partitions, or partition


Step 4 in this case is the area I'm the least familiar with. One of
the reasons we are using a job_submit.lua script is because users
will often request partitions that are inappropriate for their job
needs (like trying to run a job that spans multiple nodes on a
partition with only 1 GbE, or request partition G because it's free,
but their job only uses 1 MB of RAM). I'm also not sure if I want to
give up using job_submit.lua 100%  by switching job_submit_plugin to
"partition"

My ultimate goal is to have users specify what resources they need
without specifying a QOS or Partition, and let Slurm handle that
automatically based on the weights I assign to the nodes and
partitions.  I also don't want to lock a job to a specific partition
at submit time so Slurm can allocate it to idle nodes in a different
partition if that partition has idle nodes when the job is finally
eligible to run.

What is the best way to achieve my goals? All suggestions will be
considered.

For those of you who made it this far, thanks!

Prentice
