Hello Benson,

On 24/11/2020 14.20, Benson Muite wrote:
Am setting up SLURM on a single shared memory machine. Found the following blog post:
http://rolk.github.io/2015/04/20/slurm-cluster

Sorry, but that is only a random, outdated blog post from 2015.
Even Slurm 16.05, as packaged for Debian 9 (stretch), handles cgroups automatically -- you don't have to set them up manually.

If you're not compiling from source, I recommend looking up which version is packaged for your distribution. Depending on your choice, start either with the official documentation for the current version, 20.11,
https://slurm.schedmd.com/
or with the documentation for your packaged version in the archive:
https://slurm.schedmd.com/archive/


The main suggestion is to use cgroups to partition the resources. Are there any other suggestions of changes to implement that differ from the standard cluster setup?

I would start with the defaults and read, read, ..., read, while adding features step by step. I think imitating a setup you don't really understand is a really bad idea. There will be more than enough questions even when starting with the basics.

Start slow. Try to look up the defaults of your packaged version; if you're compiling from source, use the config generators from SchedMD after reading through the basics. You could run multiple jobs on a single node even before cgroups. Try to find the relevant sections in the official documentation to understand how that works, its limitations, and why it might be a good thing to use cgroups nowadays. Again, having a vague idea that you want to "partition" the node won't get you very far. IMO it's better to have at least a basic idea of how Slurm operates.
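
For orientation only: on a reasonably recent version, a cgroup-based setup usually boils down to a few lines like the ones below. Treat this as a sketch, not a recommendation, and check every parameter against the slurm.conf and cgroup.conf man pages of your version before using it.

```conf
# --- slurm.conf (excerpt) ---
# Track job processes and confine tasks via cgroups.
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# --- cgroup.conf ---
# Restrict each job to the cores and memory it was allocated.
ConstrainCores=yes
ConstrainRAMSpace=yes
```

Constraining memory this way only makes sense if memory is also a resource Slurm allocates, which is one more reason to understand the select plugin first.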
[D.C.]
What do you want next?
(The first thing I wanted in a cluster was select/cons_res with CR_Core_Memory instead of the default select/linear. RTFM what that all means and why that is or isn't a good idea in your case, and when to use CR_CPU_Memory instead; next was understanding backfilling and its requirement for time limits.)
Read.
Test.
Optionally ask on the list if you're having a single concrete issue.
[Da Capo al Fine]
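
To make the example in parentheses above a bit more concrete, here is roughly what such a slurm.conf excerpt could look like for a single machine. The node name "bigbox", the partition name "main" and all sizes and time limits are made up for illustration; `slurmd -C` prints the hardware values Slurm detects on your machine.

```conf
# Allocate individual cores and memory instead of whole nodes.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Backfill can only slot jobs into gaps if it knows how long they run,
# hence the time limits on the partition below.
SchedulerType=sched/backfill

# Memory (in MB) assigned per allocated CPU when a job requests none.
DefMemPerCPU=4000

# Hypothetical hardware description; replace with your own values.
NodeName=bigbox Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=128000 State=UNKNOWN
PartitionName=main Nodes=bigbox Default=YES DefaultTime=01:00:00 MaxTime=7-00:00:00 State=UP
```

Whether CR_Core_Memory or CR_CPU_Memory fits better depends on your hardware and jobs, so read the documentation on consumable resources before copying any of this.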

(In parallel and repeatedly):
1a) The official documentation
1b) Ole Holm Nielsen's docs, starting at https://wiki.fysik.dtu.dk/niflheim/Slurm_installation -- even if you're using a Debian-based distribution, read it to get an understanding of the different parts a Slurm installation is made of.

(For anything beyond the basics you didn't grasp from 1a & b):
2) Blog of Chris Samuel (csamuel.org)

Reading the list for a longer time and trying to understand the topics that might be applicable to your setup will help a lot -- and you'll notice who else on the list you prefer reading / who is willing to answer questions you have / has similar issues that get answers you can learn from.

Good luck,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Redling
