Hi,

We are seeing strange behaviour in Slurm after updating from 18.08.7 to 18.08.8, for jobs that use --exclusive and --mem-per-cpu.
Our nodes have 128 GB of memory and 28 cores.
$ srun --mem-per-cpu=30000 -n 1 --exclusive hostname
=> works in 18.08.7
=> doesn’t work in 18.08.8
In 18.08.8:
- If --mem-per-cpu is lower than (full_memory_size_of_node / nb_cores_per_node), it
works fine (here, lower than 4681 MB, since 128 GB / 28 cores ≈ 4681 MB).
- If --mem-per-cpu is higher, the job stays pending even though the start date is set
to now. In the slurmctld logs, we see the error "backfill: Failed to start JobId=xxxx
with reserve avail: Requested nodes are busy" every 30 s: slurmctld tries to start it
again and again.
- If I use --exclusive=user, it works (see the example commands below).
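For example, on our 128 GB / 28-core nodes (the memory values are only illustrative,
chosen around the ~4681 MB threshold):

$ srun --mem-per-cpu=4681 -n 1 --exclusive hostname        # starts
$ srun --mem-per-cpu=30000 -n 1 --exclusive hostname       # stays pending in 18.08.8
$ srun --mem-per-cpu=30000 -n 1 --exclusive=user hostname  # starts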
On another cluster, I also tried version 19.05.2: I see the same behaviour.
In slurm-19.05.3, the job is rejected with the error: "srun: error:
Unable to allocate resources: Requested node configuration is not available".
I can’t upgrade my production cluster to version 19… Will there be a patch for the 18
branch?
We have a workaround using --exclusive, --ntasks-per-node and (--ntasks or
--nodes).
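A sketch of that workaround (task counts and memory values are only illustrative,
assuming a 128 GB / 28-core node):

$ srun --exclusive --mem-per-cpu=30000 --ntasks-per-node=4 --ntasks=8 hostname
$ # or, equivalently, with an explicit node count instead of --ntasks:
$ srun --exclusive --mem-per-cpu=30000 --ntasks-per-node=4 --nodes=2 hostname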
But sometimes, when running depopulated (fewer tasks per node than cores), asking only
for --ntasks and --mem-per-cpu with --exclusive makes it easy to change a job by
increasing the memory per task without knowing the memory size of the node: Slurm
calculates how the tasks are distributed over the right number of nodes...
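That is, ideally we would only have to write something like (memory value illustrative):

$ srun --exclusive --mem-per-cpu=30000 -n 8 hostname

and let Slurm work out the distribution itself (here, roughly 4 tasks per node over
2 of our 128 GB nodes).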
Was this new behaviour intentional? I can’t find anything about it in the
release notes (except the patch for 19.05.3).
We have academic and non-academic users on the same cluster, and the non-academic
users ask for --exclusive.
Thank you in advance for your help,
Sincerely,
Béatrice
--
Béatrice CHARTON | CRIANN
[email protected] | 745, avenue de l'Université
Tel : +33 (0)2 32 91 42 91 | 76800 Saint Etienne du Rouvray
--- Support : [email protected] ---
