Hi Michael,
if you submit a job-array, all resource-related options (number of
nodes, tasks, CPUs per task, memory, time, ...) apply *per array-task*.
So in your case you start 100 array-tasks (you could also call them
"sub-jobs") that are *each* (not your whole job) limited to one node.
Since every array-task gets its own allocation, the tasks are free to
land on different nodes, which is why you see several hostnames across
your output files.
It's Friday and I'm either doing something silly or have a misconfig
somewhere; I can't figure out which.
When I run

sbatch --nodes=1 --cpus-per-task=1 --array=1-100 --output test_%A_%a.txt --wrap 'uname -n'

sbatch doesn't seem to be adhering to the --nodes param: when I look
at my output files I see several different hostnames.
Just an update to say that, for me, this issue appears to be specific
to the `runc` runtime (or `nvidia-container-runtime` when it uses
`runc` internally). I switched to `crun` and the problem went away:
containers run with `srun --container` now terminate once the inner
process exits.
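For anyone wanting to try the same switch: the runtime is selected in
oci.conf on the compute nodes. This is roughly the crun example from
the Slurm containers documentation; the --rootless/--root options and
paths are site-specific assumptions you may need to adjust:

# /etc/slurm/oci.conf -- drive container steps with crun instead of runc
RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t SIGTERM"
RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="crun --rootless=true --root=/run/user/%U/ run --bundle %b %n.%u.%j.%s.%t"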
It could be systemd doing that. Since slurmdbd is being started with
-D, I would verify that slurmdbd.service has Type=simple and not
Type=forking. With -D the daemon stays in the foreground, so under
Type=forking systemd would wait for a fork that never happens and
eventually time out and kill the service. The systemctl status output
later in the thread shows systemd starting slurmdbd with -D.
If that's the slurmdbd package from Ubuntu, you might want to check the
unit file it ships and, if Type= needs changing, do it via a drop-in
override rather than by editing the packaged file.
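A minimal sketch of such an override, assuming the Debian/Ubuntu binary
path /usr/sbin/slurmdbd ("systemctl edit slurmdbd" creates and reloads
the drop-in for you):

# /etc/systemd/system/slurmdbd.service.d/override.conf
[Service]
Type=simple
# an empty ExecStart= clears the packaged value before redefining it
ExecStart=
ExecStart=/usr/sbin/slurmdbd -D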