You have two options for managing those dependencies, as I see it:
1. You use SLURM’s native job dependencies, but this requires you to create
a build script for SLURM (see the sketch after this list).
2. You use make to submit the jobs, and take advantage of the -j flag to
make it run lots of tasks at once; just use a jo
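For option 1, a minimal sketch of chaining steps with sbatch's --dependency
flag (the step*.sh script names are made up; use whatever your build steps
actually are):

jid1=$(sbatch --parsable step1.sh)                             # capture the job ID of the first step
jid2=$(sbatch --parsable --dependency=afterok:$jid1 step2.sh)  # starts only if step1 exits cleanly
sbatch --dependency=afterok:$jid2 step3.sh                     # final step waits on step2

For option 2, the idea is simply make -j <N> with recipes that call srun, so
make's job limit becomes your concurrency limit.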
Noam,
Thanks for the suggestion, but no luck:
sbatch -p multinode -n 80 --ntasks-per-core=1 --wrap="..."
sbatch: error: Batch job submission failed: Node count specification invalid
sbatch -p multinode -n 2 -c 40 --ntasks-per-core=1 --wrap="..."
sbatch: error: Batch job submission failed: Node count specification invalid
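In case it helps while we sort out the new behaviour, a variant that spells
out the node count explicitly (just a guess at a workaround, not a confirmed
fix; keep your own wrapped command):

sbatch -p multinode -N 2 --ntasks-per-node=40 --wrap="..."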
Hi George,
George Leaver via slurm-users writes:
> Hello,
>
> Previously we were running 22.05.10 and could submit a "multinode" job
> using only the total number of cores to run, not the number of nodes.
> For example, in a cluster containing only 40-core nodes (no
> hyperthreading), Slurm woul
At a certain point, you’re talking about workflow orchestration. Snakemake [1]
and its slurm executor plugin [2] may be a starting point, especially since
Snakemake is a local-by-default tool. I wouldn’t try reproducing your entire
“make” workflow in Snakemake. Instead, I’d define the roughly 60
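As a very rough sketch of the command-line side (the plugin from [2] installs
with pip; the job limit below is arbitrary):

pip install snakemake-executor-plugin-slurm    # slurm executor plugin [2]
snakemake --executor slurm --jobs 20           # Snakemake submits each rule as its own sbatch job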
Hi,
I need to temporarily dedicate one of our compute nodes to a single account.
To do this, I was going to create a new partition but I'm running into an error
where
scontrol create partition
outputs "scontrol: error: Invalid input: partition Request aborted" regardless
of what parameters I specify.
Hi Daniel,
you can create a reservation for the node for that account.
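Something along these lines (node name, account, and reservation name are
placeholders; flags=ignore_jobs just lets the reservation start while jobs are
still running on the node):

scontrol create reservation reservationname=acct_only accounts=someaccount nodes=node001 starttime=now duration=UNLIMITED flags=ignore_jobs
# users in that account then submit with: sbatch --reservation=acct_only ...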
Regards,
Gerald Schneider
--
Gerald Schneider
Fraunhofer-Institut für Graphische Datenverarbeitung IGD
Joachim-Jungius-Str. 11 | 18059 Rostock | Germany
Tel. +49 6151 155-309 | +49 381 4024-193 | Fax +49 381 4024-199
ge
This looks perfect. Thank you very much.
From: Schneider, Gerald via slurm-users
Sent: Monday, June 10, 2024 9:14 AM
To: slurm-us...@schedmd.com
Subject: [slurm-users] Re: scontrol create partition fails
Hi Daniel,
you can create a reservation for the node for that account.
Regards,
Gerald
Hi,
We have a 16-node cluster of DGX-A100 (80 GB) systems.
128 cores of each node are separated into a dedicated partition for CPU-only
jobs, and the 8 GPUs plus 128 cores are in another partition for cpugpu jobs.
We want to ensure that only the selected 128 cores are part of the cpu
partition. (NUMA / Symm
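A rough sketch of the kind of slurm.conf split I mean (the topology numbers
are placeholders for whatever the nodes actually report, and MaxCPUsPerNode
only caps how many cores the cpu partition may use per node, it does not pin
specific cores or NUMA domains):

NodeName=dgx[01-16] Sockets=2 CoresPerSocket=64 ThreadsPerCore=2 Gres=gpu:8
PartitionName=cpu    Nodes=dgx[01-16] MaxCPUsPerNode=128
PartitionName=cpugpu Nodes=dgx[01-16]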
I have two machines. When I run "srun hostname" on one machine (it's both a
controller and a node), I get the hostname fine, but I get a "socket timed
out" error in these two situations:
1) "srun hostname" on 2nd machine (it's a node)
2) "srun -N 2 hostname" on controller
"scontrol show node" show