[slurm-users] Re: Software builds using slurm

2024-06-10 Thread Cutts, Tim via slurm-users
You have two options for managing those dependencies, as I see it:
1. You use SLURM's native job dependencies, but this requires you to create a build script for SLURM.
2. You use make to submit the jobs, and take advantage of the -j flag to make it run lots of tasks at once; just use a jo…
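For reference, a minimal sketch of the first option, chaining build steps with SLURM's native job dependencies (the script names and the single linear chain are illustrative):

    # Submit the first build step and capture its job ID.
    jid=$(sbatch --parsable build_libfoo.sh)
    # Run the next step only if the previous one completed successfully.
    jid=$(sbatch --parsable --dependency=afterok:$jid build_app.sh)
    # Final link step, again gated on success of its prerequisite.
    sbatch --dependency=afterok:$jid link_all.sh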

[slurm-users] Re: sbatch: Node count specification invalid - when only specifying --ntasks

2024-06-10 Thread George Leaver via slurm-users
Noam, Thanks for the suggestion but no luck:
    sbatch -p multinode -n 80 --ntasks-per-core=1 --wrap="..."
    sbatch: error: Batch job submission failed: Node count specification invalid
    sbatch -p multinode -n 2 -c 40 --ntasks-per-core=1 --wrap="..."
    sbatch: error: Batch job submission failed: Node co…
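One workaround that may apply here, assuming the partition now insists on an explicit node count (the -N value and the per-node split are illustrative):

    # Spell out the node count and per-node task count instead of
    # relying on Slurm to derive the node count from total tasks.
    sbatch -p multinode -N 2 --ntasks-per-node=40 --wrap="..."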

[slurm-users] Re: sbatch: Node count specification invalid - when only specifying --ntasks

2024-06-10 Thread Loris Bennett via slurm-users
Hi George,

George Leaver via slurm-users writes:
> Hello,
>
> Previously we were running 22.05.10 and could submit a "multinode" job
> using only the total number of cores to run, not the number of nodes.
> For example, in a cluster containing only 40-core nodes (no
> hyperthreading), Slurm woul…

[slurm-users] Re: Software builds using slurm

2024-06-10 Thread Renfro, Michael via slurm-users
At a certain point, you’re talking about workflow orchestration. Snakemake [1] and its slurm executor plugin [2] may be a starting point, especially since Snakemake is a local-by-default tool. I wouldn’t try reproducing your entire “make” workflow in Snakemake. Instead, I’d define the roughly 60…
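If Snakemake fits, invoking it with the Slurm executor plugin looks roughly like this (the job cap and resource defaults are illustrative, not from the original message):

    # Install Snakemake together with its Slurm executor plugin.
    pip install snakemake snakemake-executor-plugin-slurm
    # Let Snakemake submit each rule as its own Slurm job, up to 60 at a time.
    snakemake --executor slurm --jobs 60 --default-resources mem_mb=4000 runtime=60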

[slurm-users] scontrol create partition fails

2024-06-10 Thread Long, Daniel S. via slurm-users
Hi, I need to temporarily dedicate one of our compute nodes to a single account. To do this, I was going to create a new partition, but I'm running into an error where scontrol create partition outputs "scontrol: error: Invalid input: partition Request aborted" regardless of what parameters…
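One likely cause, offered as a guess: scontrol's create command takes key=value specifications rather than a bare "partition" keyword, so something along these lines may work (the partition, node, and account names are illustrative):

    # Create a partition restricted to one account on one node ...
    scontrol create PartitionName=dedicated Nodes=node001 AllowAccounts=projA
    # ... and delete it again once the temporary need has passed.
    scontrol delete PartitionName=dedicated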

[slurm-users] Re: scontrol create partition fails

2024-06-10 Thread Schneider, Gerald via slurm-users
Hi Daniel,

you can create a reservation for the node for the said account.

Regards,
Gerald Schneider

--
Gerald Schneider
Fraunhofer-Institut für Graphische Datenverarbeitung IGD
Joachim-Jungius-Str. 11 | 18059 Rostock | Germany
Tel. +49 6151 155-309 | +49 381 4024-193 | Fax +49 381 4024-199
ge…
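As a sketch of that suggestion (the reservation, node, and account names are illustrative), an account-limited reservation might look like:

    # Reserve the node indefinitely for a single account.
    scontrol create reservation ReservationName=projA_only StartTime=now \
        Duration=infinite Nodes=node001 Accounts=projA
    # Jobs from that account then opt in with:
    sbatch --reservation=projA_only job.sh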

[slurm-users] Re: scontrol create partition fails

2024-06-10 Thread Long, Daniel S. via slurm-users
This looks perfect. Thank you very much.

From: Schneider, Gerald via slurm-users
Sent: Monday, June 10, 2024 9:14 AM
To: slurm-us...@schedmd.com
Subject: [slurm-users] Re: scontrol create partition fails

Hi Daniel, you can create a reservation for the node for the said account. Regards, Gerald…

[slurm-users] Issue about selecting cpus for optimization

2024-06-10 Thread Purvesh Parmar via slurm-users
Hi, We have a 16-node cluster with DGX-A100 (80 GB). We have the 128 cores of each node separated into a dedicated partition for CPU-only jobs, and 8 GPUs plus 128 cores in other partitions for cpugpu jobs. We want to ensure that only the selected 128 cores are part of the cpu partition. (NUMA / Symm…
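One partial approach, as a sketch: the partition-level MaxCPUsPerNode parameter can cap how many cores each partition may consume on a node, though by itself it does not pin which NUMA domains those cores come from (node names are illustrative):

    # slurm.conf sketch: cap each partition at 128 cores per node.
    # This limits the count, not the placement; pinning specific cores
    # would additionally need CPU-binding / cgroup configuration.
    PartitionName=cpu    Nodes=dgx[01-16] MaxCPUsPerNode=128
    PartitionName=cpugpu Nodes=dgx[01-16] MaxCPUsPerNode=128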

[slurm-users] srun hostname - Socket timed out on send/recv operation

2024-06-10 Thread Arnuld via slurm-users
I have two machines. When I run "srun hostname" on one machine (it's both a controller and a node) then I get the hostname fine, but I get a socket timed out error in these two situations:
1) "srun hostname" on the 2nd machine (it's a node)
2) "srun -N 2 hostname" on the controller
"scontrol show node" show…
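Typical first checks for this symptom, assuming default ports and an illustrative node name: srun needs the node's slurmd (default port 6818) and the controller's slurmctld (default port 6817) reachable in both directions, with consistent hostname resolution on both machines.

    # Can this host reach slurmctld at all?
    scontrol ping
    # Is the compute node registered and responding?
    scontrol show node node02
    # Is slurmd on the node reachable from here?
    nc -zv node02 6818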