Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-10 Thread Arthur Gilly
Thanks Michael, set -e errexit is the same as setting #!/bin/bash -e as interpreter as far as I’m aware. As I mention in the original post, I would like to avoid that. It involves modifying scripts (although to a lesser extent), and it would end script execution for other runtime errors or non-0

Re: [slurm-users] [EXT] Re: Is there a scontrol ping slurmdbd?

2021-06-10 Thread Heitor
On Thu, 10 Jun 2021 07:20:51 + Sean Crosby wrote: > We use sacctmgr list stats for our Slurmdbd check > > Our Nagios check is > > RESULT=$(/usr/local/slurm/latest/bin/sacctmgr list stats) > if [ $? -ne 0 ] > then > echo "ERROR: cannot connect to database" > exit 2 > fi > ech

Re: [slurm-users] Job requesting two different GPUs on two different nodes

2021-06-10 Thread Diego Zuccato
On 10/06/2021 11:35, Gestió Servidors wrote: I'm no SLURM expert, but a jobfile like this should work: #!/bin/bash # #SBATCH --job-name=N2n4 #SBATCH --partition=cuda.q #SBATCH --output=N2n4-CUDA.txt #SBATCH -N 1 # number of nodes with the first GPU #SBATCH -n 2 # number of cores #SBATCH --g

Re: [slurm-users] Job requesting two different GPUs on two different nodes

2021-06-10 Thread Diego Zuccato
On 08/06/2021 15:55, Gestió Servidors wrote: Have you tried defining it as a heterogeneous job? https://slurm.schedmd.com/heterogeneous_jobs.html #SBATCH hetjob for new SLURM versions or #SBATCH packjob for older ones HIH, Diego Hi, Today, doing some tests, I have not got a solution to

Re: [slurm-users] Job requesting two different GPUs on two

2021-06-10 Thread Gestió Servidors
Hello, No, with "#SBATCH --gres=gpu:2" SLURM searches for a node with 2 GPUs, but I need to run my job on 2 nodes using 2 GPUs, with one GPU on each node. If both GPUs are the same, the job runs OK, but I want to test running my job on two nodes: one offers a GeForceRTX3080 and the second offers a GeForceRTX20

[slurm-users] delete account and reservations

2021-06-10 Thread Jaap Dijkshoorn
Hi, I was wondering about the following. If I have a reservation with accounts associated with it and I delete the account with sacctmgr, I do not get any message; it just deletes the account. But then when you want to update the reservation (with the deleted account still associated with it) you

Re: [slurm-users] [EXT] Re: Is there a scontrol ping slurmdbd?

2021-06-10 Thread Sean Crosby
We use sacctmgr list stats for our Slurmdbd check Our Nagios check is RESULT=$(/usr/local/slurm/latest/bin/sacctmgr list stats) if [ $? -ne 0 ] then echo "ERROR: cannot connect to database" exit 2 fi echo "$RESULT" | head -n 4 exit 0 Sean From: sl
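The check quoted in this post can be laid out as a small reusable function. This is only a minimal sketch, not the poster's exact script: the function name `check_slurmdbd` and the `SACCTMGR` override variable are assumptions introduced here for illustration, so that the site-specific path from the post (/usr/local/slurm/latest/bin/sacctmgr) can be supplied without editing the code. The `sacctmgr list stats` command and the exit codes 2 and 0 come from the post; they match the usual Nagios plugin convention of 2 for CRITICAL and 0 for OK.

```shell
#!/bin/bash
# Sketch of a Nagios-style slurmdbd check, modelled on the script in the post.
# SACCTMGR is a hypothetical override; it defaults to whatever sacctmgr is on PATH.
SACCTMGR="${SACCTMGR:-sacctmgr}"

check_slurmdbd() {
    local result
    # Declaring 'result' separately keeps the exit status of the command
    # substitution, which 'local result=$(...)' would hide.
    if ! result=$("$SACCTMGR" list stats 2>&1); then
        echo "ERROR: cannot connect to database"
        return 2    # Nagios CRITICAL
    fi
    # Keep the plugin output short: only the first four lines of the stats.
    echo "$result" | head -n 4
    return 0        # Nagios OK
}
```

A plugin script would end with `check_slurmdbd; exit $?` so that the function's return value becomes the status Nagios sees.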