Re: [slurm-users] Job requesting two different GPUs on two different nodes

2021-06-08 Thread Loris Bennett
Hi Gestió, Gestió Servidors writes: > Hi, > Today, while doing some tests, I have not managed to write a submit script > that requests 2 different GPUs on 2 different nodes. With this simple script: > #!/bin/bash > # > #SBATCH --job-name=N2n4 > #SBATCH --output=N2n4-CUDA.txt >

Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-08 Thread Arthur Gilly
I could say that the limit on max array sizes is lower on our cluster, and we start to see I/O problems very fast as parallelism scales (which we can limit with % as you mention). But the actual reason is simpler: as I mentioned, we have an entire collection of scripts which were written for a pr

Re: [slurm-users] Maui equivalent Nodeallocationpolicy

2021-06-08 Thread Lyn Gerner
David, take a look at the various instances of the string "LLN" throughout slurm.conf, as well as pack_serial_at_end. (I suspect you may want LLN=no on your partition definition.) Best, Lyn On Tue, Jun 8, 2021 at 11:51 AM David Chaffin wrote: > replying to myself as I can't quite figure out how
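For context, a minimal slurm.conf sketch showing where these options live (the partition name, node list, and parameter values below are placeholders, not taken from the thread):

    # Consumable-resource scheduling is assumed here
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory

    # Pack single-CPU (serial) jobs onto nodes at the end of the node list
    SchedulerParameters=pack_serial_at_end

    # LLN=yes sends each job to the least-loaded node; LLN=no is the default
    PartitionName=htc Nodes=node[01-10] LLN=no Default=YES State=UP

Changes to the partition and scheduler parameters take effect after an 'scontrol reconfigure' or a restart of slurmctld.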

Re: [slurm-users] Maui equivalent Nodeallocationpolicy

2021-06-08 Thread David Chaffin
replying to myself as I can't quite figure out how to reply to Jurgen in digest mode. Jurgen's pointers are good for some of our other issues, but I misstated the question: it should have been, how do I send the small HTC jobs to the node that currently has the fewest free cores? RTFM and I think this

[slurm-users] Job requesting two different GPUs on two different nodes

2021-06-08 Thread Gestió Servidors
Hi, Today, while doing some tests, I have not managed to write a submit script that requests 2 different GPUs on 2 different nodes. With this simple script: #!/bin/bash # #SBATCH --job-name=N2n4 #SBATCH --output=N2n4-CUDA.txt #SBATCH --gres=gpu:GeForceRTX3080:1 #SBATCH -N 2 # number of nodes #
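For comparison, a minimal sketch of a heterogeneous job with two components, each on one node with a different GPU type. This assumes Slurm 20.02 or later (the '#SBATCH hetjob' separator); the second GPU type string and the application name are placeholders:

    #!/bin/bash
    #SBATCH --job-name=two-gpu-types
    #SBATCH --output=two-gpu-types-%j.txt
    # Component 0: one node with one GeForceRTX3080
    #SBATCH -N 1
    #SBATCH --gres=gpu:GeForceRTX3080:1
    #SBATCH hetjob
    # Component 1: one node with a different GPU type (placeholder name)
    #SBATCH -N 1
    #SBATCH --gres=gpu:GeForceGTX1080Ti:1

    # Launch one step per component; --het-group selects the component
    srun --het-group=0 ./my_cuda_app &
    srun --het-group=1 ./my_cuda_app &
    wait

Both GPU type strings have to match the Type= values defined in gres.conf on the respective nodes.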

Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-08 Thread Renfro, Michael
Any reason *not* to create an array of 100k jobs and let the scheduler just handle things? Current versions of Slurm support arrays of up to 4M jobs, and you can limit the number of jobs running simultaneously with the '%' specifier in your --array= sbatch parameter.
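As a sketch, a throttled array submission along those lines might look like this (the script name, resource values, and the %500 throttle are placeholders; the site's MaxArraySize in slurm.conf must also be at least as large as the highest index):

    #!/bin/bash
    #SBATCH --job-name=chunked-analysis
    #SBATCH --output=logs/task_%A_%a.out
    # 100,000 tasks, at most 500 running at the same time
    #SBATCH --array=1-100000%500
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=4G
    #SBATCH --time=01:00:00

    # analysis.sh is a placeholder for the per-chunk work
    ./analysis.sh "${SLURM_ARRAY_TASK_ID}"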

Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-08 Thread Arthur Gilly
Thank you Loris! Like many of our jobs, this is an embarrassingly parallel analysis, where we have to strike a compromise between what would be a completely granular array of >100,000 small jobs and some kind of serialisation through loops. So the individual jobs where I noticed this behaviou

Re: [slurm-users] Kill job when child process gets OOM-killed

2021-06-08 Thread Loris Bennett
Dear Arthur, Arthur Gilly writes: > Dear Slurm users, > I am looking for a SLURM setting that will kill a job immediately when any > subprocess of that job hits an OOM limit. Several posts have touched upon > that, e.g.: > https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg0

[slurm-users] Kill job when child process gets OOM-killed

2021-06-08 Thread Arthur Gilly
Dear Slurm users, I am looking for a SLURM setting that will kill a job immediately when any subprocess of that job hits an OOM limit. Several posts have touched upon that, e.g.: https://www.mail-archive.com/slurm-users@l
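A minimal sketch of the cgroup-based memory enforcement that usually comes up in this context (fragments only; the values are illustrative, and whether the whole job is terminated when a single child hits the limit depends on the Slurm version and configuration):

    # slurm.conf (fragment)
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup

    # cgroup.conf (fragment)
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    AllowedSwapSpace=0

With cgroup limits the kernel OOM killer may kill only the offending child process, so a complementary option is 'srun --kill-on-bad-exit' (or KillOnBadExit=1 in slurm.conf), so that a step whose task dies with a non-zero exit code takes the rest of that step down with it.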

Re: [slurm-users] Slurm stats in JSON format

2021-06-08 Thread Ward Poelmans
On 8/06/2021 00:27, Sid Young wrote: > Is there a tool that will extract the job counts in JSON format? Such as > #running, #pending, #onhold, etc. > I am trying to build some custom dashboards for our new cluster and this > would be a really useful set of metrics to gather and display.
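In the absence of a built-in tool on that Slurm version, a small shell sketch along these lines can produce the counts as JSON (the state names are Slurm's standard ones; newer releases also offer a --json flag on squeue, depending on version):

    #!/bin/bash
    # Count jobs per state and print a one-line JSON object,
    # e.g. {"pending": 340, "running": 12}
    squeue -h -o '%T' | sort | uniq -c | \
      awk 'BEGIN { printf "{"; sep="" }
           { printf "%s\"%s\": %s", sep, tolower($2), $1; sep=", " }
           END { print "}" }'

Held jobs show up as PENDING here; they can be separated out by also printing the reason field (%r) and counting the JobHeldUser/JobHeldAdmin reasons.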