I can't speak to what happens on node failure, but I can at least get you a 
greatly simplified pair of scripts that will run only one copy on each node 
allocated:


#!/bin/bash
# notarray.sh
#SBATCH --nodes=28
#SBATCH --ntasks-per-node=1
#SBATCH --no-kill
echo "notarray.sh is running on $(hostname)"
srun --no-kill somescript.sh


and


#!/bin/bash
# somescript.sh
echo "somescript.sh is running on $(hostname)"


I can verify that after submitting the job with "sbatch notarray.sh":

  *   notarray.sh ran on only one allocated node, and
  *   somescript.sh ran once on each of the 28 nodes allocated, including the 
one that notarray.sh ran on.

No need to pass srun a set of parameters for how many tasks to run, since it 
picks that up from the sbatch allocation (SLURM_JOB_NUM_NODES, SLURM_NTASKS, 
and friends in the job environment).
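
If you want each copy to be its own independent job step, so that (at least in 
principle) losing one node only ends the step that was running on it, something 
along these lines might be worth testing. This is only a sketch that I haven't 
tried against a real node failure, notarray_steps.sh is just a name I've made 
up, and it assumes somescript.sh is executable in the submit directory:

#!/bin/bash
# notarray_steps.sh -- untested sketch: one single-node job step per copy
#SBATCH --nodes=28
#SBATCH --ntasks-per-node=1
#SBATCH --no-kill

# Start one single-task, single-node step per allocated node. The trailing
# '&' lets the steps run concurrently instead of one after another, and
# --no-kill asks Slurm not to terminate things automatically just because
# some other allocated node has failed.
for i in $(seq 1 "$SLURM_JOB_NUM_NODES"); do
    srun --no-kill --nodes=1 --ntasks=1 ./somescript.sh &
done

# Wait for all the background steps; otherwise the batch script would exit
# while steps are still running and the job would end early.
wait

Each step then has its own exit status, so a few failed copies should just mean 
a few missing result files rather than the whole job going down.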

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Robert Peck <rp1...@york.ac.uk>
Date: Friday, April 16, 2021 at 2:40 PM
To: slurm-us...@schedmd.com <slurm-us...@schedmd.com>
Subject: [slurm-users] Grid engine slaughtering parallel jobs when any one of them fails (copy)

Excuse me, I am trying to run some software on a cluster which uses the SLURM 
scheduler. IT support at my institution have exhausted their knowledge of SLURM 
in trying to debug this rather nasty problem with a specific feature of the 
scheduler, and suggested I try here for tips.

I am using jobs of the form:

#!/bin/bash
#SBATCH --job-name=name       # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=my_email@thing.thing     # Where to send mail

#SBATCH --mem=2gb                        # Job memory request, not hugely intensive
#SBATCH --time=47:00:00                  # Time limit hrs:min:sec; the sim software being run from within the bash script is quite slow, extra memory can't speed it up and it can't run multi-core, hence long runs on weak nodes

#SBATCH --nodes=100
#SBATCH --ntasks=100
#SBATCH --cpus-per-task=1

#SBATCH --output=file_output_%j.log        # Standard output and error log
#SBATCH --account=code       # Project account
#SBATCH --ntasks-per-core=1 #only 1 task per core, must not be more
#SBATCH --ntasks-per-node=1 #only 1 task per node, must not be more
#SBATCH --ntasks-per-socket=1 # guessing here but fairly sure I don't want multiple instances trying to use same socket

#SBATCH --no-kill # supposedly prevents restart of other jobs on other nodes if one of the 100 gets a NODE_FAIL


echo My working directory is `pwd`
echo Running job on host:
echo -e '\t'`hostname` at `date`
echo

module load toolchain/foss/2018b
cd scratch
cd further_folder
chmod +x my_bash_script.sh

srun --no-kill -N "${SLURM_JOB_NUM_NODES}" -n "${SLURM_NTASKS}" ./my_bash_script.sh

wait
echo
echo Job completed at `date`

I use a bash script to launch my special software, which actually handles each 
job. This software is a bit weird and two copies of it WILL NOT EVER play 
nicely if made to share a node. Hence this job launches 100 copies on 100 
nodes, each of which does its own work and writes out to a separate results 
file; I later process the results files.

In my scenario I want 100 jobs to run, but if one or two failed and I only got 
99 or 95 back, I could work fine with just 99 or 95 result files for further 
processing. Getting back a few fewer jobs than I wanted is no tragedy for my 
type of work.

But the problem is that when any one node has a failure (not that rare when 
you're calling for 100 nodes simultaneously), SLURM by default kills the WHOLE 
LOT of jobs, and even more confusingly then restarts a bunch of them, which 
ends up in a very confusing pile of results files. I thought the --no-kill flag 
should prevent this, but instead of preventing the killing of all tasks after a 
single failure it only prevents the restart. Now I get a misleading message 
from the cluster reporting a good exit code when such a slaughter occurs, but 
when I log in I discover that all of my tasks have been massacred, all because 
just one of them failed.

I understand that for interacting jobs spread across many nodes, killing all of 
them because of one failure can be necessary, but my jobs are strictly 
parallel, with no cross-interaction between them at all. Each is an utterly 
separate simulation with different starting parameters. I need to ensure that 
if one job fails and must be killed, the rest are not affected.

I have been advised that, because the simulation software refuses to run more 
than one copy properly on any given node at once, I am NOT able to use "array 
jobs" and must stick to this sort of job, which requests 100 nodes this way.

Please can anyone suggest how to instruct SLURM not to massacre ALL my jobs 
because ONE (or a few) node(s) fails?

All my research is being put on hold by this problem, which is making large 
runs on the cluster almost impossible: a very large fraction of the jobs I 
submit have a failure on 1 of the 100 nodes, and so that very large fraction of 
my jobs gets killed on all nodes even though only one is faulty. I rarely get a 
job lasting long enough to give me a useful set of the 100(ish) result files I 
need.

P.S. just to warn you, I'm not an HPC expert or a Linux power user. I'm 
comfortable with Linux, the command line, and technical details, but will 
probably need a bit more explanation around answers than someone specialised in 
high performance computing would.

--
Thank You
Rob

P.S. I'm not sure whether this is how one is supposed to add posts to this 
google group; I sent it twice because I wasn't sure the earlier one got 
through, as I might not have been correctly subscribed at that time. Thanks.
