You are right, but I'm actually supporting the system administrator of that cluster; I'll mention this to him.
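
If it helps, my reading of what you suggest (one job reserving 36 cores for a single parallel charmm run) would look roughly like the sketch below. It is untested and partly guesswork: the job name is made up, the binary path and input/output names are simply copied from the user's current jobfile, and it assumes the OpenMPI installation is Slurm-aware and that this charmm binary actually supports parallel MPI runs.

#!/bin/bash
# Hypothetical job name; the rest mirrors the user's existing jobfile.
#SBATCH --job-name=charmm-parallel
# Reserve 36 cores in a single allocation (the nodes have 20 cores each,
# so Slurm will spread the tasks over two or more nodes).
#SBATCH --ntasks=36
#SBATCH --cpus-per-task=1

export PATH=/usr/lib64/openmpi/bin/:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH

# One parallel charmm run across all 36 reserved cores
# (binary path and input/output names copied from the user's jobfile).
mpirun -np 36 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm \
    < newphcnl99a0.inp > newphcnl99a0.out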
Besides that, the user runs this for loop to submit the jobs:

# submit.sh
#
typeset -i i=1
typeset -i j=12500   # number of frames goes to each core = number of frames (1000000)/40 (cores) =
typeset -i k=1

while [ $i -le 36 ]   # the number of frames
do
    sbatch run-5o$i.sh $i $j $k
    i=$i+1             # number of frames goes to each node (5*200 = 1000)
done

where each run-5oXX.sh jobfile looks like this:

#!/bin/bash
#SBATCH --job-name=charmm-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

export PATH=/usr/lib64/openmpi/bin/:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH

mpirun -np 1 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm < newphcnl99a0.inp > newphcnl99a0.out

So they are all independent mpiruns... if one of them is killed, why would all the others go down as well? That would make sense if a single mpirun were running 36 tasks... but the user is not doing this. (A job-array version of this submission is sketched at the bottom of this message, below the quoted thread.)

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of John Hearns <hear...@googlemail.com>
Sent: Friday, June 29, 2018 12:52:41 PM
To: Slurm User Community List
Subject: Re: [slurm-users] All user's jobs killed at the same time on all nodes

Matteo, a stupid question, but if these are single-CPU jobs, why is mpirun being used? Is your user using these 36 jobs to construct a parallel job to run charmm? If the mpirun is killed, yes, all the other processes which are started by it on the other compute nodes will be killed.

I suspect your user is trying to do something "smart". You should give that person an example of how to reserve 36 cores and submit a charmm job.

On 29 June 2018 at 12:13, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:

Dear community,

I have a user who usually submits 36 (identical) jobs at a time using a simple for loop, so the jobs are all sbatched at the same time.

Each job requests a single core, and all jobs are independent from one another (they read different input files and write to different output files).

The jobs are then usually started during the next couple of hours, at somewhat random times.

What happens then is that after a certain amount of time (maybe from 2 to 12 hours) ALL jobs belonging to this particular user are killed by slurm on all nodes at exactly the same time.

One example:

### master: /var/log/slurmctld.log ###

[2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 InitPrio=4294185624 usec=255
...
[2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
...
[2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 1007
[2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 NodeCnt=1 successful 0x8004

### node38: /var/log/slurmd.log ###

[2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran for 0 seconds
[2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
[2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature plugin loaded
[2018-06-28T19:29:05.431] [718560.batch] debug level = 2
[2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
[2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started 2018-06-28T19:29:05
[2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of 65536 from submit host: Operation not permitted
...
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794 (charmm)
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792 (mpirun)
[2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791 (slurm_script)
[2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.4294967294
[2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38 CANCELLED AT 2018-06-28T23:37:53 ***
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794 (charmm)
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792 (mpirun)
[2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791 (slurm_script)
[2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to 718560.4294967294
[2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by signal 15.
[2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with slurm_rc = 0, job_rc = 15
[2018-06-28T23:37:53.512] [718560.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
[2018-06-28T23:37:53.516] [718560.batch] done with job

The slurm cluster has a minimal configuration:

ClusterName=cluster
ControlMachine=master
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
FastSchedule=1
SlurmUser=slurm
SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/
SlurmdSpoolDir=/var/spool/slurm/
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
Proctracktype=proctrack/linuxproc
ReturnToService=2
PropagatePrioProcess=0
PropagateResourceLimitsExcept=MEMLOCK
TaskPlugin=task/cgroup
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SlurmctldDebug=4
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=4
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none
JobAcctGatherType=jobacct_gather/cgroup
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
AccountingStorageLoc=all

NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Thank you for your help.
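
The job-array sketch mentioned above: since the 36 runs really are independent single-core jobs, a single job array could replace the whole submit loop. This is rough and untested; the per-task input/output naming is a made-up placeholder, and the real scheme would have to come from how the run-5oXX.sh files actually differ.

#!/bin/bash
#SBATCH --job-name=charmm-test
# One array task per input, matching the 36 iterations of the loop.
#SBATCH --array=1-36
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

export PATH=/usr/lib64/openmpi/bin/:$PATH
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH

# Each array task is one independent, serial charmm run. The input/output
# naming below is a hypothetical placeholder. mpirun is not needed for a
# single task, but "mpirun -np 1" could be kept in front if this charmm
# build has to be started through MPI.
/opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm \
    < input-${SLURM_ARRAY_TASK_ID}.inp > output-${SLURM_ARRAY_TASK_ID}.out

The 36 runs would then go in with a single sbatch call on that one script instead of 36 separate submissions.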