Matteo, a stupid question, but if these are single-CPU jobs why is mpirun being used at all?

Is your user using these 36 jobs to construct a parallel job to run charmm? If that mpirun is killed then yes, all the other processes it started on the other compute nodes will be killed along with it. I suspect your user is trying to do something "smart". You should give that person an example of how to reserve 36 cores and submit a single charmm job, something like the sketch below.
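A minimal sketch, untested; I'm assuming charmm was built against an MPI library that knows about Slurm, and the input/output file names are placeholders:

#!/bin/bash
#SBATCH --job-name=charmm
#SBATCH --ntasks=36            # one allocation of 36 cores
#SBATCH --cpus-per-task=1
#SBATCH --time=12:00:00        # adjust to the real runtime

# srun inherits the allocation, so no hand-built hostfile and no bare
# mpirun are needed. With MpiDefault=none you may have to add
# --mpi=pmi2 (or whatever your MPI stack expects); an Open MPI built
# with Slurm support will also pick up the allocation from a plain mpirun.
srun charmm < system.inp > system.out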
On 29 June 2018 at 12:13, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:

> Dear community,
>
> I have a user who usually submits 36 (identical) jobs at a time using a
> simple for loop, so the jobs are all sbatched at the same time.
>
> Each job requests a single core, and all jobs are independent of one
> another (they read different input files and write to different output
> files).
>
> The jobs are then usually started during the next couple of hours, at
> somewhat random times.
>
> What happens then is that after a certain amount of time (maybe from 2
> to 12 hours) ALL jobs belonging to this particular user are killed by
> slurm on all nodes at exactly the same time.
>
> One example:
>
> ### master: /var/log/slurmctld.log ###
>
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 InitPrio=4294185624 usec=255
> ...
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
> ...
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 NodeCnt=1 successful 0x8004
>
> ### node38: /var/log/slurmd.log ###
>
> [2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran for 0 seconds
> [2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
> [2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature plugin loaded
> [2018-06-28T19:29:05.431] [718560.batch] debug level = 2
> [2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
> [2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started 2018-06-28T19:29:05
> [2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of 65536 from submit host: Operation not permitted
> ...
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794 (charmm)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792 (mpirun)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791 (slurm_script)
> [2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.4294967294
> [2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38 CANCELLED AT 2018-06-28T23:37:53 ***
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794 (charmm)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792 (mpirun)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791 (slurm_script)
> [2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to 718560.4294967294
> [2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by signal 15.
> [2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with slurm_rc = 0, job_rc = 15
> [2018-06-28T23:37:53.512] [718560.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> [2018-06-28T23:37:53.516] [718560.batch] done with job
>
> The slurm cluster has a minimal configuration:
>
> ClusterName=cluster
> ControlMachine=master
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> FastSchedule=1
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/var/spool/slurm/
> SlurmdSpoolDir=/var/spool/slurm/
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> PropagatePrioProcess=0
> PropagateResourceLimitsExcept=MEMLOCK
> TaskPlugin=task/cgroup
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> SlurmctldDebug=4
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=4
> SlurmdLogFile=/var/log/slurmd.log
> JobCompType=jobcomp/none
> JobAcctGatherType=jobacct_gather/cgroup
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=master
> AccountingStorageLoc=all
> NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
> PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> Thank you for your help.
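PS: if the 36 runs really are independent single-core calculations, a job array is a cleaner replacement for the for loop, because every array task is scheduled and killed on its own. A minimal sketch, assuming the inputs are numbered (the file names here are made up):

#!/bin/bash
#SBATCH --job-name=charmm-array
#SBATCH --ntasks=1             # one core per array task
#SBATCH --array=1-36           # 36 independent tasks

# Slurm sets SLURM_ARRAY_TASK_ID for each element of the array.
charmm < input_${SLURM_ARRAY_TASK_ID}.inp > output_${SLURM_ARRAY_TASK_ID}.out

One sbatch of that script replaces the whole loop, and cancelling one array task does not touch the other 35.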