A great detective story!

> June 15 but there is no trace of it anywhere on the disk.

Do you have the process ID (pid) of the watchdog.sh process? You could look
in /proc/<pid>/cmdline and see what that shows.
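Since the script seems to have been deleted from disk, /proc is about the
only place left to look. A minimal sketch, assuming the pid is the 1695
reported for "sh watchdog.sh" in the ps listing quoted below; recovering the
script through an open file descriptor is a guess that only works if the
shell still holds the (deleted) file open:

pid=1695   # pid of "sh watchdog.sh" from the ps listing below

# Exact command line the process was started with (NUL-separated).
tr '\0' ' ' < /proc/$pid/cmdline; echo

# Directory the script was started from.
ls -l /proc/$pid/cwd

# If the shell still holds the deleted script open, an fd will point at it
# (marked "(deleted)") and its contents can be read back from /proc.
ls -l /proc/$pid/fd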
On 2 July 2018 at 11:37, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:

> Unbelievable... and got it by chance.
>
> Jobs were killed (again) at 21:04, and in the user's list of running
> processes there was a 'sleep 50000' command (13 hours + 53 minutes +
> 20 seconds) which was fired up at exactly the same time.
>
> The watchdog.sh script (from which the sleep command is fired) was
> started on June 15, but there is no trace of it anywhere on the disk.
>
> What's in that script I don't know, but it kills all the user's jobs
> almost twice a day... and I've waited for it to do it again this
> morning at 10:57... and sure enough all jobs disappeared and a new
> sleep 50000 command was fired.
>
> Thank you all anyway!
>
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764719.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764720.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764721.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764722.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764723.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764724.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764725.out
> -rw-rw-r-- 1 moha moha 117 Jul  1 21:04 slurm-764726.out
>
> [moha@master ~]$ ps aux | grep moha
> moha      1695  0.0  0.0 113128  1416 ?      S    Jun15   0:00 sh watchdog.sh
> moha     76720  0.0  0.0 150844  2696 ?      S    Jun28   0:00 sshd: moha@pts/10
> moha     76724  0.0  0.0 116692  3532 pts/10 Ss+  Jun28   0:00 -bash
> moha    149663  0.0  0.0 150400  2240 ?      S    Jun28   0:00 sshd: moha@pts/0
> moha    149664  0.0  0.0 116692  3536 pts/0  Ss+  Jun28   0:00 -bash
> moha    156670  0.0  0.0 150400  2236 ?      S    Jun28   0:00 sshd: moha@pts/5
> moha    156671  0.0  0.0 116692  3604 pts/5  Ss+  Jun28   0:00 -bash
> moha    164364  0.0  0.0 107904   608 ?      S    21:04   0:00 sleep 50000   <<<<<<<<<<=========== !!!!
> moha    190871  0.0  0.0 116684  3472 pts/4  S    21:46   0:00 -bash
> moha    194080  0.0  0.0 151060  1820 pts/4  R+   21:52   0:00 ps aux
> moha    194081  0.0  0.0 112664   972 pts/4  S+   21:52   0:00 grep --color=auto moha
>
> ________________________________
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Thomas M. Payerle <paye...@umd.edu>
> Sent: Friday, June 29, 2018 7:34:09 PM
> To: Slurm User Community List
> Subject: Re: [slurm-users] All user's jobs killed at the same time on all nodes
>
> A couple of comments/possible suggestions.
>
> First, it looks to me like all the jobs are run from the same directory
> with the same input/output files. Or am I missing something?
>
> Also, what MPI library is being used?
>
> I would suggest verifying whether any of the jobs in question are
> terminating normally, i.e. whether the mysterious issue that is causing
> all the user's jobs to terminate is triggered by the completion of one
> of the jobs.
>
> I recall having an issue years ago with MPICH MPI libraries when running
> multiple MPI jobs from the same user on the same node. IIRC, when one job
> terminated (usually successfully), it would call mpdallexit, which would
> happily kill all the mpds for that user on that node, making the other
> MPI jobs that user had on that node quite unhappy. The solution was to
> set the environment variable MPD_CON_EXT to a unique value for each of
> the jobs. See e.g.
> https://lists.mcs.anl.gov/pipermail/mpich-discuss/2008-May/003605.html
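A minimal sketch of that MPD_CON_EXT workaround inside a Slurm batch script,
assuming an MPICH/MPD-based stack; the variable name comes from the message
above, while deriving the value from the Slurm job ID and the program name
are only illustrative choices:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# Give each job its own MPD console suffix so that mpdallexit from one
# finishing job cannot tear down the mpd ring used by another job of the
# same user on the same node.  Using the job ID here is an assumption.
export MPD_CON_EXT="${SLURM_JOB_ID}"

mpirun -np 1 ./my_mpich_program   # hypothetical program name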
>
> My users primarily use OpenMPI, so I do not have much recent experience
> with this issue. IIRC, this issue only impacted other MPI jobs run by the
> same user on the same node, so it is a bit different from the symptoms as
> you describe them (impacting all MPI jobs run by the same user on ANY
> node), but since there is some similarity in the symptoms I thought I
> would mention it anyway.
>
> On Fri, Jun 29, 2018 at 7:24 AM, John Hearns <hear...@googlemail.com> wrote:
>
> I have got this all wrong. Paddy Doyle has got it right.
>
> However, are you SURE that mpirun is not creating tasks on the other
> machines? I would look at the compute nodes while the job is running
> and do
>
> ps -eaf --forest
>
> Also, using mpirun to run a single core gives me the heebie-jeebies...
>
> https://en.wikipedia.org/wiki/Heebie-jeebies_(idiom)
>
> On 29 June 2018 at 13:16, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:
>
> You are right, but I'm actually supporting the system administrator of
> that cluster; I'll mention this to him.
>
> Besides that, the user runs this for loop to submit the jobs:
>
> # submit.sh #
>
> typeset -i i=1
> typeset -i j=12500  # number of frames that go to each core = total frames (1000000) / 40 (cores)
> typeset -i k=1
>
> while [ $i -le 36 ]  # the number of frames
> do
>   sbatch run-5o$i.sh $i $j $k
>   i=$i+1              # number of frames that go to each node (5*200 = 1000)
> done
>
> where each run-5oXX.sh jobfile looks like this:
>
> #!/bin/bash
>
> #SBATCH --job-name=charmm-test
> #SBATCH --nodes=1
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=1
>
> export PATH=/usr/lib64/openmpi/bin/:$PATH
> export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
>
> mpirun -np 1 /opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm < newphcnl99a0.inp > newphcnl99a0.out
>
> So they are all independent mpiruns... if one of them is killed, why
> would all the others go down as well?
>
> That would make sense if a single mpirun were running 36 tasks... but
> the user is not doing this.
>
> ________________________________
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of John Hearns <hear...@googlemail.com>
> Sent: Friday, June 29, 2018 12:52:41 PM
> To: Slurm User Community List
> Subject: Re: [slurm-users] All user's jobs killed at the same time on all nodes
>
> Matteo, a stupid question, but if these are single-CPU jobs why is mpirun
> being used?
>
> Is your user using these 36 jobs to construct a parallel job to run
> charmm? If the mpirun is killed then, yes, all the other processes it
> started on the other compute nodes will be killed.
>
> I suspect your user is trying to do something "smart". You should give
> that person an example of how to reserve 36 cores and submit a charmm
> job.
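For reference, a minimal sketch of the kind of submission that suggestion
points at, written here as a Slurm job array of 36 independent single-core
tasks rather than 36 separate sbatch calls; the charmm path comes from the
quoted job file, while the per-task input/output names are made up for
illustration:

#!/bin/bash
#SBATCH --job-name=charmm-frames
#SBATCH --array=1-36
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# One array task per chunk of frames; charmm runs serially here, so no
# mpirun wrapper is needed.  frame_N.inp / frame_N.out are hypothetical
# per-task file names.
/opt/cluster/programs/charmm/c42b2/exec/gnu_M/charmm \
    < frame_${SLURM_ARRAY_TASK_ID}.inp \
    > frame_${SLURM_ARRAY_TASK_ID}.out

Submitted once with sbatch, this queues 36 array tasks that the scheduler
can place independently, instead of 36 unrelated jobs each wrapped in its
own mpirun.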
> On 29 June 2018 at 12:13, Matteo Guglielmi <matteo.guglie...@dalco.ch> wrote:
>
> Dear community,
>
> I have a user who usually submits 36 (identical) jobs at a time using a
> simple for loop, so the jobs are all sbatched at the same time.
>
> Each job requests a single core, and all jobs are independent from one
> another (they read different input files and write to different output
> files).
>
> Jobs are then usually started during the next couple of hours, at
> somewhat random times.
>
> What happens then is that after a certain amount of time (maybe from 2 to
> 12 hours), ALL jobs belonging to this particular user are killed by Slurm
> on all nodes at exactly the same time.
>
> One example:
>
> ### master: /var/log/slurmctld.log ###
>
> [2018-06-28T18:43:06.871] _slurm_rpc_submit_batch_job: JobId=718560 InitPrio=4294185624 usec=255
> ...
> [2018-06-28T19:29:04.671] backfill: Started JobID=718560 in partition on node38
> ...
> [2018-06-28T23:37:53.471] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 718560 uid 1007
> [2018-06-28T23:37:53.472] _job_signal: 9 of running JobID=718560 State=0x8004 NodeCnt=1 successful 0x8004
>
> ### node38: /var/log/slurmd.log ###
>
> [2018-06-28T19:29:05.410] _run_prolog: prolog with lock for job 718560 ran for 0 seconds
> [2018-06-28T19:29:05.410] Launching batch job 718560 for UID 1007
> [2018-06-28T19:29:05.427] [718560.batch] Munge cryptographic signature plugin loaded
> [2018-06-28T19:29:05.431] [718560.batch] debug level = 2
> [2018-06-28T19:29:05.431] [718560.batch] starting 1 tasks
> [2018-06-28T19:29:05.431] [718560.batch] task 0 (69791) started 2018-06-28T19:29:05
> [2018-06-28T19:29:05.440] [718560.batch] Can't propagate RLIMIT_NOFILE of 65536 from submit host: Operation not permitted
> ...
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69794 (charmm)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69792 (mpirun)
> [2018-06-28T23:37:53.480] [718560.batch] Sending signal 18 to pid 69791 (slurm_script)
> [2018-06-28T23:37:53.480] [718560.batch] Sent signal 18 to 718560.4294967294
> [2018-06-28T23:37:53.485] [718560.batch] error: *** JOB 718560 ON node38 CANCELLED AT 2018-06-28T23:37:53 ***
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69794 (charmm)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69792 (mpirun)
> [2018-06-28T23:37:53.488] [718560.batch] Sending signal 15 to pid 69791 (slurm_script)
> [2018-06-28T23:37:53.488] [718560.batch] Sent signal 15 to 718560.4294967294
> [2018-06-28T23:37:53.492] [718560.batch] task 0 (69791) exited. Killed by signal 15.
> [2018-06-28T23:37:53.512] [718560.batch] job 718560 completed with slurm_rc = 0, job_rc = 15
> [2018-06-28T23:37:53.512] [718560.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> [2018-06-28T23:37:53.516] [718560.batch] done with job
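As an aside on the slurmctld.log excerpt above: the REQUEST_KILL_JOB line
records the uid that issued the cancel, so it can be tied back to an account
and to the job's accounting record. A small sketch, assuming a standard
glibc/Slurm setup with accounting (slurmdbd) enabled, using the uid and job
ID from the log:

# Which account does the uid in the REQUEST_KILL_JOB line belong to?
getent passwd 1007

# How did the job end according to the accounting database?
sacct -j 718560 --format=JobID,User,State,ExitCode,Start,End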
>
> The Slurm cluster has a minimal configuration:
>
> ClusterName=cluster
> ControlMachine=master
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> FastSchedule=1
> SlurmUser=slurm
> SlurmdUser=root
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> StateSaveLocation=/var/spool/slurm/
> SlurmdSpoolDir=/var/spool/slurm/
> SwitchType=switch/none
> MpiDefault=none
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmdPidFile=/var/run/slurmd.pid
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> PropagatePrioProcess=0
> PropagateResourceLimitsExcept=MEMLOCK
> TaskPlugin=task/cgroup
> SlurmctldTimeout=300
> SlurmdTimeout=300
> InactiveLimit=0
> MinJobAge=300
> KillWait=30
> Waittime=0
> SlurmctldDebug=4
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=4
> SlurmdLogFile=/var/log/slurmd.log
> JobCompType=jobcomp/none
> JobAcctGatherType=jobacct_gather/cgroup
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=master
> AccountingStorageLoc=all
> NodeName=node[01-45] Sockets=2 CoresPerSocket=10 State=UNKNOWN
> PartitionName=partition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> Thank you for your help.
>
> --
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads        paye...@umd.edu
> 5825 University Research Park            (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831