ubuntu@ip-172-31-80-232:/var/run/slurm-llnl$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  drain ip-172-31-80-232
● slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-05-11 17:30:43 UTC; 25s ago
     Docs: man:slurmd(8)
  Process: 2547 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 2567 (slurmd)
    Tasks: 1 (limit: 1121)
   CGroup: /system.slice/slurmd.service
           └─2567 /usr/sbin/slurmd

May 11 17:30:43 ip-172-31-80-232 systemd[1]: Starting Slurm node daemon...
May 11 17:30:43 ip-172-31-80-232 systemd[1]: slurmd.service: Can't open PID file /var/run/slurm-llnl/slurmd.pid (yet?) after start: No such file or directory
May 11 17:30:43 ip-172-31-80-232 systemd[1]: Started Slurm node daemon.

This looks reasonable to me?

On Mon, May 11, 2020 at 7:16 PM Alex Chekholko <a...@calicolabs.com> wrote:

> You will want to look at the output of 'sinfo' and 'scontrol show node' to
> see what slurmctld thinks about your compute nodes; then on the compute
> nodes you will want to check the status of the slurmd service ('systemctl
> status -l slurmd') and possibly read through the slurmd logs as well.
>
> On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.h...@gmail.com> wrote:
>
>> Hello;
>>
>> I am in the process of familiarizing myself with slurm - I will write a
>> piece of software which will submit jobs to a slurm cluster. Right now I
>> have just made my own "cluster" consisting of one Amazon AWS node and use
>> that to familiarize myself with the sxxx commands - that has worked nicely.
>>
>> Now I just brought this AWS node completely to its knees (not slurm
>> related) and had to stop and start the node from the AWS console - during
>> that process a job managed by slurm was killed hard. Now that the node is
>> back up again, slurm refuses to start jobs - the queue looks like this:
>>
>> ubuntu@ip-172-31-80-232:~$ squeue
>>   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>>     186     debug tmp-file www-data PD  0:00     1 (Resources)
>>     187     debug tmp-file www-data PD  0:00     1 (Resources)
>>     188     debug tmp-file www-data PD  0:00     1 (Resources)
>>     189     debug tmp-file www-data PD  0:00     1 (Resources)
>>
>> I.e. the jobs are pending for "Resources" reasons, but no jobs are
>> running? I have tried scancel on all jobs, but when I add new jobs they
>> again just stay pending. It should be said that when the node/slurm came
>> back up again, the offending job which initially created the havoc was
>> still in "Running" state, but the filesystem of that job had been
>> completely wiped, so it was not in a sane state. scancel of this job
>> worked fine - but no new jobs will start. It seems like there is a "ghost
>> job" blocking the other jobs from starting? I even tried to reinstall
>> slurm using the package manager, but the new slurm installation would
>> still not start jobs. Any tips on how I can proceed to debug this?
>>
>> Regards
>>
>> Joakim
>>
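
The 'drain' state reported by sinfo above is very likely what keeps the
queued jobs pending: a drained node is excluded from scheduling even
though slurmd itself is running. A minimal sketch of the usual follow-up,
using the node name from the session above, is to ask scontrol why
slurmctld drained the node and, once the cause (here the job killed by the
hard reboot) is cleared, return the node to service:

  # Show the node record; the "Reason=" field explains why it was drained
  scontrol show node ip-172-31-80-232

  # Return the node to service (run as root or as the SlurmUser)
  scontrol update NodeName=ip-172-31-80-232 State=RESUME

If the node is healthy it should go back to 'idle' in sinfo, and the
pending jobs should then be scheduled.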