Either there's a problem with the source code I cloned from GitHub, or there's a problem when the controller runs on Ubuntu 19.10 and the node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to see if that solves the problem.
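In case it's useful, roughly what I have in mind is below. This is a sketch, not a recipe: the slurm-19.05 release branch on SchedMD's GitHub is what I plan to build from, and the --prefix matches the /usr/local/bin/scontrol path in the thread below; adjust both for your own install.

$ git clone -b slurm-19.05 https://github.com/SchedMD/slurm.git   # or check out a specific 19.05.x tag
$ cd slurm
$ ./configure --prefix=/usr/local
$ make -j$(nproc)
$ sudo make install
# then restart slurmctld on the controller and slurmd on the node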
On Mon, Jan 20, 2020 at 3:41 PM Carlos Fenoy <mini...@gmail.com> wrote:

> It seems to me that the problem is between the slurmctld and the slurmd. When slurmd starts it sends a message to the slurmctld, which is why the node appears idle. Every now and then the slurmctld will try to ping the slurmd to check that it is still alive. This ping doesn't seem to be working, so, as I mentioned previously, check the slurmctld log and the connectivity between the slurmctld node and the slurmd node.
>
> On Mon, 20 Jan 2020, 22:43 Brian Andrus, <toomuc...@gmail.com> wrote:
>
>> Check the slurmd log file on the node.
>>
>> Ensure slurmd is still running. It sounds possible that the OOM killer or something similar is killing slurmd.
>>
>> Brian Andrus
>>
>> On 1/20/2020 1:12 PM, Dean Schulze wrote:
>>
>> If I restart slurmd the asterisk goes away. Then I can run the job once, the asterisk is back, and the node remains in comp*:
>>
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1   idle liqidos-dean-node1
>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>> liqidos-dean-node1
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1  comp* liqidos-dean-node1
>>
>> I can get it back to idle* with scontrol:
>>
>> [liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update NodeName=liqidos-dean-node1 State=down Reason=none
>> [liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update NodeName=liqidos-dean-node1 State=resume
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1  idle* liqidos-dean-node1
>>
>> I'm beginning to wonder if I got some bad code from github.
>>
>> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy <mini...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> The * next to the idle status in sinfo means that the node is unreachable/not responding. Check the status of the slurmd on the node and check the connectivity from the slurmctld host to the compute node (telnet may be enough). You can also check the slurmctld logs for more information.
>>>
>>> Regards,
>>> Carlos
>>>
>>> On Mon, 20 Jan 2020 at 21:04, Dean Schulze <dean.w.schu...@gmail.com> wrote:
>>>
>>>> I've got a node running on CentOS 7.7, built from the recent 20.02.0pre1 code base. Its behavior is strange, to say the least.
>>>>
>>>> The controller was built from the same code base, but on Ubuntu 19.10. The controller reports the node's state with sinfo, but can't run a simple job with srun because it thinks the node isn't available, even when it is idle. (And squeue shows an empty queue.)
>>>>
>>>> On the controller:
>>>>
>>>> $ srun -N 1 hostname
>>>> srun: Required node not available (down, drained or reserved)
>>>> srun: job 30 queued and waiting for resources
>>>> ^Csrun: Job allocation 30 has been revoked
>>>> srun: Force Terminated job 30
>>>> $ sinfo
>>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>> debug*       up   infinite      1  idle* liqidos-dean-node1
>>>> $ squeue
>>>>   JOBID PARTITION   USER ST   TIME  NODES NODELIST(REASON)
>>>>
>>>> When I try to run the simple job on the node I get:
>>>>
>>>> [liqid@liqidos-dean-node1 ~]$ sinfo
>>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>> debug*       up   infinite      1  idle* liqidos-dean-node1
>>>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>>>> srun: Required node not available (down, drained or reserved)
>>>> srun: job 27 queued and waiting for resources
>>>> ^Csrun: Job allocation 27 has been revoked
>>>> [liqid@liqidos-dean-node1 ~]$ squeue
>>>>   JOBID PARTITION   USER ST   TIME  NODES NODELIST(REASON)
>>>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>>>> srun: Required node not available (down, drained or reserved)
>>>> srun: job 28 queued and waiting for resources
>>>> ^Csrun: Job allocation 28 has been revoked
>>>> [liqid@liqidos-dean-node1 ~]$ sinfo
>>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>>> debug*       up   infinite      1  idle* liqidos-dean-node1
>>>>
>>>> Apparently slurm thinks there are a bunch of jobs queued, but shows an empty queue. How do I get rid of these?
>>>>
>>>> If these zombie jobs aren't the problem, what else could be keeping this from running?
>>>>
>>>> Thanks.
>>>>
>>> --
>>> Carles Fenoy
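For reference, the checks suggested in this thread amount to something like the following. This assumes the default SlurmdPort of 6818, that slurmd is managed by systemd, and a typical slurmctld log path; adjust for whatever your slurm.conf actually sets.

# On the compute node: is slurmd still running, and did the OOM killer get it?
$ systemctl status slurmd            # or: pgrep -a slurmd
$ dmesg | grep -i -E 'oom|killed process'
$ sudo slurmd -D -vvv                # run in the foreground to watch it register with slurmctld

# From the controller: can it reach slurmd on the node? (6818 is the default SlurmdPort)
$ telnet liqidos-dean-node1 6818

# Controller-side node state and logs
$ scontrol show node liqidos-dean-node1       # check the State= and Reason= fields
$ scontrol show config | grep -i logfile      # find SlurmctldLogFile / SlurmdLogFile
$ sudo tail -f /var/log/slurmctld.log         # path assumed; use the value from the line above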