If I run sinfo on the node itself it shows an asterisk. How can the node be unreachable from itself?
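(Note: sinfo always reports slurmctld's view of the cluster, so the asterisk appears no matter which host you run it on; it means slurmctld cannot reach the slurmd on that node, not that the node cannot reach itself. Concretely, the checks Carlos suggests below would look something like the following — a sketch assuming systemd-managed daemons and Slurm's default slurmd port 6818; service names, ports, and log paths vary by site.)

# On the compute node: is slurmd running and healthy?
$ systemctl status slurmd
$ sudo tail /var/log/slurmd.log      # actual path is SlurmdLogFile in slurm.conf

# From the slurmctld host: can the controller reach slurmd's port?
$ telnet liqidos-dean-node1 6818     # 6818 is the default SlurmdPort

# From anywhere: how does the controller see the node?
$ scontrol show node liqidos-dean-node1    # check the State= and Reason= fields

# On the controller: what is slurmctld logging about the node?
$ sudo tail /var/log/slurmctld.log   # actual path is SlurmctldLogFile in slurm.conf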
On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy <mini...@gmail.com> wrote:

> Hi,
>
> The * next to the idle status in sinfo means that the node is
> unreachable/not responding. Check the status of the slurmd on the node and
> check the connectivity from the slurmctld host to the compute node (telnet
> may be enough). You can also check the slurmctld logs for more information.
>
> Regards,
> Carlos
>
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze <dean.w.schu...@gmail.com>
> wrote:
>
>> I've got a node running on CentOS 7.7 built from the recent 20.02.0pre1
>> code base. Its behavior is strange, to say the least.
>>
>> The controller was built from the same code base, but on Ubuntu 19.10.
>> The controller reports the node's state with sinfo, but can't run a simple
>> job with srun because it thinks the node isn't available, even when it is
>> idle. (And squeue shows an empty queue.)
>>
>> On the controller:
>> $ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 30 queued and waiting for resources
>> ^Csrun: Job allocation 30 has been revoked
>> srun: Force Terminated job 30
>> $ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1  idle* liqidos-dean-node1
>> $ squeue
>>   JOBID  PARTITION  USER  ST  TIME  NODES  NODELIST(REASON)
>>
>> When I try to run the simple job on the node I get:
>>
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1  idle* liqidos-dean-node1
>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 27 queued and waiting for resources
>> ^Csrun: Job allocation 27 has been revoked
>> [liqid@liqidos-dean-node1 ~]$ squeue
>>   JOBID  PARTITION  USER  ST  TIME  NODES  NODELIST(REASON)
>> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 28 queued and waiting for resources
>> ^Csrun: Job allocation 28 has been revoked
>> [liqid@liqidos-dean-node1 ~]$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> debug*       up   infinite      1  idle* liqidos-dean-node1
>>
>> Apparently Slurm thinks there are a bunch of jobs queued, but shows an
>> empty queue. How do I get rid of these?
>>
>> If these zombie jobs aren't the problem, what else could be keeping this
>> from running?
>>
>> Thanks.
>>
> --
> Carles Fenoy
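(For completeness: the srun jobs above were revoked by Ctrl-C, so there may be nothing left to cancel, but if stale jobs really are pending, something like this would clear them and return the node to service once slurmd is reachable again — job IDs, user, and node name taken from the output above:)

$ squeue --all                      # confirm whether anything is actually pending
$ scancel 27 28 30                  # cancel specific jobs by ID
$ scancel --user=liqid              # or cancel every job owned by the user
$ sudo scontrol update NodeName=liqidos-dean-node1 State=RESUME
                                    # clear a not-responding/DOWN state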