I've got a node running on CentOS 7.7, built from the recent 20.02.0pre1 code base. Its behavior is strange, to say the least.
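(For reference, since this is a source build, the way I'm confirming what the node is actually running is the daemon's own version flag; slurmd here is the stock daemon binary, and -V just prints its version string:)

# On the node: report the slurmd build version
$ slurmd -V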
The controller was built from the same code base, but on Ubuntu 19.10. The controller reports the node's state with sinfo, but it can't run a simple job with srun because it thinks the node isn't available, even when it is idle. (And squeue shows an empty queue.)

On the controller:

$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 30 queued and waiting for resources
^Csrun: Job allocation 30 has been revoked
srun: Force Terminated job 30

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1

$ squeue
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)

When I try to run the same simple job on the node itself I get:

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 27 queued and waiting for resources
^Csrun: Job allocation 27 has been revoked
[liqid@liqidos-dean-node1 ~]$ squeue
JOBID PARTITION USER ST TIME NODES NODELIST(REASON)
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 28 queued and waiting for resources
^Csrun: Job allocation 28 has been revoked
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1

Apparently Slurm thinks there are a bunch of jobs queued, yet squeue shows an empty queue. How do I get rid of these? And if these zombie jobs aren't the problem, what else could be keeping jobs from running? Thanks.
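P.S. For completeness, the commands I plan to try next for clearing stuck jobs and inspecting the node state are below; the node name and job ID are from my transcripts above, and I'm assuming the stock Slurm client tools:

# Show the controller's full record for the node, including State and Reason
$ scontrol show node liqidos-dean-node1

# List the reason any node is in a down, drained, or failing state
$ sinfo -R

# List jobs across all partitions, in case the default squeue view hides any
$ squeue -a

# Cancel a specific leftover job by ID (e.g. job 30 from above)
$ scancel 30

# If the node is stuck in a bad state, ask the controller to return it to service
$ scontrol update NodeName=liqidos-dean-node1 State=RESUME

If none of these is the right tool here, a pointer to the right one would be much appreciated.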