I've got a node running on CentOS 7.7, built from the recent 20.02.0pre1
code base.  Its behavior is strange, to say the least.

The controller was built from the same code base, but on Ubuntu 19.10.  The
controller reports the node's state with sinfo, but it can't run a simple
job with srun because it thinks the node isn't available, even when the
node is idle.  (And squeue shows an empty queue.)

On the controller:
$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 30 queued and waiting for resources
^Csrun: Job allocation 30 has been revoked
srun: Force Terminated job 30
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1
$ squeue
             JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON)

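If I'm reading the sinfo output right, the asterisk in "idle*" means the
controller thinks the node is not responding.  Assuming the standard Slurm
tools (the node name is the one from my transcript above), this is what I
plan to check next:

$ scontrol show node liqidos-dean-node1   # controller-side view of the node, including State and Reason
$ sinfo -R                                # reason strings for down/drained/non-responding nodes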

When I try to run the simple job on the node, I get:

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 27 queued and waiting for resources
^Csrun: Job allocation 27 has been revoked
[liqid@liqidos-dean-node1 ~]$ squeue
             JOBID  PARTITION      USER  ST        TIME   NODES NODELIST(REASON)
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
srun: Required node not available (down, drained or reserved)
srun: job 28 queued and waiting for resources
^Csrun: Job allocation 28 has been revoked
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1

Apparently Slurm thinks jobs are queued (srun reports each one as queued
and waiting for resources), but squeue shows an empty queue.  How do I get
rid of these?
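In case something is hiding in a non-default state, I assume the right way
to check is something like this (flags per the squeue/scancel man pages):

$ squeue --states=all      # list jobs in every state, not just pending/running
$ scancel 27 28 30         # cancel the job IDs from the attempts above, if any still exist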

If these zombie jobs aren't the problem, what else could be keeping this
from running?
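For what it's worth, my working assumption is the usual suspects for a node
stuck at idle*: slurmd not registering with the controller, or munge auth
failing between the Ubuntu controller and the CentOS node.  On the node I
was going to verify:

$ systemctl status slurmd   # is slurmd actually running?
$ scontrol ping             # can this host reach slurmctld?
$ munge -n | unmunge        # does a munge credential round-trip locally?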

Thanks.
