The node is not getting that status from itself; when you run sinfo on the node, it queries slurmctld for the node's status. So the asterisk reflects the controller's view that slurmd on that node is not responding, not whether the node can reach itself. (Carlos's suggestions below still apply; a rough command checklist follows after the quoted thread.)
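For example, a quick way to see this from the node itself (using the node name from this thread; both commands talk to slurmctld, not to the local slurmd):

# Both of these ask slurmctld, not the local slurmd:
$ scontrol ping                          # reports whether the controller is reachable from this host
$ scontrol show node liqidos-dean-node1  # the controller's view of the node (State, Reason if set)

If the controller still reports the node as not responding, that matches the idle* state Carlos describes below, regardless of where sinfo is run.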
--
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Jan 20, 2020, at 3:56 PM, Dean Schulze <dean.w.schu...@gmail.com> wrote:
>
> If I run sinfo on the node itself it shows an asterisk. How can the node be unreachable from itself?
>
> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy <mini...@gmail.com> wrote:
> Hi,
>
> The * next to the idle status in sinfo means that the node is unreachable/not responding. Check the status of the slurmd on the node and check the connectivity from the slurmctld host to the compute node (telnet may be enough). You can also check the slurmctld logs for more information.
>
> Regards,
> Carlos
>
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze <dean.w.schu...@gmail.com> wrote:
> I've got a node running CentOS 7.7, built from the recent 20.02.0pre1 code base. Its behavior is strange, to say the least.
>
> The controller was built from the same code base, but on Ubuntu 19.10. The controller reports the node's state with sinfo, but can't run a simple job with srun because it thinks the node isn't available, even when it is idle. (And squeue shows an empty queue.)
>
> On the controller:
> $ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 30 queued and waiting for resources
> ^Csrun: Job allocation 30 has been revoked
> srun: Force Terminated job 30
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      1  idle* liqidos-dean-node1
> $ squeue
>   JOBID PARTITION     USER ST       TIME  NODES NODELIST(REASON)
>
> When I try to run the simple job on the node I get:
>
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      1  idle* liqidos-dean-node1
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 27 queued and waiting for resources
> ^Csrun: Job allocation 27 has been revoked
> [liqid@liqidos-dean-node1 ~]$ squeue
>   JOBID PARTITION     USER ST       TIME  NODES NODELIST(REASON)
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 28 queued and waiting for resources
> ^Csrun: Job allocation 28 has been revoked
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      1  idle* liqidos-dean-node1
>
> Apparently Slurm thinks there are a bunch of jobs queued, but shows an empty queue. How do I get rid of these?
>
> If these zombie jobs aren't the problem, what else could be keeping this from running?
>
> Thanks.
> --
> Carles Fenoy
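Following up on Carlos's suggestions quoted above, here is a rough checklist. It assumes a systemd-managed slurmd and the default ports (SlurmctldPort 6817, SlurmdPort 6818); adjust the node name, ports, and log paths for your slurm.conf:

# On the compute node: is slurmd running and able to register with the controller?
$ systemctl status slurmd
$ sudo slurmd -Dvvv                      # if it won't start or register, run it in the foreground with verbose logging

# From the controller host: can slurmctld reach slurmd on the node?
$ nc -vz liqidos-dean-node1 6818         # default SlurmdPort; telnet works too, as Carlos notes

# Find the log files on both sides (SlurmdLogFile / SlurmctldLogFile):
$ scontrol show config | grep -i logfile

# Once slurmd is reachable again, clear the not-responding state:
$ scontrol update nodename=liqidos-dean-node1 state=resume

Once the node stops showing idle*, srun should be able to place jobs on it; anything still sitting in the queue can be removed with scancel.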