The node is not getting that status from itself; when you run sinfo on the node, it queries slurmctld for the node's status. So the asterisk reflects the controller's view that slurmd on that node is not responding, not whether the node can reach itself. (Carlos's suggestions below still apply; a rough command checklist follows after the quoted thread.)
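For example, a quick way to see this from the node itself (using the node name from this thread; both commands talk to slurmctld, not to the local slurmd):

# Both of these ask slurmctld, not the local slurmd:
$ scontrol ping                          # reports whether the controller is reachable from this host
$ scontrol show node liqidos-dean-node1  # the controller's view of the node (State, Reason if set)

If the controller still reports the node as not responding, that matches the idle* state Carlos describes below, regardless of where sinfo is run.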
--
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Jan 20, 2020, at 3:56 PM, Dean Schulze <dean.w.schu...@gmail.com> wrote:
>
> If I run sinfo on the node itself it shows an asterisk. How can the node be unreachable from itself?
>
> On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy <mini...@gmail.com> wrote:
> Hi,
>
> The * next to the idle status in sinfo means that the node is unreachable/not responding. Check the status of the slurmd on the node and check the connectivity from the slurmctld host to the compute node (telnet may be enough). You can also check the slurmctld logs for more information.
>
> Regards,
> Carlos
>
> On Mon, 20 Jan 2020 at 21:04, Dean Schulze <dean.w.schu...@gmail.com> wrote:
> I've got a node running CentOS 7.7, built from the recent 20.02.0pre1 code base. Its behavior is strange, to say the least.
>
> The controller was built from the same code base, but on Ubuntu 19.10. The controller reports the node's state with sinfo, but can't run a simple job with srun because it thinks the node isn't available, even when it is idle. (And squeue shows an empty queue.)
>
> On the controller:
> $ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 30 queued and waiting for resources
> ^Csrun: Job allocation 30 has been revoked
> srun: Force Terminated job 30
> $ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      1  idle* liqidos-dean-node1
> $ squeue
>   JOBID PARTITION     USER ST       TIME  NODES NODELIST(REASON)
>
> When I try to run the simple job on the node I get:
>
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      1  idle* liqidos-dean-node1
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 27 queued and waiting for resources
> ^Csrun: Job allocation 27 has been revoked
> [liqid@liqidos-dean-node1 ~]$ squeue
>   JOBID PARTITION     USER ST       TIME  NODES NODELIST(REASON)
> [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 28 queued and waiting for resources
> ^Csrun: Job allocation 28 has been revoked
> [liqid@liqidos-dean-node1 ~]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      1  idle* liqidos-dean-node1
>
> Apparently Slurm thinks there are a bunch of jobs queued, but shows an empty queue. How do I get rid of these?
>
> If these zombie jobs aren't the problem, what else could be keeping this from running?
>
> Thanks.
> --
> Carles Fenoy
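Following up on Carlos's suggestions quoted above, here is a rough checklist. It assumes a systemd-managed slurmd and the default ports (SlurmctldPort 6817, SlurmdPort 6818); adjust the node name, ports, and log paths for your slurm.conf:

# On the compute node: is slurmd running and able to register with the controller?
$ systemctl status slurmd
$ sudo slurmd -Dvvv                      # if it won't start or register, run it in the foreground with verbose logging

# From the controller host: can slurmctld reach slurmd on the node?
$ nc -vz liqidos-dean-node1 6818         # default SlurmdPort; telnet works too, as Carlos notes

# Find the log files on both sides (SlurmdLogFile / SlurmctldLogFile):
$ scontrol show config | grep -i logfile

# Once slurmd is reachable again, clear the not-responding state:
$ scontrol update nodename=liqidos-dean-node1 state=resume

Once the node stops showing idle*, srun should be able to place jobs on it; anything still sitting in the queue can be removed with scancel.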