Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-21 Thread Dean Schulze
Thank you, thank you, thank you. It was the firewall on CentOS 7. Once I disabled that, it worked. For anyone else who runs into this issue, here is how to disable the firewall on CentOS 7: https://linuxize.com/post/how-to-stop-and-disable-firewalld-on-centos-7/

On Tue, Jan 21, 2020 at 7:24 AM
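
[Editor's sketch, not part of Dean's message.] For readers who want the fix in script form, here is a minimal Python sketch. It assumes the usual systemd commands for stopping firewalld and keeping it off after a reboot; the linked article may differ in detail. The names FIREWALLD_OFF and disable_firewalld are made up for illustration, and the function defaults to a dry run that only reports what it would execute.

```python
import subprocess
from typing import List

# Assumed commands: stop firewalld now, then keep it from starting at boot.
FIREWALLD_OFF = [
    ["systemctl", "stop", "firewalld"],
    ["systemctl", "disable", "firewalld"],
]


def disable_firewalld(dry_run: bool = True) -> List[str]:
    """Return the commands as strings; execute them only if dry_run is False."""
    issued = []
    for cmd in FIREWALLD_OFF:
        issued.append(" ".join(cmd))
        if not dry_run:
            # Requires root; raises CalledProcessError if a command fails.
            subprocess.run(cmd, check=True)
    return issued
```

Calling disable_firewalld() with no arguments changes nothing on the system and simply returns the two command lines, so they can be reviewed before running with dry_run=False as root.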

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-21 Thread Brian Johanson
On 1/21/2020 12:32 AM, Chris Samuel wrote:
On 20/1/20 3:00 pm, Dean Schulze wrote:
There's either a problem with the source code I cloned from github, or there is a problem when the controller runs on Ubuntu 19 and the node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to see

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Chris Samuel
On 20/1/20 3:00 pm, Dean Schulze wrote:
There's either a problem with the source code I cloned from github, or there is a problem when the controller runs on Ubuntu 19 and the node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to see if that solves the problem.

I've run the ma

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Ryan Novosielski
The node is not getting the status from itself; it's querying the slurmctld to ask for its status.

-- Ryan Novosielski, Sr. Technologist, Rutgers, the State University

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
There's either a problem with the source code I cloned from github, or there is a problem when the controller runs on Ubuntu 19 and the node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to see if that solves the problem.

On Mon, Jan 20, 2020 at 3:41 PM Carlos Fenoy wrote:
> It se

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Carlos Fenoy
It seems to me that the problem is between the slurmctld and slurmd. When slurmd starts, it sends a message to the slurmctld, which is why the node appears idle. Every now and then the slurmctld will try to ping the slurmd to check if it's still alive. This ping doesn't seem to be working, so as I mentioned

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Brian Andrus
Check the slurmd log file on the node. Ensure slurmd is still running. Sounds possible that OOM Killer or such may be killing slurmd.

Brian Andrus

On 1/20/2020 1:12 PM, Dean Schulze wrote:
If I restart slurmd the asterisk goes away. Then I can run the job once and the asterisk is back, and t

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
If I restart slurmd the asterisk goes away. Then I can run the job once and the asterisk is back, and the node remains in comp*:

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug*    up    infinite  1     idle  liqidos-dean-node1
[liqid@liqidos-dean-no
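
[Editor's sketch, not part of Dean's message.] The asterisk that matters here is the one on the STATE column (idle*, comp*), which sinfo uses to flag a node that is not responding. The asterisk on the partition name debug* is unrelated: it marks the default partition. A small Python helper, with the hypothetical name split_state, shows how a script reading sinfo output might separate the two pieces of information. It handles only the "*" suffix and ignores the other state suffixes sinfo can print.

```python
from typing import Tuple


def split_state(state: str) -> Tuple[str, bool]:
    """Split an sinfo STATE value into (base_state, responding).

    A trailing '*' means the node is currently not responding.
    """
    if state.endswith("*"):
        return state[:-1], False
    return state, True
```

For the states mentioned in this thread, split_state("idle") gives ("idle", True) and split_state("comp*") gives ("comp", False).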

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Dean Schulze
If I run sinfo on the node itself it shows an asterisk. How can the node be unreachable from itself?

On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy wrote:
> Hi,
>
> The * next to the idle status in sinfo means that the node is
> unreachable/not responding. Check the status of the slurmd on the no

Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle

2020-01-20 Thread Carlos Fenoy
Hi,

The * next to the idle status in sinfo means that the node is unreachable/not responding. Check the status of the slurmd on the node and check the connectivity from the slurmctld host to the compute node (telnet may be enough). You can also check the slurmctld logs for more information.

Regar
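
[Editor's sketch, not part of Carlos's message.] The telnet check suggested above can also be done with a few lines of Python run on the slurmctld host. This is a minimal sketch: the function name can_connect is made up for illustration, and the port numbers are Slurm's defaults (6817 for SlurmctldPort, 6818 for SlurmdPort), which a site may have changed in slurm.conf.

```python
import socket

SLURMCTLD_PORT = 6817  # Slurm default for SlurmctldPort; confirm in slurm.conf
SLURMD_PORT = 6818     # Slurm default for SlurmdPort; confirm in slurm.conf


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running can_connect("liqidos-dean-node1", SLURMD_PORT) from the controller (node name taken from the sinfo output in this thread) tests the path the slurmctld ping takes. A False result while slurmd is running on the node points toward something in between, such as a firewall.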