Re: [slurm-users] status of cloud nodes

2019-06-20 Thread nathan norton
On 20/6/19 3:24 am, Brian Andrus wrote:
> Can you give the exact command/output you have from this? I suspect a typo in your slurm.conf for nodenames or what you are typing.
> Brian Andrus

Hi Brian, I am pretty sure there is no error in my typing of the commands, but just in case find below t

Re: [slurm-users] status of cloud nodes

2019-06-19 Thread Christopher Samuel
On 6/18/19 11:29 PM, nathan norton wrote:
> Without knowing the internals of slurm it feels like nodes that are turned off+cloud state don't exist in the system until they are on?

Not quite, they exist internally but are not exposed until in use: https://slurm.schedmd.com/elastic_computing.html
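
A possibly related setting, not mentioned in this thread, is PrivateData in slurm.conf: its documentation lists a "cloud" value that makes powered-down cloud nodes visible. The sketch below only prints the current value from the running configuration; whether that setting explains the "not found" message here is an assumption, and PATTERN is a hypothetical variable name.

```shell
#!/bin/sh
# Show the PrivateData line of the running Slurm configuration.
# PATTERN is a hypothetical helper variable, not part of Slurm.
PATTERN='^PrivateData'

if command -v scontrol >/dev/null 2>&1; then
    # "scontrol show config" prints "Key = value" lines; keep only PrivateData.
    scontrol show config | grep -i "$PATTERN"
else
    echo "scontrol not found; run this on a host with the Slurm client tools" >&2
fi
```

If the value does not include "cloud", adding it to PrivateData in slurm.conf and reconfiguring may be worth testing on a non-production cluster first.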

Re: [slurm-users] status of cloud nodes

2019-06-19 Thread Brian Andrus
Can you give the exact command/output you have from this? I suspect a typo in your slurm.conf for nodenames or what you are typing.
Brian Andrus

On 6/18/2019 11:29 PM, nathan norton wrote:
> Hi, It just shows "Node $NODE not found" Whereas others all work as expected (ie, they are running) W

Re: [slurm-users] status of cloud nodes

2019-06-18 Thread nathan norton
Hi, It just shows "Node $NODE not found" Whereas others all work as expected (ie, they are running) Without knowing the internals of slurm it feels like nodes that are turned off+cloud state don't exist in the system until they are on? Any other ideas? Thanks Nathan On Wed., 19 Jun. 2019, 4:

Re: [slurm-users] status of cloud nodes

2019-06-18 Thread Chris Samuel
On Tuesday, 18 June 2019 9:36:56 PM PDT nathan norton wrote:
> Just tried running that command, but it only shows nodes that are up and running, doesn’t tell me about any nodes that are down and turned off, as an example please see below. There is a job running that should be using the 100 n

Re: [slurm-users] status of cloud nodes

2019-06-18 Thread nathan norton
Hi, Just tried running that command, but it only shows nodes that are up and running, doesn’t tell me about any nodes that are down and turned off, as an example please see below. There is a job running that should be using the 100 nodes but only 52 are allocated (plus 2 down* (that I know about

Re: [slurm-users] status of cloud nodes

2019-06-18 Thread Sam Gallop (NBI)
Hi Nathan, The command I use to get the reason for failed nodes is 'sinfo -Ral'. If you need to extend the width of the output, use 'sinfo -Ral -O reason:35,user,timestamp,statelong,nodelist'. Then, using the timestamp of the failure, look in the slurmd or slurmctld logs.
--- Sam Gallop
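
The steps in this message can be sketched as a small script. The two sinfo invocations are taken verbatim from the message; the variable name FORMAT and the final scontrol line, which is one way to find where the logs live, are assumptions added for illustration.

```shell
#!/bin/sh
# Field list from the message: reason padded to 35 columns, then who set it,
# when, the long state name, and the affected nodes.
# FORMAT is a hypothetical variable name.
FORMAT='reason:35,user,timestamp,statelong,nodelist'

if command -v sinfo >/dev/null 2>&1; then
    # Default-width listing of nodes that have a reason recorded.
    sinfo -Ral
    # Same listing with explicit, wider columns.
    sinfo -Ral -O "$FORMAT"
    # Assumption: log locations are site-specific, so ask the running
    # configuration which files slurmctld and slurmd write to.
    scontrol show config | grep -i 'LogFile'
else
    echo "sinfo not found; run this on a host with the Slurm client tools" >&2
fi
```

With the timestamp from the sinfo output in hand, search the reported slurmctld or slurmd log file around that time for the node in question.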