However, in testing, even with ReturnToService set to 1, the logs show that
slurmctld sees a restarted node respond again, but the node is still classified
as down and will not take jobs until its state is changed manually.


[2020-11-30T10:33:05.402] debug2: node_did_resp SRVGRIDSLURM01
[2020-11-30T10:33:05.402] debug2: node_did_resp srvgridslurm03
[2020-11-30T10:33:05.402] debug2: node_did_resp SRVGRIDSLURM02

There has to be a way to avoid this manual intervention.

thanks

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Steve Bland
Sent: Monday, November 30, 2020 08:12
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

Thanks Chris

When I did that, they all came back.

I also found that ReturnToService was set to 0 in slurm.conf, so I modified
that for now. I may set it back to 0 to see if any nodes are lost, but I assume
that would show up in the log.
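
For reference, a sketch of how the slurm.conf man page describes the three
ReturnToService values. The comments paraphrase the documentation, and the
value shown is only an example, not a recommendation made in this thread:

```
# slurm.conf (fragment)
# ReturnToService=0  A DOWN node stays DOWN until an administrator
#                    changes its state.
# ReturnToService=1  A DOWN node returns to service when it registers
#                    with a valid configuration, but only if it was set
#                    DOWN for being non-responsive. A node set DOWN for
#                    another reason (for example an unexpected reboot)
#                    is not changed automatically.
# ReturnToService=2  A DOWN node returns to service when it registers
#                    with a valid configuration, whatever the reason it
#                    was set DOWN.
ReturnToService=1
```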

Interestingly, I had this in slurm.conf and thought it would make the initial
state up for all nodes:

PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
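
As far as the slurm.conf documentation goes, State on a PartitionName line is
the state of the partition, not of its nodes. The initial node state is set
per node with the State parameter of a NodeName line, which defaults to
UNKNOWN. A hypothetical fragment, with node names taken from this thread and
all other node parameters omitted:

```
# Partition state: UP means the partition accepts and runs jobs.
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# Node state (hypothetical line, not from the original slurm.conf):
# UNKNOWN leaves it to slurmctld to work out the real state once
# slurmd on each node registers.
NodeName=srvgridslurm[01-03] State=UNKNOWN
```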


Steve Bland
Technical Product Manager
Third Party Products
Ross Video | Production Technology Experts
T: +1 (613) 228-0688 ext.4219
www.rossvideo.com
________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Chris Samuel <ch...@csamuel.org>
Sent: 27 November 2020 15:02
To: slurm-users@lists.schedmd.com
Subject: [EXTERNAL] Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

On 26/11/20 9:21 am, Steve Bland wrote:

> Sinfo always returns nodes not responding

One thing - do the nodes return to this state when you resume them with
"scontrol update node=srvgridslurm[01-03] state=resume" ?

If they do, then what do your slurmctld logs give as the reason for this?

You can bump up the log level on your slurmctld with, for instance,
"scontrol setdebug debug" for more info (we run ours at debug all the
time anyway).
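
The checks above can be sketched as a short shell sequence. This is only an
illustration: down_nodes is a hypothetical helper, not a Slurm command, and
the node names are the ones used in this thread. The commented commands need
to be run on the controller host with Slurm administrator rights.

```shell
#!/bin/sh
# Sketch only: down_nodes is a hypothetical helper, not part of Slurm.

# Read "node state" pairs on stdin (as printed by: sinfo -N -h -o "%N %T")
# and print the names of nodes whose state begins with "down".
# A trailing "*" on the state (e.g. "down*") marks a non-responding node.
# sort -u removes repeats, since sinfo -N lists a node once per partition.
down_nodes() {
    awk '$2 ~ /^down/ { print $1 }' | sort -u
}

# Possible sequence:
#   sinfo -N -h -o "%N %T" | down_nodes    # which nodes are down?
#   sinfo -R                               # reason recorded for each one
#   scontrol update nodename=srvgridslurm[01-03] state=resume
#   scontrol setdebug debug2               # more detail in the slurmctld log
#   scontrol setdebug info                 # back to the default level
```

If the nodes drop back to down after the resume, the reason column of
"sinfo -R" and the slurmctld log at the raised debug level are the first
places to look.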

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
----------------------------------------------

This e-mail and any attachments may contain information that is confidential to 
Ross Video.

If you are not the intended recipient, please notify me immediately by replying 
to this message. Please also delete all copies. Thank you.