Hi list,


I have a problem with Slurm's high availability.
In my environment, I store the state files on the local disk of each Ctld node,
and use a shell script that checks the output of "scontrol ping" to sync the
files every 2s (a shorter interval hurts server throughput).
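
A minimal sketch of this kind of sync loop, for illustration only. It assumes
rsync over SSH and that "scontrol ping" prints lines such as
"Slurmctld(primary) at ctld1 is UP"; the state directory, the host name ctld2
and the function names are placeholders, not taken from a real setup.

```shell
#!/bin/sh
# Hypothetical values: set them to the real StateSaveLocation and backup host.
STATE_DIR="${STATE_DIR:-/var/spool/slurmctld}"
BACKUP_HOST="${BACKUP_HOST:-ctld2}"
INTERVAL="${INTERVAL:-2}"

# Succeeds if the "scontrol ping" output on stdin reports the primary as UP.
# Assumed line format: "Slurmctld(primary) at ctld1 is UP".
primary_is_up() {
    grep -Eq 'Slurmctld\(primary\).* is UP'
}

# Copy the state directory to the backup controller (assumes rsync over SSH).
sync_state() {
    rsync -a --delete "$STATE_DIR/" "$BACKUP_HOST:$STATE_DIR/"
}

# Start the loop only when the script is called with the argument "run".
if [ "${1:-}" = "run" ]; then
    while :; do
        if scontrol ping | primary_is_up; then
            sync_state
        fi
        sleep "$INTERVAL"
    done
fi
```

This would run on the primary Ctld node. After a failover the direction of the
copy has to be reversed, which the sketch does not handle.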


When I test Slurm HA, I see two scenarios:
1. With the heartbeat method, failover takes about the time configured in
slurm.conf.
2. With the command "scontrol takeover 1", it takes between 2.5s and 3s.


The shell script method works well in scenario 1.
But in the second scenario, I found it is not a good way to synchronize the
state files from the old main Ctld to the new main Ctld.


I have several questions:
1. What is your preferred way to handle state files for HA? I did not find
useful information on the Slurm website.
2. What is the best way to sync state files with a shell script? I went through
the code for the "SlurmctldPrimaryOffProg" and "SlurmctldPrimaryOnProg"
parameters and found that the OffProg is the better place to do a final sync.
Is this idea OK for this scenario?
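
To make question 2 concrete, here is a rough sketch of the final sync idea. It
assumes rsync over SSH; the script name final_sync.sh, the state directory and
the host name ctld2 are placeholders. Whether the copy finishes before the new
main Ctld loads its state is not something this sketch guarantees.

```shell
#!/bin/sh
# Hypothetical script, referenced from slurm.conf for example as:
#   SlurmctldPrimaryOffProg=/usr/local/sbin/final_sync.sh
STATE_DIR="${STATE_DIR:-/var/spool/slurmctld}"
NEW_PRIMARY="${NEW_PRIMARY:-ctld2}"
RSYNC="${RSYNC:-rsync}"

# Push the state directory once to the node that takes over.
final_sync() {
    "$RSYNC" -a --delete "$STATE_DIR/" "$NEW_PRIMARY:$STATE_DIR/"
}

# Run the sync only when executed under the script's own name, so the
# function can also be sourced and tried out without copying anything.
case "$0" in
    *final_sync.sh) final_sync ;;
esac
```

Setting RSYNC=echo prints the command instead of running it, which is handy for
checking the paths first.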




Thanks