Hi, I'm using Slurm's elastic compute functionality to spin up nodes in the cloud, alongside a controller which is also in the cloud.
When executing a job, Slurm correctly places a node into the state "alloc#" and calls my resume program. My resume program successfully provisions the cloud node, and slurmd comes up without a problem. My resume program then retrieves the IP address of my cloud node and updates the controller as follows:

    scontrol update nodename=foo nodeaddr=bar

And then nothing happens! The node remains in the state "alloc#" until the ResumeTimeout is reached, at which point the controller gives up.

I'm fairly confident that slurmd is able to talk to the controller because if I specify an incorrect hostname for the controller in my slurm.conf, slurmd immediately errors on startup and exits with a message saying something like "unable to contact controller".

What am I missing? Thanks very much in advance if anybody has any ideas!