Hello, it seems that in a cluster configured for power saving, salloc does not wait until the nodes assigned to the job recover from the powered-down state and return to normal operation.
Although the job is in the state CONFIGURING and the nodes are still in IDLE+NOT_RESPONDING+POWERING_UP, the nodes are declared ready for the job and srun is invoked, which of course fails because the nodes are still booting. (On our cluster, salloc is configured for interactive use: we have LaunchParameters=use_interactive_step in slurm.conf.)

Is this the expected behavior of salloc? srun and sbatch work as expected. We use Slurm 22.05.3.

> salloc --nodelist=taurus-n008 ......
salloc: Waiting for resource configuration
salloc: Nodes taurus-n008 are ready for job
srun: error: Task launch for StepId=766789.interactive failed on node taurus-n008: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted
salloc: Relinquishing job allocation 766789

> scontrol show nodes taurus-n008
......
State=IDLE+NOT_RESPONDING+POWERING_UP
....

> scontrol show job 766789
.....
JobState=CONFIGURING Reason=None Dependency=(null)
NodeList=taurus-n008

Thank you & kind regards
Gizo
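
P.S. To make the condition concrete, here is a minimal Python sketch of the readiness check we would have expected before srun is launched. It only parses the State= field from `scontrol show nodes` output like the one quoted above. The helper names `node_state_flags` and `node_is_ready` are hypothetical, and the set of "not ready" flags is an assumption taken from the state shown in this report; it may need extending for other power-saving states. This is not a description of how salloc itself decides readiness.

```python
import re

# Flags seen in the report above while the node was still booting.
# Assumption: if any of these is set, the node cannot accept a job step yet.
NOT_READY_FLAGS = {"NOT_RESPONDING", "POWERING_UP"}


def node_state_flags(scontrol_output: str) -> list:
    """Return the '+'-separated parts of the first State= field.

    The leading word boundary keeps fields such as JobState= from matching.
    """
    match = re.search(r"\bState=(\S+)", scontrol_output)
    if match is None:
        raise ValueError("no State= field found in scontrol output")
    return match.group(1).split("+")


def node_is_ready(scontrol_output: str) -> bool:
    """True if none of the assumed not-ready flags is set (hypothetical helper)."""
    return NOT_READY_FLAGS.isdisjoint(node_state_flags(scontrol_output))
```

Fed the output of `scontrol show nodes taurus-n008` from above, `node_is_ready` returns False, since both NOT_RESPONDING and POWERING_UP are present in the State= field.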