Hello,

It seems that in a cluster configured for power saving, salloc does not wait
until the nodes assigned to the job recover from the power-down state and go
back to normal operation.

Although the job is in the state CONFIGURING and the nodes are still in
IDLE+NOT_RESPONDING+POWERING_UP, the nodes are declared ready for the job and
srun is invoked, which of course fails because the nodes are still booting.
(On our cluster, salloc is configured for interactive use: we have
LaunchParameters=use_interactive_step in slurm.conf.)

Is this the expected behavior of salloc?

srun and sbatch work as expected.

We use Slurm 22.05.3.

> salloc --nodelist=taurus-n008
......
salloc: Waiting for resource configuration
salloc: Nodes taurus-n008 are ready for job
srun: error: Task launch for StepId=766789.interactive failed on node taurus-n008: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted
salloc: Relinquishing job allocation 766789

> scontrol show nodes taurus-n008
......
State=IDLE+NOT_RESPONDING+POWERING_UP
....

> scontrol show job 766789
.....
JobState=CONFIGURING Reason=None Dependency=(null)
NodeList=taurus-n008
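
The check that salloc appears to skip can be sketched as a small polling
helper. This is only an illustration under assumptions, not a confirmed fix:
the function names (node_state_ready, extract_node_state, wait_for_node), the
10-second poll interval and the 600-second default timeout are all
hypothetical, and the set of flags treated as "not ready" is a guess based on
the State= value shown above.

```shell
# Hypothetical helper: succeeds (returns 0) only if the node state string
# contains none of the flags we assume mean "not usable yet".
node_state_ready() {
    case "$1" in
        *POWERING_UP*|*POWERING_DOWN*|*POWERED_DOWN*|*NOT_RESPONDING*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical helper: read `scontrol show node` output on stdin and print
# the value of the first State= field that starts a line.
extract_node_state() {
    sed -n 's/^[[:space:]]*State=\([^[:space:]]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: poll a node until it looks ready or a timeout expires.
# Usage: wait_for_node NODE [TIMEOUT_SECONDS]
wait_for_node() {
    node=$1
    timeout=${2:-600}      # assumed default, adjust to the node boot time
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        state=$(scontrol show node "$node" | extract_node_state)
        if [ -n "$state" ] && node_state_ready "$state"; then
            return 0
        fi
        sleep 10           # assumed poll interval
        waited=$((waited + 10))
    done
    return 1               # node did not become ready in time
}
```

For example, `wait_for_node taurus-n008 && srun hostname` would launch the
step only once the node no longer reports POWERING_UP or NOT_RESPONDING.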

Thank you & kind regards
Gizo
