Hello,
Slurm Power Saving (19.05) was configured successfully within our cloud 
environment. Jobs can be submitted, and nodes get provisioned and deprovisioned 
as expected. Unfortunately, there seems to be an edge case (or a config issue 
:-D).

After a job (jobA) is submitted to partition A, node provisioning starts. During 
that phase another job (jobB) is submitted to the same partition, explicitly 
requesting the same node with -w (not sure if this is really a must-have right 
now). The edge case stems from our application-side job scheduling.

Unfortunately, jobB runs before jobA and fails, a few seconds after jobA 
finishes successfully. Therefore the configuration itself should be OK overall.
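
For reference, the submission pattern looks roughly like this (partition, node 
and script names below are only placeholders for what our application scheduler 
actually submits):

    # jobA: first job on the partition; triggers power-up of the cloud node
    sbatch -p partA -w mynodename --wrap "srun ./payload_a.sh"

    # jobB: submitted a few seconds later, while mynodename is still powering
    # up, and pinned to the same node via -w
    sbatch -p partA -w mynodename --wrap "srun ./payload_b.sh"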
srun: error: Unable to resolve "mynodename": Host name lookup failure
srun: error: fwd_tree_thread: can't find address for host mynodename check slurm.conf
srun: error: Task launch for 123456.0 failed on node mynodename: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 188 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
It looks like slurmctld applies some magic to jobA (Resetting JobId=jobidA 
start time for node power up) but not to jobB.
update_node: node mynodename state set to ALLOCATED
Node mynodename2 now responding
Node mynodename now responding
update_node: node mynodename state set to ALLOCATED
_pick_step_nodes: Configuration for JobId=jobidB is complete
job_step_signal: JobId=jobidB StepId=0 not found
_pick_step_nodes: Configuration for JobId=jobidA is complete
Resetting JobId=jobidA start time for node power up
_job_complete: JobId=jobidA WEXITSTATUS 0
_job_complete: JobId=jobidA done
job_step_signal: JobId=jobidB StepId=0 not found
_job_complete: JobId=jobidB WTERMSIG 116
_job_complete: JobId=jobidB done

Has anyone seen this before, or does anyone have an idea how to fix it?


Thanks & Best
Eg. Bo.
