Hello Everyone,

In 19.05 and previous versions, I was able to run multiple nodes on the same virtual machine or container. After upgrading to 20.02.0, when I run sbatch to kick off a job, it gets stuck in the CF (Configuring) state.

[root@slurmcluster log]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 6    normal     wrap     root CF      13:10      1 c1

The slurmctld.log file shows the following error, which then repeats indefinitely:

==> slurmctld.log <==
[2020-03-22T13:53:28.917] debug2: Tree head got back 1
[2020-03-22T13:53:28.921] debug2: node_did_resp slurmcluster
[2020-03-22T13:53:28.922] debug3: create_mmap_buf: loaded file `/var/spool/slurm/ctld/job_state` as Buf
[2020-03-22T13:53:28.922] debug3: Writing job id 6 to header record of job_state file
[2020-03-22T13:53:58.983] debug2: Testing job time limits and checkpoints
[2020-03-22T13:53:58.983] error: _find_node_record(766): lookup failure for slurmcluster
[2020-03-22T13:53:58.983] error: _find_node_record(778): lookup failure for slurmcluster alias slurmcluster
[2020-03-22T13:54:28.071] debug2: Testing job time limits and checkpoints
[2020-03-22T13:54:28.071] error: _find_node_record(766): lookup failure for slurmcluster
[2020-03-22T13:54:28.071] error: _find_node_record(778): lookup failure for slurmcluster alias slurmcluster
[2020-03-22T13:54:28.071] debug2: Performing purge of old job records
[2020-03-22T13:54:28.071] debug: sched: Running job scheduler
[2020-03-22T13:54:58.119] debug2: Testing job time limits and checkpoints
[2020-03-22T13:54:58.119] error: _find_node_record(766): lookup failure for slurmcluster
[2020-03-22T13:54:58.119] error: _find_node_record(778): lookup failure for slurmcluster alias slurmcluster

Since the error message hints at a name-resolution problem, I tried editing the local /etc/hosts to rule out DNS, but that made no difference.

Here is a link to my slurm.conf:
https://github.com/giovtorres/docker-centos7-slurm/blob/master/files/slurm/slurm.conf

I saw in the Release Notes that FastSchedule=2 was deprecated. I am using FastSchedule=1. Is this deprecated as well? Has this behaviour changed? Sadly, the behaviour of FastSchedule is not documented anywhere.
I'm not even sure that is the crux of the problem here. Any pointers would be greatly appreciated!

Thanks,
Giovanni