Hi,

I have been maintaining a Slurm simulator <https://hub.docker.com/repository/registry-1.docker.io/hpcnow/slurm_simulator/general> for ages. I have everything automated in order to try new features and keep my configuration up to date, version after version. Unfortunately, since version 21, the front-end mode makes the slurmd daemon fail at startup with the following error message:

slurmd: error: _find_node_record: lookup failure for node "slurm-simulator"
slurmd: error: _find_node_record: lookup failure for node "slurm-simulator", alias "slurm-simulator"
slurmd: error: slurmd initialization failed

The exact same container, with the same configuration but using version 20.11.9, works just fine. I reproduced the same steps manually in a VM to rule out noise introduced by the container, and the result was the same.

The configuration below is the one available in the container.

[root@slurm-simulator /]# cat /etc/slurm/slurm.conf
ClusterName=simulator
SlurmctldHost=slurm-simulator
FrontendName=slurm-simulator
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/slurmdbd
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
SlurmdParameters=config_overrides
include /etc/slurm/nodes.conf
include /etc/slurm/partitions.conf

[root@slurm-simulator /]# cat /etc/slurm/nodes.conf
NodeName=node[001-10] RealMemory=248000 Sockets=2 CoresPerSocket=32 ThreadsPerCore=1 State=UNKNOWN NodeAddr=slurm-simulator NodeHostName=slurm-simulator

[root@slurm-simulator /]# cat /etc/slurm/partitions.conf
PartitionName=long Nodes=node[001-10] Default=YES State=UP OverSubscribe=NO MaxTime=14-00:00:00

The error can be reproduced by running the following commands:

docker run --rm --detach \
  --name "${USER}_simulator" \
  -h "slurm-simulator" \
  --security-opt seccomp:unconfined \
  --privileged -e container=docker \
  -v /run -v /sys/fs/cgroup:/sys/fs/cgroup \
  --cgroupns=host \
  hpcnow/slurm_simulator:21.08.8-2 /usr/sbin/init
docker exec -ti "${USER}_simulator" /bin/bash
slurmd -D -vvvvv

If you run the same commands with the 20.11.9 image, slurmd starts without errors.

I have tried using the new SlurmdParameters=config_overrides option, but I still get the same problem.

Any ideas or suggestions?

Thanks!

On Mon, 11 Jul 2022 at 23:21, Jordi Blasco <jbllis...@gmail.com> wrote:

> Thanks, Ole,
>
> I checked /etc/nsswitch.conf and I have even set up a dnsmasq service,
> just in case.
>
> [root@slurm-simulator /]# cat /etc/nsswitch.conf | grep hosts
> # Valid databases are: aliases, ethers, group, gshadow, hosts,
> hosts: files dns myhostname
>
> [root@slurm-simulator /]# ping slurm-simulator -c 1
> PING slurm-simulator (172.17.0.4) 56(84) bytes of data.
> 64 bytes from slurm-simulator (172.17.0.4): icmp_seq=1 ttl=64 time=0.022 ms
>
> --- slurm-simulator ping statistics ---
> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> rtt min/avg/max/mdev = 0.022/0.022/0.022/0.000 ms
>
> [root@slurm-simulator /]# cat /etc/resolv.conf | grep -v "^#"
> nameserver 172.17.0.4
> nameserver 172.31.0.2
> search eu-west-3.compute.internal
> [root@slurm-simulator /]# host slurm-simulator
> slurm-simulator has address 172.17.0.4
> [root@slurm-simulator /]# host 172.17.0.4
> 4.0.17.172.in-addr.arpa domain name pointer slurm-simulator.
>
> Regards,
>
> Jordi
>
> On Mon, 11 Jul 2022 at 23:09, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
> wrote:
>
>> On 7/11/22 12:54, Jordi Blasco wrote:
>> > I use the front-end node mode
>> > <https://slurm.schedmd.com/faq.html#multi_slurmd> to emulate a real
>> > cluster in order to validate the Slurm configuration in a Docker
>> > container and develop custom plugins. With versions 21.08.8-2 and
>> > 22.05.2, slurmd is complaining about not being able to find the
>> > frontend node.
>> >
>> > slurmd -D -vvv
>> > ...
>> > slurmd: error: _find_node_record: lookup failure for node "slurm-simulator"
>> > slurmd: error: _find_node_record: lookup failure for node "slurm-simulator", alias "slurm-simulator"
>> > slurmd: error: slurmd initialization failed
>>
>> This could be a DNS lookup issue. Can you ping the node named
>> "slurm-simulator"?
>>
>> /Ole
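
For anyone repeating the name-resolution checks from this thread, here is a minimal sketch that collects every host name referenced by the configuration files and asks the system resolver about each one. The function names `conf_hostnames` and `check_hostnames` are hypothetical, and scanning only the SlurmctldHost, FrontendName, NodeAddr and NodeHostName keys is an assumption made for illustration. A clean result only rules out the DNS explanation; it does not show why slurmd rejects the node.

```shell
#!/bin/sh
# Sketch only: helper names and the choice of keys are assumptions,
# not part of Slurm itself.

# Print the unique values of the host-related keys found in the given
# slurm.conf-style files, one per line.
conf_hostnames() {
    cat "$@" \
        | tr -s '[:space:]' '\n' \
        | grep -E '^(SlurmctldHost|FrontendName|NodeAddr|NodeHostName)=' \
        | cut -d= -f2 \
        | sort -u
}

# Resolve each collected name with getent (glibc systems) and report it.
# Returns non-zero if at least one name does not resolve.
check_hostnames() {
    status=0
    for name in $(conf_hostnames "$@"); do
        if getent hosts "$name" >/dev/null 2>&1; then
            echo "ok: $name"
        else
            echo "unresolved: $name"
            status=1
        fi
    done
    return "$status"
}
```

Inside the container this could be run as `check_hostnames /etc/slurm/slurm.conf /etc/slurm/nodes.conf`. Since `getent hosts` goes through the `hosts:` line of /etc/nsswitch.conf, it follows the same lookup order that ping uses.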