Hi Brian,

My hosts file looks like this:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
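In case it is useful, the lookup slurmctld is failing on can be reproduced from the shell with getent, which also resolves through getaddrinfo() (just the commands as a sanity check; I have not pasted the output here):

getent ahosts localhost      # all address families
getent ahostsv4 localhost    # IPv4 only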
I believe the second line is an IPv6 address. Is it safe to delete that line?

Best,
John

On Mon, Dec 14, 2020 at 11:10 PM Brian Andrus <toomuc...@gmail.com> wrote:
>
> Check your hosts file and ensure 'localhost' does not have an IPV6 address associated with it.
>
> Brian Andrus
>
> On 12/14/2020 4:19 PM, Alpha Experiment wrote:
> > Hi,
> >
> > I am trying to run slurm on Fedora 33. Upon boot the slurmd daemon is running correctly; however, the slurmctld daemon always errors.
> >
> > [admin@localhost ~]$ systemctl status slurmd.service
> > ● slurmd.service - Slurm node daemon
> >      Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
> >      Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
> >    Main PID: 2363 (slurmd)
> >       Tasks: 2
> >      Memory: 3.4M
> >         CPU: 211ms
> >      CGroup: /system.slice/slurmd.service
> >              └─2363 /usr/local/sbin/slurmd -D
> >
> > Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.
> >
> > [admin@localhost ~]$ systemctl status slurmctld.service
> > ● slurmctld.service - Slurm controller daemon
> >      Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
> >     Drop-In: /etc/systemd/system/slurmctld.service.d
> >              └─override.conf
> >      Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago
> >     Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
> >    Main PID: 1972 (code=exited, status=1/FAILURE)
> >         CPU: 21ms
> >
> > Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.
> > Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
> > Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.
> >
> > The slurmctld log is as follows:
> >
> > [2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
> > [2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
> > [2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
> > [2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
> > [2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
> > [2020-12-14T16:02:12.772] Recovered state of 1 nodes
> > [2020-12-14T16:02:12.772] Recovered information about 0 jobs
> > [2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> > [2020-12-14T16:02:12.779] Recovered state of 0 reservations
> > [2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
> > [2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
> > [2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
> > [2020-12-14T16:02:12.779] Running as primary controller
> > [2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
> > [2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
> > [2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
> > [2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
> > [2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
> > [2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
> > [2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol
> >
> > Strangely, the daemon works fine when it is restarted. After running
> >
> > systemctl restart slurmctld.service
> >
> > the service status is
> >
> > [admin@localhost ~]$ systemctl status slurmctld.service
> > ● slurmctld.service - Slurm controller daemon
> >      Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
> >     Drop-In: /etc/systemd/system/slurmctld.service.d
> >              └─override.conf
> >      Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
> >    Main PID: 2815 (slurmctld)
> >       Tasks: 7
> >      Memory: 1.9M
> >         CPU: 15ms
> >      CGroup: /system.slice/slurmctld.service
> >              └─2815 /usr/local/sbin/slurmctld -D
> >
> > Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.
> >
> > Could anyone point me towards how to fix this? I expect it's just an issue with my configuration file, which I've copied below for reference.
> >
> > # slurm.conf file generated by configurator easy.html.
> > # Put this file on all nodes of your cluster.
> > # See the slurm.conf man page for more information.
> > #
> > #SlurmctldHost=localhost
> > ControlMachine=localhost
> > #
> > #MailProg=/bin/mail
> > MpiDefault=none
> > #MpiParams=ports=#-#
> > ProctrackType=proctrack/cgroup
> > ReturnToService=1
> > SlurmctldPidFile=/home/slurm/run/slurmctld.pid
> > #SlurmctldPort=6817
> > SlurmdPidFile=/home/slurm/run/slurmd.pid
> > #SlurmdPort=6818
> > SlurmdSpoolDir=/var/spool/slurm/slurmd/
> > SlurmUser=slurm
> > #SlurmdUser=root
> > StateSaveLocation=/home/slurm/spool/
> > SwitchType=switch/none
> > TaskPlugin=task/affinity
> > #
> > #
> > # TIMERS
> > #KillWait=30
> > #MinJobAge=300
> > #SlurmctldTimeout=120
> > #SlurmdTimeout=300
> > #
> > #
> > # SCHEDULING
> > SchedulerType=sched/backfill
> > SelectType=select/cons_tres
> > SelectTypeParameters=CR_Core
> > #
> > #
> > # LOGGING AND ACCOUNTING
> > AccountingStorageType=accounting_storage/none
> > ClusterName=cluster
> > #JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/none
> > #SlurmctldDebug=info
> > SlurmctldLogFile=/home/slurm/log/slurmctld.log
> > #SlurmdDebug=info
> > #SlurmdLogFile=
> > #
> > #
> > # COMPUTE NODES
> > NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
> > PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
> >
> > Thanks!
> > -John
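If I am reading Brian's suggestion correctly, the change would be to stop associating 'localhost' with the IPv6 address rather than to remove anything else, so after backing the file up, /etc/hosts would end up looking roughly like this (just my reading of his suggestion, not something I have verified):

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost6 localhost6.localdomain6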