Hi,

I don't know if this is the cause, but I think setting "ControlMachine=localhost" and not giving the Slurm master node a hostname are not good decisions. How can the compute nodes determine the IP address of the Slurm master node from "localhost"? Also, I suggest not using capital letters in anything related to Slurm (hostnames, for example).
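For example, assuming the master node's actual hostname is "firemaster" (a placeholder, not a real name from this thread) and that name resolves on every node, the relevant slurm.conf line would be something like:

```
# On all nodes; "firemaster" is a placeholder for the controller's real hostname
SlurmctldHost=firemaster   # modern replacement for the deprecated ControlMachine
```

With "localhost", each compute node resolves the controller's address to 127.0.0.1, i.e. to itself.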

Ahmet M.


On 15.12.2020 21:15, Avery Grieve wrote:
I changed my .service file to write to a log. The slurmd daemons are running (started manually) on the compute nodes. I get this at boot with the slurmctld service enabled:

[2020-12-15T18:09:06.412] slurmctld version 20.11.1 started on cluster cluster
[2020-12-15T18:09:06.539] No memory enforcing mechanism configured.
[2020-12-15T18:09:06.572] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode1"
[2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
[2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode1
[2020-12-15T18:09:06.573] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode2"
[2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
[2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode2
[2020-12-15T18:09:06.573] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-15T18:09:06.573] error: slurm_set_addr: Unable to resolve "FireNode3"
[2020-12-15T18:09:06.573] error: slurm_get_port: Address family '0' not supported
[2020-12-15T18:09:06.573] error: _set_slurmd_addr: failure on FireNode3
[2020-12-15T18:09:06.578] Recovered state of 3 nodes
[2020-12-15T18:09:06.579] Recovered information about 0 jobs
[2020-12-15T18:09:06.582] Recovered state of 0 reservations
[2020-12-15T18:09:06.582] read_slurm_conf: backup_controller not specified
[2020-12-15T18:09:06.583] Running as primary controller
[2020-12-15T18:09:06.592] No parameter for mcs plugin, default values set
[2020-12-15T18:09:06.592] mcs: MCSParameters = (null). ondemand set.
[2020-12-15T18:09:06.595] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-15T18:09:06.595] error: slurm_set_addr: Unable to resolve "(null)"
[2020-12-15T18:09:06.595] error: slurm_set_port: attempting to set port without address family
[2020-12-15T18:09:06.603] error: Error creating slurm stream socket: Address family not supported by protocol
[2020-12-15T18:09:06.603] fatal: slurm_init_msg_engine_port error Address family not supported by protocol

The main errors seem to be failures to resolve hostnames and the resulting inability to set the port. My /etc/hosts file defines the FireNode[1-3] host IPs and does not contain any IPv6 entries.
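For reference, a sketch of the kind of /etc/hosts entries involved (the addresses below are illustrative placeholders, not my actual ones):

```
# /etc/hosts on the controller; addresses are placeholders
192.168.1.11  FireNode1
192.168.1.12  FireNode2
192.168.1.13  FireNode3
```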

My service file also includes an "After=network-online.target" clause.
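For what it's worth, ordering alone isn't enough: After= only sequences the unit and does not pull network-online.target into the boot transaction, so it normally needs to be paired with Wants=. A sketch of the relevant [Unit] lines:

```
[Unit]
Wants=network-online.target
After=network-online.target
```

Note that network-online.target only actually waits for the network if a wait-online service (NetworkManager-wait-online.service on most Fedora setups) is enabled.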

Now, when I start the daemon manually with "systemctl start slurmctld", I end up with the following log:

[2020-12-15T18:14:03.448] slurmctld version 20.11.1 started on cluster cluster
[2020-12-15T18:14:03.456] No memory enforcing mechanism configured.
[2020-12-15T18:14:03.465] Recovered state of 3 nodes
[2020-12-15T18:14:03.465] Recovered information about 0 jobs
[2020-12-15T18:14:03.465] Recovered state of 0 reservations
[2020-12-15T18:14:03.466] read_slurm_conf: backup_controller not specified
[2020-12-15T18:14:03.466] Running as primary controller
[2020-12-15T18:14:03.466] No parameter for mcs plugin, default values set
[2020-12-15T18:14:03.466] mcs: MCSParameters = (null). ondemand set.

As you can see, it starts up fine when started manually. It seems like the network stack isn't fully configured yet when the service starts at boot. I'm not really sure where to begin troubleshooting this, and a bit of googling hasn't revealed much either, unfortunately.
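One way to narrow it down might be a quick manual check of name resolution, since getent goes through the same NSS path (files, dns, ...) that getaddrinfo() uses (node names taken from the log above):

```shell
# Check whether the resolver can map each node name to an address
for host in FireNode1 FireNode2 FireNode3; do
    if getent hosts "$host" >/dev/null; then
        echo "$host resolves"
    else
        echo "$host does NOT resolve"
    fi
done
```

Running the same check from a throwaway unit ordered like slurmctld would show whether resolution is actually available at that point in boot.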

Any advice?

~Avery Grieve
They/Them/Theirs please!
University of Michigan


On Tue, Dec 15, 2020 at 11:53 AM Avery Grieve <agri...@umich.edu> wrote:

    Maybe a silly question, but where do you find the daemon logs or
    specify their location?
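
    (For reference: the locations are set in slurm.conf via SlurmctldLogFile and SlurmdLogFile; if unset, the daemons log to syslog. A sketch, with the slurmctld path taken from the config later in this thread and the slurmd path purely hypothetical:)

    ```
    SlurmctldLogFile=/home/slurm/log/slurmctld.log
    SlurmdLogFile=/var/log/slurm/slurmd.log
    ```

    On a running cluster, "scontrol show config" also reports the effective values.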

    ~Avery Grieve
    They/Them/Theirs please!
    University of Michigan


    On Mon, Dec 14, 2020 at 7:22 PM Alpha Experiment
    <projectalpha...@gmail.com> wrote:

        Hi,

        I am trying to run Slurm on Fedora 33. Upon boot, the slurmd
        daemon is running correctly; however, the slurmctld daemon
        always fails.
        [admin@localhost ~]$ systemctl status slurmd.service
        ● slurmd.service - Slurm node daemon
             Loaded: loaded (/etc/systemd/system/slurmd.service;
        enabled; vendor preset: disabled)
             Active: active (running) since Mon 2020-12-14 16:02:18
        PST; 11min ago
           Main PID: 2363 (slurmd)
              Tasks: 2
             Memory: 3.4M
                CPU: 211ms
             CGroup: /system.slice/slurmd.service
                     └─2363 /usr/local/sbin/slurmd -D
        Dec 14 16:02:18 localhost.localdomain systemd[1]: Started
        Slurm node daemon.
        [admin@localhost ~]$ systemctl status slurmctld.service
        ● slurmctld.service - Slurm controller daemon
             Loaded: loaded (/etc/systemd/system/slurmctld.service;
        enabled; vendor preset: disabled)
            Drop-In: /etc/systemd/system/slurmctld.service.d
                     └─override.conf
             Active: failed (Result: exit-code) since Mon 2020-12-14
        16:02:12 PST; 11min ago
            Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D
        $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
           Main PID: 1972 (code=exited, status=1/FAILURE)
                CPU: 21ms
        Dec 14 16:02:12 localhost.localdomain systemd[1]: Started
        Slurm controller daemon.
        Dec 14 16:02:12 localhost.localdomain systemd[1]:
        slurmctld.service: Main process exited, code=exited,
        status=1/FAILURE
        Dec 14 16:02:12 localhost.localdomain systemd[1]:
        slurmctld.service: Failed with result 'exit-code'.

        The slurmctld log is as follows:
        [2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on
        cluster cluster
        [2020-12-14T16:02:12.739] No memory enforcing mechanism
        configured.
        [2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo()
        failed: Name or service not known
        [2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to
        resolve "localhost"
        [2020-12-14T16:02:12.772] error: slurm_get_port: Address
        family '0' not supported
        [2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on
        localhost
        [2020-12-14T16:02:12.772] Recovered state of 1 nodes
        [2020-12-14T16:02:12.772] Recovered information about 0 jobs
        [2020-12-14T16:02:12.772] select/cons_tres:
        part_data_create_array: select/cons_tres: preparing for 1
        partitions
        [2020-12-14T16:02:12.779] Recovered state of 0 reservations
        [2020-12-14T16:02:12.779] read_slurm_conf: backup_controller
        not specified
        [2020-12-14T16:02:12.779] select/cons_tres:
        select_p_reconfigure: select/cons_tres: reconfigure
        [2020-12-14T16:02:12.779] select/cons_tres:
        part_data_create_array: select/cons_tres: preparing for 1
        partitions
        [2020-12-14T16:02:12.779] Running as primary controller
        [2020-12-14T16:02:12.780] No parameter for mcs plugin, default
        values set
        [2020-12-14T16:02:12.780] mcs: MCSParameters = (null).
        ondemand set.
        [2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo()
        failed: Name or service not known
        [2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to
        resolve "(null)"
        [2020-12-14T16:02:12.780] error: slurm_set_port: attempting to
        set port without address family
        [2020-12-14T16:02:12.782] error: Error creating slurm stream
        socket: Address family not supported by protocol
        [2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port
        error Address family not supported by protocol

        Strangely, the daemon works fine once it is restarted. After
        running
        systemctl restart slurmctld.service

        the service status is
        [admin@localhost ~]$ systemctl status slurmctld.service
        ● slurmctld.service - Slurm controller daemon
             Loaded: loaded (/etc/systemd/system/slurmctld.service;
        enabled; vendor preset: disabled)
            Drop-In: /etc/systemd/system/slurmctld.service.d
                     └─override.conf
             Active: active (running) since Mon 2020-12-14 16:14:24
        PST; 3s ago
           Main PID: 2815 (slurmctld)
              Tasks: 7
             Memory: 1.9M
                CPU: 15ms
             CGroup: /system.slice/slurmctld.service
                     └─2815 /usr/local/sbin/slurmctld -D
        Dec 14 16:14:24 localhost.localdomain systemd[1]: Started
        Slurm controller daemon.

        Could anyone point me towards how to fix this? I expect it's
        just an issue with my configuration file, which I've copied
        below for reference.
        # slurm.conf file generated by configurator easy.html.
        # Put this file on all nodes of your cluster.
        # See the slurm.conf man page for more information.
        #
        #SlurmctldHost=localhost
        ControlMachine=localhost
        #
        #MailProg=/bin/mail
        MpiDefault=none
        #MpiParams=ports=#-#
        ProctrackType=proctrack/cgroup
        ReturnToService=1
        SlurmctldPidFile=/home/slurm/run/slurmctld.pid
        #SlurmctldPort=6817
        SlurmdPidFile=/home/slurm/run/slurmd.pid
        #SlurmdPort=6818
        SlurmdSpoolDir=/var/spool/slurm/slurmd/
        SlurmUser=slurm
        #SlurmdUser=root
        StateSaveLocation=/home/slurm/spool/
        SwitchType=switch/none
        TaskPlugin=task/affinity
        #
        #
        # TIMERS
        #KillWait=30
        #MinJobAge=300
        #SlurmctldTimeout=120
        #SlurmdTimeout=300
        #
        #
        # SCHEDULING
        SchedulerType=sched/backfill
        SelectType=select/cons_tres
        SelectTypeParameters=CR_Core
        #
        #
        # LOGGING AND ACCOUNTING
        AccountingStorageType=accounting_storage/none
        ClusterName=cluster
        #JobAcctGatherFrequency=30
        JobAcctGatherType=jobacct_gather/none
        #SlurmctldDebug=info
        SlurmctldLogFile=/home/slurm/log/slurmctld.log
        #SlurmdDebug=info
        #SlurmdLogFile=
        #
        #
        # COMPUTE NODES
        NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1
        CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
        PartitionName=full Nodes=localhost Default=YES
        MaxTime=INFINITE State=UP

        Thanks!
        -John

