Hello everyone,
I installed Slurm 19.05.5 from the Ubuntu repo for the first time, on a cluster with 44 identical nodes, but I have a problem with slurmctld.service. When I try to start slurmctld I get the following message:

fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files

* Ubuntu 20.04.2 runs on the server and the nodes, in exactly the same version.
* munge 0.5.13, installed from the Ubuntu repo, is running on the server and the nodes.
* mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu)), installed from the Ubuntu repo, is running on the server.

slurm.conf is the same on all nodes and on the server.
slurmd.service is active and running on all nodes without problems.
mysql.service is active and running on the server.
slurmdbd.service is active and running on the server (slurm_acct_db created).

Please find attached slurm.conf, slurmdbd.conf, and the detailed output of the slurmctld -Dvvvv command.

Any hint?

Thanks in advance,
jb
slurmctld: debug: Log file re-opened
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: slurmctld version 19.05.5 started on cluster tuc
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/cred_munge.so
slurmctld: Munge credential signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so
slurmctld: debug: Munge authentication plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cons_tres.so
slurmctld: select/cons_tres loaded with argument 4372
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_linear.so
slurmctld: Linear node selection plugin loaded with argument 4372
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cray_aries.so
slurmctld: Cray/Aries node selection plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cons_res.so
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 4372
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/gres_gpu.so
slurmctld: debug: init: Gres GPU plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: debug: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_energy_none.so
slurmctld: debug: AcctGatherEnergy NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_profile_none.so
slurmctld: debug: AcctGatherProfile NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_interconnect_none.so
slurmctld: debug: AcctGatherInterconnect NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_filesystem_none.so
slurmctld: debug: AcctGatherFilesystem NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/jobacct_gather_cgroup.so
slurmctld: debug: Job accounting gather cgroup plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/ext_sensors_none.so
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/switch_none.so
slurmctld: debug: switch NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug: power_save module disabled, SuspendTime < 0
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/accounting_storage_slurmdbd.so
slurmctld: Accounting storage SLURMDBD plugin loaded
slurmctld: debug3: Success.
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 104 of 255 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: persistent connection experienced an error
slurmctld: error: Persistent Conn: only read 0 of 65385 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 104 of 255 bytes
slurmctld: error: persistent connection experienced an error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: persistent connection experienced an error
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: persistent connection experienced an error
slurmctld: error: Persistent Conn: only read 0 of 65385 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: debug: slurm_send_timeout: Socket POLLERR
slurmctld: debug3: slurm_msg_sendto: peer has disappeared for msg_type=6500
slurmctld: error: slurm_persist_conn_open: failed to send persistent connection init message to se01.grid.tuc.gr:3306
slurmctld: error: slurm_persist_conn_open: No response to persist_init
slurmctld: error: slurmdbd: Sending PersistInit msg: Transport endpoint is not connected
slurmctld: debug4: slurmdbd: There is no state save file to open by name /var/spool/slurm/ctld/dbd.messages
slurmctld: debug: Association database appears down, reading from state file.
slurmctld: debug: create_mmap_buf: Failed to mmap file `/var/spool/slurm/ctld/last_tres`, No such device
slurmctld: debug2: No last_tres file (/var/spool/slurm/ctld/last_tres) to recover
slurmctld: debug: create_mmap_buf: Failed to mmap file `/var/spool/slurm/ctld/assoc_mgr_state`, No such device
slurmctld: debug2: No association state file (/var/spool/slurm/ctld/assoc_mgr_state) to recover
slurmctld: fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files.
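
One detail in the log above may be worth checking: the persistent connection is opened to se01.grid.tuc.gr:3306. Port 3306 is the MySQL default, while slurmdbd listens on 6819 by default (DbdPort in slurmdbd.conf). If AccountingStoragePort in slurm.conf is set to 3306, slurmctld would be talking to mysqld directly instead of slurmdbd, which could explain the malformed "Persistent Conn" reads. This is only a guess, since the attached config files are not readable here. A minimal Python sketch for pulling that endpoint out of a saved log follows; the function names are hypothetical and the two port constants are the usual defaults, which a site may override.

```python
import re

# Default ports, stated here as assumptions: either one can be changed per site.
MYSQL_DEFAULT_PORT = 3306     # MySQL server default
SLURMDBD_DEFAULT_PORT = 6819  # slurmdbd default (DbdPort)

# Matches the slurm_persist_conn_open failure line seen in the log above.
ENDPOINT_RE = re.compile(
    r"failed to send persistent connection init message to "
    r"(?P<host>[^\s:]+):(?P<port>\d+)"
)


def find_persist_endpoint(log_text):
    """Return (host, port) from the first matching log line, or None."""
    match = ENDPOINT_RE.search(log_text)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


def describe_port(port):
    """Give a hedged hint about what usually listens on the given port."""
    if port == MYSQL_DEFAULT_PORT:
        return "MySQL default port: slurmctld may be pointed at mysqld, not slurmdbd"
    if port == SLURMDBD_DEFAULT_PORT:
        return "slurmdbd default port"
    return "non-default port: compare with DbdPort in slurmdbd.conf"


if __name__ == "__main__":
    line = (
        "slurmctld: error: slurm_persist_conn_open: failed to send persistent "
        "connection init message to se01.grid.tuc.gr:3306"
    )
    endpoint = find_persist_endpoint(line)
    if endpoint is not None:
        host, port = endpoint
        print(f"{host}:{port} -> {describe_port(port)}")
```

If the hint applies, the values to compare would be AccountingStoragePort in slurm.conf against DbdPort in slurmdbd.conf; the MySQL port itself belongs in StoragePort in slurmdbd.conf.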