Hello everyone,

I installed Slurm 19.05.5 from the Ubuntu repo for the first time, on a
cluster with 44 identical nodes, but I have a problem with slurmctld.service.

When I try to start slurmctld I get the following message:

 

fatal: You are running with a database but for some reason we have no TRES
from it.  This should only happen if the database is down and you don't have
any state files

 

* Ubuntu 20.04.2 runs on the server and on all nodes, in exactly the same
  version.
* munge 0.5.13, installed from the Ubuntu repo, runs on the server and nodes.
* mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu)),
  installed from the Ubuntu repo, runs on the server.

 

slurm.conf is the same on all nodes and on the server.

slurmd.service is active and running on all nodes without problems.

mysql.service is active and running on the server.

slurmdbd.service is active and running on the server (slurm_acct_db created).

 

Please find attached slurm.conf, slurmdbd.conf, and the detailed output of
the slurmctld -Dvvvv command.

 

Any hints?

Thanks in advance,

jb

slurmctld: debug:  Log file re-opened
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: slurmctld version 19.05.5 started on cluster tuc
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/cred_munge.so
slurmctld: Munge credential signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so
slurmctld: debug:  Munge authentication plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cons_tres.so
slurmctld: select/cons_tres loaded with argument 4372
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_linear.so
slurmctld: Linear node selection plugin loaded with argument 4372
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cray_aries.so
slurmctld: Cray/Aries node selection plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cons_res.so
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 4372
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/gres_gpu.so
slurmctld: debug:  init: Gres GPU plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: debug:  Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_energy_none.so
slurmctld: debug:  AcctGatherEnergy NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_profile_none.so
slurmctld: debug:  AcctGatherProfile NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_interconnect_none.so
slurmctld: debug:  AcctGatherInterconnect NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_filesystem_none.so
slurmctld: debug:  AcctGatherFilesystem NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/jobacct_gather_cgroup.so
slurmctld: debug:  Job accounting gather cgroup plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/ext_sensors_none.so
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/switch_none.so
slurmctld: debug:  switch NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug:  power_save module disabled, SuspendTime < 0
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/accounting_storage_slurmdbd.so
slurmctld: Accounting storage SLURMDBD plugin loaded
slurmctld: debug3: Success.
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 104 of 255 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: persistent connection experienced an error
slurmctld: error: Persistent Conn: only read 0 of 65385 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 104 of 255 bytes
slurmctld: error: persistent connection experienced an error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: persistent connection experienced an error
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: error: persistent connection experienced an error
slurmctld: error: Persistent Conn: only read 0 of 65385 bytes
slurmctld: error: Persistent Conn: read: No error
slurmctld: error: Persistent Conn: only read 105 of 1761607680 bytes
slurmctld: debug:  slurm_send_timeout: Socket POLLERR
slurmctld: debug3: slurm_msg_sendto: peer has disappeared for msg_type=6500
slurmctld: error: slurm_persist_conn_open: failed to send persistent connection init message to se01.grid.tuc.gr:3306
slurmctld: error: slurm_persist_conn_open: No response to persist_init
slurmctld: error: slurmdbd: Sending PersistInit msg: Transport endpoint is not connected
slurmctld: debug4: slurmdbd: There is no state save file to open by name /var/spool/slurm/ctld/dbd.messages
slurmctld: debug:  Association database appears down, reading from state file.
slurmctld: debug:  create_mmap_buf: Failed to mmap file `/var/spool/slurm/ctld/last_tres`, No such device
slurmctld: debug2: No last_tres file (/var/spool/slurm/ctld/last_tres) to recover
slurmctld: debug:  create_mmap_buf: Failed to mmap file `/var/spool/slurm/ctld/assoc_mgr_state`, No such device
slurmctld: debug2: No association state file (/var/spool/slurm/ctld/assoc_mgr_state) to recover
slurmctld: fatal: You are running with a database but for some reason we have no TRES from it.  This should only happen if the database is down and you don't have any state files.

Attachment: slurm.conf
Description: Binary data

Attachment: slurmdbd.conf
Description: Binary data
