Hi Sean,
Thank you for your prompt response. I made the changes you suggested, but slurmctld still refuses to run. Please find attached the new slurmctld -Dvvvv output.

jb

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Monday, April 5, 2021 11:46 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error

Hi Jb,

You have set AccountingStoragePort to 3306 in slurm.conf, which is the port MySQL listens on on the DBD host. AccountingStoragePort is the port for the Slurmdbd service, not for MySQL. Change AccountingStoragePort to 6819 and it should fix your issues.

I also think you should comment out the lines

AccountingStorageUser=slurm
AccountingStoragePass=/run/munge/munge.socket.2

You shouldn't need those lines.

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

On Mon, 5 Apr 2021 at 18:03, Ioannis Botsis <ibot...@isc.tuc.gr> wrote:

Hello everyone,

I installed Slurm 19.05.5 from the Ubuntu repo, for the first time, in a cluster with 44 identical nodes, but I have a problem with slurmctld.service.

When I try to activate slurmctld I get the following message:

fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files

* Ubuntu 20.04.2 runs on the server and nodes in the exact same version.
* munge 0.5.13 installed from Ubuntu repo running on server and nodes.
* mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu)) installed from Ubuntu repo running on server.

slurm.conf is the same on all nodes and on the server.

slurmd.service is active and running on all nodes without problem.
mysql.service is active and running on the server.
slurmdbd.service is active and running on the server (slurm_acct_db created).

Find attached slurm.conf, slurmdbd.conf and the detailed output of the slurmctld -Dvvvv command.

Any hint?

Thanks in advance

jb

slurmctld: debug: Log file re-opened
slurmctld: slurmctld version 19.05.5 started on cluster tuc
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/cred_munge.so
slurmctld: Munge credential signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/auth_munge.so
slurmctld: debug: Munge authentication plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cons_tres.so
slurmctld: select/cons_tres loaded with argument 4372
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_linear.so
slurmctld: Linear node selection plugin loaded with argument 4372
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cray_aries.so
slurmctld: Cray/Aries node selection plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/select_cons_res.so
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 4372
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/gres_gpu.so
slurmctld: debug: init: Gres GPU plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: debug: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_energy_none.so
slurmctld: debug: AcctGatherEnergy NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_profile_none.so
slurmctld: debug: AcctGatherProfile NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_interconnect_none.so
slurmctld: debug: AcctGatherInterconnect NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/acct_gather_filesystem_none.so
slurmctld: debug: AcctGatherFilesystem NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/jobacct_gather_cgroup.so
slurmctld: debug: Job accounting gather cgroup plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/ext_sensors_none.so
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/switch_none.so
slurmctld: debug: switch NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug: power_save module disabled, SuspendTime < 0
slurmctld: debug3: Trying to load plugin /usr/lib/x86_64-linux-gnu/slurm-wlm/accounting_storage_slurmdbd.so
slurmctld: Accounting storage SLURMDBD plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 10.0.0.100:6819: Connection refused
slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to se01:6819: Connection refused
slurmctld: error: slurmdbd: Sending PersistInit msg: Connection refused
slurmctld: debug: Association database appears down, reading from state file.
slurmctld: debug: create_mmap_buf: Failed to mmap file `/var/spool/slurm/ctld/last_tres`, No such device
slurmctld: debug2: No last_tres file (/var/spool/slurm/ctld/last_tres) to recover
slurmctld: debug: create_mmap_buf: Failed to mmap file `/var/spool/slurm/ctld/assoc_mgr_state`, No such device
slurmctld: debug2: No association state file (/var/spool/slurm/ctld/assoc_mgr_state) to recover
slurmctld: fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files.
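
The log above shows slurmctld being refused at 10.0.0.100:6819 (se01:6819), which usually means nothing is accepting TCP connections at that address and port. One way to check this independently of Slurm is a plain TCP connect test run from the controller host. The following is only a rough sketch: the helper name port_is_open is made up for illustration, and the host and port in the usage note are simply the ones that appear in the log.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration only; it is not part of Slurm.
    """
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False
```

For example, port_is_open("se01", 6819) returning False from the machine that runs slurmctld would be consistent with the "Connection refused" lines in the log. It would not say why the connection fails, only that the slurmdbd port cannot be reached from there.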