I think I've worked out the problem. In your slurm.conf you have this:

SlurmdSpoolDir=/var/spool/slurm/d

It should be:

SlurmdSpoolDir=/var/spool/slurmd

You'll need to restart slurmd on all the nodes after you make that change. I would also double-check the permissions on that directory on all your nodes. It needs to be owned by user slurm:

ls -lad /var/spool/slurmd

Sean

--
Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Tue, 6 Apr 2021 at 20:37, Sean Crosby <scro...@unimelb.edu.au> wrote:

> It looks like your ctld isn't contacting the slurmdbd properly. The control
> host, control port etc. are all blank.
>
> The first thing I would do is change the ClusterName in your slurm.conf
> from upper case TUC to lower case tuc. You'll then need to restart your
> ctld, then recheck sacctmgr show cluster.
>
> If that doesn't work, try changing AccountingStorageHost in slurm.conf to
> localhost as well.
>
> For your worker nodes: your nodes are all in drain state.
>
> Show the output of
>
> scontrol show node wn001
>
> It will give you the reason why the node is drained.
>
> Sean
>
> --
> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Tue, 6 Apr 2021 at 20:19, <ibot...@isc.tuc.gr> wrote:
>
>> UoM notice: External email.
>> Be cautious of links, attachments, or impersonation attempts
>> ------------------------------
>>
>> sinfo -N -o "%N %T %C %m %P %a"
>>
>> NODELIST STATE CPUS(A/I/O/T) MEMORY PARTITION AVAIL
>> wn001 drained 0/0/2/2 3934 TUC* up
>> wn002 drained 0/0/2/2 3934 TUC* up
>> wn003 drained 0/0/2/2 3934 TUC* up
>> wn004 drained 0/0/2/2 3934 TUC* up
>> wn005 drained 0/0/2/2 3934 TUC* up
>> wn006 drained 0/0/2/2 3934 TUC* up
>> wn007 drained 0/0/2/2 3934 TUC* up
>> wn008 drained 0/0/2/2 3934 TUC* up
>> wn009 drained 0/0/2/2 3934 TUC* up
>> wn010 drained 0/0/2/2 3934 TUC* up
>> wn011 drained 0/0/2/2 3934 TUC* up
>> wn012 drained 0/0/2/2 3934 TUC* up
>> wn013 drained 0/0/2/2 3934 TUC* up
>> wn014 drained 0/0/2/2 3934 TUC* up
>> wn015 drained 0/0/2/2 3934 TUC* up
>> wn016 drained 0/0/2/2 3934 TUC* up
>> wn017 drained 0/0/2/2 3934 TUC* up
>> wn018 drained 0/0/2/2 3934 TUC* up
>> wn019 drained 0/0/2/2 3934 TUC* up
>> wn020 drained 0/0/2/2 3934 TUC* up
>> wn021 drained 0/0/2/2 3934 TUC* up
>> wn022 drained 0/0/2/2 3934 TUC* up
>> wn023 drained 0/0/2/2 3934 TUC* up
>> wn024 drained 0/0/2/2 3934 TUC* up
>> wn025 drained 0/0/2/2 3934 TUC* up
>> wn026 drained 0/0/2/2 3934 TUC* up
>> wn027 drained 0/0/2/2 3934 TUC* up
>> wn028 drained 0/0/2/2 3934 TUC* up
>> wn029 drained 0/0/2/2 3934 TUC* up
>> wn030 drained 0/0/2/2 3934 TUC* up
>> wn031 drained 0/0/2/2 3934 TUC* up
>> wn032 drained 0/0/2/2 3934 TUC* up
>> wn033 drained 0/0/2/2 3934 TUC* up
>> wn034 drained 0/0/2/2 3934 TUC* up
>> wn035 drained 0/0/2/2 3934 TUC* up
>> wn036 drained 0/0/2/2 3934 TUC* up
>> wn037 drained 0/0/2/2 3934 TUC* up
>> wn038 drained 0/0/2/2 3934 TUC* up
>> wn039 drained 0/0/2/2 3934 TUC* up
>> wn040 drained 0/0/2/2 3934 TUC* up
>> wn041 drained 0/0/2/2 3934 TUC* up
>> wn042 drained 0/0/2/2 3934 TUC* up
>> wn043 drained 0/0/2/2 3934 TUC* up
>> wn044 drained 0/0/2/2 3934 TUC* up
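All 44 nodes report drained; once the root cause is fixed, they can be returned to service. A minimal sketch, with the node range taken from the listing above (note that scontrol also accepts the bracketed form `wn[001-044]` directly, so the loop is only illustrative):

```shell
# Build one resume command per node (wn001..wn044, matching the sinfo listing).
# Review the generated list, then pipe it to `sh` only after the drain cause
# (e.g. the SlurmdSpoolDir path) has actually been fixed.
cmds=$(for i in $(seq -w 1 44); do
    echo "scontrol update NodeName=wn0$i State=RESUME"
done)
echo "$cmds" | head -n 1   # scontrol update NodeName=wn001 State=RESUME
```

Before resuming, `sinfo -R` (or `scontrol show node wn001`) shows the recorded drain reason; resuming without fixing the cause just drains the nodes again.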
>>
>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
>> Sent: Tuesday, April 6, 2021 12:47 PM
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: Re: [slurm-users] [EXT] slurmctld error
>>
>> It looks like your attachment of sinfo -R didn't come through.
>>
>> It also looks like your dbd isn't set up correctly.
>>
>> Can you also show the output of
>>
>> sacctmgr list cluster
>>
>> and
>>
>> scontrol show config | grep ClusterName
>>
>> Sean
>>
>> --
>> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>> On Tue, 6 Apr 2021 at 19:18, Ioannis Botsis <ibot...@isc.tuc.gr> wrote:
>>
>> Hi Sean,
>>
>> I am trying to submit a simple job, but it freezes:
>>
>> srun -n44 -l /bin/hostname
>> srun: Required node not available (down, drained or reserved)
>> srun: job 15 queued and waiting for resources
>> ^Csrun: Job allocation 15 has been revoked
>> srun: Force Terminated job 15
>>
>> The daemons are active and running on the server and on all nodes.
>>
>> The node definition in slurm.conf is:
>>
>> DefMemPerNode=3934
>> NodeName=wn0[01-44] CPUs=2 RealMemory=3934 Sockets=2 CoresPerSocket=2 State=UNKNOWN
>> PartitionName=TUC Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>
>> tail -10 /var/log/slurmdbd.log
>>
>> [2021-04-06T12:09:16.481] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
>> [2021-04-06T12:09:16.481] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
>> [2021-04-06T12:09:16.482] error: We should have gotten a new id: Table
>> 'slurm_acct_db.tuc_job_table' doesn't exist
>> [2021-04-06T12:09:16.482] error: It looks like the storage has gone away trying to reconnect
>> [2021-04-06T12:09:16.483] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
>> [2021-04-06T12:09:16.483] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
>> [2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
>> [2021-04-06T12:09:16.484] error: It looks like the storage has gone away trying to reconnect
>> [2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
>> [2021-04-06T12:09:16.485] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
>>
>> tail -10 /var/log/slurmctld.log
>>
>> [2021-04-06T12:09:35.701] debug: backfill: no jobs to backfill
>> [2021-04-06T12:09:42.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
>> [2021-04-06T12:10:00.042] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
>> [2021-04-06T12:10:05.701] debug: backfill: beginning
>> [2021-04-06T12:10:05.701] debug: backfill: no jobs to backfill
>> [2021-04-06T12:10:05.989] debug: sched: Running job scheduler
>> [2021-04-06T12:10:19.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
>> [2021-04-06T12:10:35.702] debug: backfill: beginning
>> [2021-04-06T12:10:35.702] debug: backfill: no jobs to backfill
>> [2021-04-06T12:10:37.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
>>
>> Attached sinfo -R.
>>
>> Any hint?
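A note on the slurmdbd errors above: slurmdbd prefixes its accounting tables with the cluster name (hence `tuc_job_table`), so "Table 'slurm_acct_db.tuc_job_table' doesn't exist" fits Sean's suggestion to match the ClusterName case between slurm.conf and the database. A sketch of the relevant slurm.conf accounting lines, with values assumed from this thread:

```
# slurm.conf accounting settings (values assumed from this thread)
ClusterName=tuc                   # lower case; dbd creates tables prefixed "tuc_"
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost   # or the dbd host, se01
AccountingStoragePort=6819        # slurmdbd's port, not MySQL's 3306
```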
>> jb
>>
>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
>> Sent: Tuesday, April 6, 2021 7:54 AM
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: Re: [slurm-users] [EXT] slurmctld error
>>
>> The other thing I notice in my slurmdbd.conf is that I have
>>
>> DbdAddr=localhost
>> DbdHost=localhost
>>
>> You can try changing your slurmdbd.conf to set those 2 values as well, to see if that gets slurmdbd to listen on port 6819.
>>
>> Sean
>>
>> --
>> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>> On Tue, 6 Apr 2021 at 14:31, Sean Crosby <scro...@unimelb.edu.au> wrote:
>>
>> Interesting. It looks like slurmdbd is not opening the 6819 port.
>>
>> What does
>>
>> ss -lntp | grep 6819
>>
>> show? Is something else using that port?
>>
>> You can also stop the slurmdbd service and run it in debug mode using
>>
>> slurmdbd -D -vvv
>>
>> Sean
>>
>> --
>> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>> On Tue, 6 Apr 2021 at 14:02, <ibot...@isc.tuc.gr> wrote:
>>
>> Hi Sean,
>>
>> ss -lntp | grep $(pidof slurmdbd) returns nothing...
>>
>> systemctl status slurmdbd.service
>>
>> ● slurmdbd.service - Slurm DBD accounting daemon
>>      Loaded: loaded (/lib/systemd/system/slurmdbd.service; enabled; vendor preset: enabled)
>>      Active: active (running) since Mon 2021-04-05 13:52:35 EEST; 16h ago
>>        Docs: man:slurmdbd(8)
>>     Process: 1453365 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
>>    Main PID: 1453375 (slurmdbd)
>>       Tasks: 1
>>      Memory: 5.0M
>>      CGroup: /system.slice/slurmdbd.service
>>              └─1453375 /usr/sbin/slurmdbd
>>
>> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Starting Slurm DBD accounting daemon...
>> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: slurmdbd.service: Can't open PID file /run/slurmdbd.pid (yet?) after start: Operation not permitted
>> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Started Slurm DBD accounting daemon.
>>
>> The file /run/slurmdbd.pid exists and contains the pid of slurmdbd...
>>
>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
>> Sent: Tuesday, April 6, 2021 12:49 AM
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: Re: [slurm-users] [EXT] slurmctld error
>>
>> What's the output of
>>
>> ss -lntp | grep $(pidof slurmdbd)
>>
>> on your dbd host?
>>
>> Sean
>>
>> --
>> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>> On Tue, 6 Apr 2021 at 05:00, <ibot...@isc.tuc.gr> wrote:
>>
>> Hi Sean,
>>
>> 10.0.0.100 is the dbd and ctld host, with name se01. The firewall is inactive...
>>
>> nc -nz 10.0.0.100 6819 || echo Connection not working
>>
>> gives me back... Connection not working
>>
>> jb
>>
>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
>> Sent: Monday, April 5, 2021 2:52 PM
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: Re: [slurm-users] [EXT] slurmctld error
>>
>> The error shows
>>
>> slurmctld: debug2: Error connecting slurm stream socket at 10.0.0.100:6819: Connection refused
>> slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to se01:6819: Connection refused
>>
>> Is 10.0.0.100 the IP address of the host running slurmdbd?
>>
>> If so, check the iptables firewall running on that host, and make sure the ctld server can access port 6819 on the dbd host.
>>
>> You can check this by running the following from the ctld host (requires the nmap-ncat package installed):
>>
>> nc -nz 10.0.0.100 6819 || echo Connection not working
>>
>> This will try connecting to port 6819 on the host 10.0.0.100; it outputs nothing if the connection works, and "Connection not working" otherwise.
>>
>> I would also test this on the DBD server itself.
>>
>> --
>> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>> On Mon, 5 Apr 2021 at 21:00, Ioannis Botsis <ibot...@isc.tuc.gr> wrote:
>>
>> Hi Sean,
>>
>> Thank you for your prompt response. I made the changes you suggested, but slurmctld refuses to run... Find attached the new slurmctld -Dvvvv output.
>>
>> jb
>>
>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
>> Sent: Monday, April 5, 2021 11:46 AM
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: Re: [slurm-users] [EXT] slurmctld error
>>
>> Hi Jb,
>>
>> You have set AccountingStoragePort to 3306 in slurm.conf, which is the MySQL port running on the DBD host.
>>
>> AccountingStoragePort is the port for the slurmdbd service, not for MySQL.
>>
>> Change AccountingStoragePort to 6819 and it should fix your issues.
>>
>> I also think you should comment out the lines
>>
>> AccountingStorageUser=slurm
>> AccountingStoragePass=/run/munge/munge.socket.2
>>
>> You shouldn't need those lines.
>>
>> Sean
>>
>> --
>> Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
>> Research Computing Services | Business Services
>> The University of Melbourne, Victoria 3010 Australia
>>
>> On Mon, 5 Apr 2021 at 18:03, Ioannis Botsis <ibot...@isc.tuc.gr> wrote:
>>
>> Hello everyone,
>>
>> I installed slurm 19.05.5 from the Ubuntu repo, for the first time, on a cluster with 44 identical nodes, but I have a problem with slurmctld.service.
>>
>> When I try to activate slurmctld I get the following message:
>>
>> fatal: You are running with a database but for some reason we have no TRES from it.
>> This should only happen if the database is down and you don't have any state files
>>
>> - Ubuntu 20.04.2 runs on the server and the nodes, in exactly the same version.
>> - munge 0.5.13, installed from the Ubuntu repo, running on the server and nodes.
>> - mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu)), installed from the Ubuntu repo, running on the server.
>>
>> slurm.conf is the same on all nodes and on the server.
>>
>> slurmd.service is active and running on all nodes without problems.
>>
>> mysql.service is active and running on the server.
>> slurmdbd.service is active and running on the server (slurm_acct_db created).
>>
>> Find attached slurm.conf, slurmdbd.conf, and the detailed output of the slurmctld -Dvvvv command.
>>
>> Any hint?
>>
>> Thanks in advance
>>
>> jb
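Pulling together the advice scattered through this thread, a minimal slurmdbd.conf sketch for the se01 host might look like the following. Every value here is an assumption to adapt; the MySQL credentials in particular are placeholders, not values from the thread:

```
# slurmdbd.conf sketch -- all values assumed, adapt before use
AuthType=auth/munge
DbdAddr=localhost          # as suggested above, so slurmdbd listens locally
DbdHost=localhost
DbdPort=6819               # must match AccountingStoragePort in slurm.conf
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306           # MySQL itself
StorageUser=slurm
StoragePass=CHANGE_ME      # placeholder credential
StorageLoc=slurm_acct_db
```

After editing, restart slurmdbd, confirm it is listening with `ss -lntp | grep 6819`, and only then restart slurmctld.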