I just checked my cluster and my spool dir is SlurmdSpoolDir=/var/spool/slurm (i.e. without the d at the end). It doesn't really matter, as long as the directory exists and has the correct permissions on all nodes.
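A quick way to check every node at once, as a rough sketch (this assumes passwordless ssh to the workers and the wn001-wn044 node names from the sinfo output later in this thread):

# print owner, group and mode of the spool dir on every worker node
for n in $(seq -w 1 44); do
  ssh wn0$n "stat -c '%n %U:%G %a' /var/spool/slurmd" 2>/dev/null || echo "wn0$n: dir missing or host unreachable"
done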
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Tue, 6 Apr 2021 at 20:52, Sean Crosby <scro...@unimelb.edu.au> wrote:

I think I've worked out the problem.

I see in your slurm.conf you have this:

SlurmdSpoolDir=/var/spool/slurm/d

It should be:

SlurmdSpoolDir=/var/spool/slurmd

You'll need to restart slurmd on all the nodes after you make that change.

I would also double-check the permissions on that directory on all your nodes. It needs to be owned by user slurm:

ls -lad /var/spool/slurmd

Sean


On Tue, 6 Apr 2021 at 20:37, Sean Crosby <scro...@unimelb.edu.au> wrote:

It looks like your ctld isn't contacting the slurmdbd properly. The control host, control port etc. are all blank.

The first thing I would do is change the ClusterName in your slurm.conf from upper case TUC to lower case tuc. You'll then need to restart your ctld. Then recheck sacctmgr show cluster.

If that doesn't work, try changing AccountingStorageHost in slurm.conf to localhost as well.

As for your worker nodes, they are all in drain state.

Show the output of:

scontrol show node wn001

It will give you the reason why the node is drained.

Sean
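Once the underlying problem is fixed and slurmd restarted, the nodes will stay drained until resumed by hand. A sketch of that last step (check each node's drain reason first, so a real hardware fault isn't masked):

# see why a node was drained, then clear the drain state
scontrol show node wn001 | grep -i reason
scontrol update NodeName=wn[001-044] State=RESUME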
On Tue, 6 Apr 2021 at 20:19, <ibot...@isc.tuc.gr> wrote:

sinfo -N -o "%N %T %C %m %P %a"

NODELIST STATE CPUS(A/I/O/T) MEMORY PARTITION AVAIL
wn001 drained 0/0/2/2 3934 TUC* up
wn002 drained 0/0/2/2 3934 TUC* up
wn003 drained 0/0/2/2 3934 TUC* up
wn004 drained 0/0/2/2 3934 TUC* up
wn005 drained 0/0/2/2 3934 TUC* up
wn006 drained 0/0/2/2 3934 TUC* up
wn007 drained 0/0/2/2 3934 TUC* up
wn008 drained 0/0/2/2 3934 TUC* up
wn009 drained 0/0/2/2 3934 TUC* up
wn010 drained 0/0/2/2 3934 TUC* up
wn011 drained 0/0/2/2 3934 TUC* up
wn012 drained 0/0/2/2 3934 TUC* up
wn013 drained 0/0/2/2 3934 TUC* up
wn014 drained 0/0/2/2 3934 TUC* up
wn015 drained 0/0/2/2 3934 TUC* up
wn016 drained 0/0/2/2 3934 TUC* up
wn017 drained 0/0/2/2 3934 TUC* up
wn018 drained 0/0/2/2 3934 TUC* up
wn019 drained 0/0/2/2 3934 TUC* up
wn020 drained 0/0/2/2 3934 TUC* up
wn021 drained 0/0/2/2 3934 TUC* up
wn022 drained 0/0/2/2 3934 TUC* up
wn023 drained 0/0/2/2 3934 TUC* up
wn024 drained 0/0/2/2 3934 TUC* up
wn025 drained 0/0/2/2 3934 TUC* up
wn026 drained 0/0/2/2 3934 TUC* up
wn027 drained 0/0/2/2 3934 TUC* up
wn028 drained 0/0/2/2 3934 TUC* up
wn029 drained 0/0/2/2 3934 TUC* up
wn030 drained 0/0/2/2 3934 TUC* up
wn031 drained 0/0/2/2 3934 TUC* up
wn032 drained 0/0/2/2 3934 TUC* up
wn033 drained 0/0/2/2 3934 TUC* up
wn034 drained 0/0/2/2 3934 TUC* up
wn035 drained 0/0/2/2 3934 TUC* up
wn036 drained 0/0/2/2 3934 TUC* up
wn037 drained 0/0/2/2 3934 TUC* up
wn038 drained 0/0/2/2 3934 TUC* up
wn039 drained 0/0/2/2 3934 TUC* up
wn040 drained 0/0/2/2 3934 TUC* up
wn041 drained 0/0/2/2 3934 TUC* up
wn042 drained 0/0/2/2 3934 TUC* up
wn043 drained 0/0/2/2 3934 TUC* up
wn044 drained 0/0/2/2 3934 TUC* up


From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Tuesday, April 6, 2021 12:47 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error

It looks like your attachment of sinfo -R didn't come through.

It also looks like your dbd isn't set up correctly.

Can you also show the output of:

sacctmgr list cluster

and:

scontrol show config | grep ClusterName

Sean
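For what it's worth, a compact way to see whether the ctld has registered with the dbd is to ask sacctmgr for just the relevant columns (a sketch; format= accepts the standard cluster fields):

sacctmgr show cluster format=Cluster,ControlHost,ControlPort,RPC
# ControlHost and ControlPort stay blank until the ctld registers successfully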
On Tue, 6 Apr 2021 at 19:18, Ioannis Botsis <ibot...@isc.tuc.gr> wrote:

Hi Sean,

I am trying to submit a simple job, but it freezes:

srun -n44 -l /bin/hostname
srun: Required node not available (down, drained or reserved)
srun: job 15 queued and waiting for resources
^Csrun: Job allocation 15 has been revoked
srun: Force Terminated job 15

The daemons are active and running on the server and all nodes.

The node definitions in slurm.conf are:

DefMemPerNode=3934
NodeName=wn0[01-44] CPUs=2 RealMemory=3934 Sockets=2 CoresPerSocket=2 State=UNKNOWN
PartitionName=TUC Nodes=ALL Default=YES MaxTime=INFINITE State=UP

tail -10 /var/log/slurmdbd.log

[2021-04-06T12:09:16.481] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.481] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
[2021-04-06T12:09:16.482] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.482] error: It looks like the storage has gone away trying to reconnect
[2021-04-06T12:09:16.483] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.483] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
[2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.484] error: It looks like the storage has gone away trying to reconnect
[2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.485] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port

tail -10 /var/log/slurmctld.log

[2021-04-06T12:09:35.701] debug: backfill: no jobs to backfill
[2021-04-06T12:09:42.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:00.042] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:05.701] debug: backfill: beginning
[2021-04-06T12:10:05.701] debug: backfill: no jobs to backfill
[2021-04-06T12:10:05.989] debug: sched: Running job scheduler
[2021-04-06T12:10:19.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:35.702] debug: backfill: beginning
[2021-04-06T12:10:35.702] debug: backfill: no jobs to backfill
[2021-04-06T12:10:37.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)

Attached sinfo -R.

Any hint?

jb
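A side note on the slurmdbd.log errors above: "Table 'slurm_acct_db.tuc_job_table' doesn't exist" typically means the cluster was never added to the accounting database, since slurmdbd creates the per-cluster tables at registration time. A hedged sketch of the fix, run on the dbd host once slurmdbd is reachable:

sacctmgr add cluster tuc       # creates the slurm_acct_db.tuc_* tables
systemctl restart slurmctld    # let the ctld re-register with the dbd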
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Tuesday, April 6, 2021 7:54 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error

The other thing I notice for my slurmdbd.conf is that I have:

DbdAddr=localhost
DbdHost=localhost

You can try changing your slurmdbd.conf to set those 2 values as well, to see if that gets slurmdbd to listen on port 6819.

Sean


On Tue, 6 Apr 2021 at 14:31, Sean Crosby <scro...@unimelb.edu.au> wrote:

Interesting. It looks like slurmdbd is not opening the 6819 port.

What does:

ss -lntp | grep 6819

show? Is something else using that port?

You can also stop the slurmdbd service and run it in debug mode using:

slurmdbd -D -vvv

Sean


On Tue, 6 Apr 2021 at 14:02, <ibot...@isc.tuc.gr> wrote:

Hi Sean,

ss -lntp | grep $(pidof slurmdbd) returns nothing.

systemctl status slurmdbd.service

● slurmdbd.service - Slurm DBD accounting daemon
     Loaded: loaded (/lib/systemd/system/slurmdbd.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-04-05 13:52:35 EEST; 16h ago
       Docs: man:slurmdbd(8)
    Process: 1453365 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1453375 (slurmdbd)
      Tasks: 1
     Memory: 5.0M
     CGroup: /system.slice/slurmdbd.service
             └─1453375 /usr/sbin/slurmdbd

Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Starting Slurm DBD accounting daemon...
Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: slurmdbd.service: Can't open PID file /run/slurmdbd.pid (yet?) after start: Operation not permitted
Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Started Slurm DBD accounting daemon.

The file /run/slurmdbd.pid exists and contains the pid of slurmdbd.


From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Tuesday, April 6, 2021 12:49 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error

What's the output of:

ss -lntp | grep $(pidof slurmdbd)

on your dbd host?

Sean
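For comparison, when slurmdbd is actually listening, that command returns a line roughly like the following (illustrative only; the pid and fd will differ):

ss -lntp | grep $(pidof slurmdbd)
# LISTEN 0 4096 0.0.0.0:6819 0.0.0.0:* users:(("slurmdbd",pid=1453375,fd=3))

An empty result, as here, means the daemon is running but never bound the port, which points at slurmdbd.conf rather than the network.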
On Tue, 6 Apr 2021 at 05:00, <ibot...@isc.tuc.gr> wrote:

Hi Sean,

10.0.0.100 is the dbd and ctld host, with name se01. The firewall is inactive.

nc -nz 10.0.0.100 6819 || echo Connection not working

gives me back: Connection not working

jb


From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Monday, April 5, 2021 2:52 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error

The error shows:

slurmctld: debug2: Error connecting slurm stream socket at 10.0.0.100:6819: Connection refused
slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to se01:6819: Connection refused

Is 10.0.0.100 the IP address of the host running slurmdbd?

If so, check the iptables firewall running on that host, and make sure the ctld server can access port 6819 on the dbd host.

You can check this by running the following from the ctld host (requires the package nmap-ncat installed):

nc -nz 10.0.0.100 6819 || echo Connection not working

This will try connecting to port 6819 on the host 10.0.0.100; it outputs nothing if the connection works, and "Connection not working" otherwise.

I would also test this on the DBD server itself.

Sean


On Mon, 5 Apr 2021 at 21:00, Ioannis Botsis <ibot...@isc.tuc.gr> wrote:

Hi Sean,

Thank you for your prompt response. I made the changes you suggested, but slurmctld still refuses to run. Find attached the new slurmctld -Dvvvv output.

jb


From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Monday, April 5, 2021 11:46 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] slurmctld error

Hi Jb,

You have set AccountingStoragePort to 3306 in slurm.conf, which is the MySQL port running on the DBD host.

AccountingStoragePort is the port for the slurmdbd service, not for MySQL.

Change AccountingStoragePort to 6819 and it should fix your issues.

I also think you should comment out the lines:

AccountingStorageUser=slurm
AccountingStoragePass=/run/munge/munge.socket.2

You shouldn't need those lines.

Sean
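Putting Sean's suggestions from this thread together, the accounting block of slurm.conf would end up looking roughly like this (a sketch assembled from the values mentioned above, not a verified config):

ClusterName=tuc
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=se01     # or localhost, per the earlier suggestion
AccountingStoragePort=6819     # the slurmdbd port, not MySQL's 3306
#AccountingStorageUser=slurm                      # not needed; comment out
#AccountingStoragePass=/run/munge/munge.socket.2  # not needed; comment out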
On Mon, 5 Apr 2021 at 18:03, Ioannis Botsis <ibot...@isc.tuc.gr> wrote:

Hello everyone,

I installed slurm 19.05.5 from the Ubuntu repo, for the first time, on a cluster with 44 identical nodes, but I have a problem with slurmctld.service.

When I try to activate slurmctld I get the following message:

fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files

- Ubuntu 20.04.2 runs on the server and the nodes, in exactly the same version.
- munge 0.5.13, installed from the Ubuntu repo, running on the server and the nodes.
- mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu)), installed from the Ubuntu repo, running on the server.

slurm.conf is the same on all nodes and on the server.

slurmd.service is active and running on all nodes without problems.

mysql.service is active and running on the server.

slurmdbd.service is active and running on the server (slurm_acct_db created).

Find attached slurm.conf, slurmdbd.conf and the detailed output of the slurmctld -Dvvvv command.

Any hint?

Thanks in advance

jb
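For anyone landing on this thread with the same fatal TRES error: it generally means slurmctld came up before it could reach a working slurmdbd. A hedged bring-up order to try once the config fixes above are in place:

systemctl restart mysql        # backing store first
systemctl restart slurmdbd     # then the accounting daemon
sacctmgr show cluster          # confirm the dbd answers before the ctld starts
systemctl restart slurmctld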