It looks like your attachment of sinfo -R didn't come through. It also looks like your dbd isn't set up correctly.

Can you also show the output of sacctmgr list cluster and scontrol show config | grep ClusterName?
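The "We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist" and "trying to register a cluster (tuc) with no remote port" errors in your slurmdbd log usually mean the cluster was never registered in the accounting database, so slurmdbd never created its tables. A minimal sketch of the check and fix, run on the dbd host and assuming the cluster name really is tuc as the log suggests:

sacctmgr list cluster                 # should list "tuc"; an empty list means it was never added
sacctmgr -i add cluster tuc           # registers the cluster; slurmdbd then creates the tuc_* tables
systemctl restart slurmdbd slurmctld  # let slurmctld re-register with the dbd

Treat that as a sketch rather than a definitive fix; the sacctmgr output you send back will confirm whether it applies.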
Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Tue, 6 Apr 2021 at 19:18, Ioannis Botsis <ibot...@isc.tuc.gr> wrote:

> Hi Sean,
>
> I am trying to submit a simple job, but it freezes:
>
> srun -n44 -l /bin/hostname
> srun: Required node not available (down, drained or reserved)
> srun: job 15 queued and waiting for resources
> ^Csrun: Job allocation 15 has been revoked
> srun: Force Terminated job 15
>
> The daemons are active and running on the server and on all nodes.
>
> The node definitions in slurm.conf are:
>
> DefMemPerNode=3934
> NodeName=wn0[01-44] CPUs=2 RealMemory=3934 Sockets=2 CoresPerSocket=2 State=UNKNOWN
> PartitionName=TUC Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> tail -10 /var/log/slurmdbd.log
>
> [2021-04-06T12:09:16.481] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
> [2021-04-06T12:09:16.481] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
> [2021-04-06T12:09:16.482] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
> [2021-04-06T12:09:16.482] error: It looks like the storage has gone away trying to reconnect
> [2021-04-06T12:09:16.483] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
> [2021-04-06T12:09:16.483] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
> [2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
> [2021-04-06T12:09:16.484] error: It looks like the storage has gone away trying to reconnect
> [2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
> [2021-04-06T12:09:16.485] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
>
> tail -10 /var/log/slurmctld.log
>
> [2021-04-06T12:09:35.701] debug: backfill: no jobs to backfill
> [2021-04-06T12:09:42.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
> [2021-04-06T12:10:00.042] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
> [2021-04-06T12:10:05.701] debug: backfill: beginning
> [2021-04-06T12:10:05.701] debug: backfill: no jobs to backfill
> [2021-04-06T12:10:05.989] debug: sched: Running job scheduler
> [2021-04-06T12:10:19.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
> [2021-04-06T12:10:35.702] debug: backfill: beginning
> [2021-04-06T12:10:35.702] debug: backfill: no jobs to backfill
> [2021-04-06T12:10:37.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
>
> Attached: sinfo -R output.
>
> Any hint?
>
> jb
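A quick aside on the srun symptom quoted above: "Required node not available (down, drained or reserved)" means every node in the TUC partition is currently in a non-usable state, so the job can only queue. Once whatever reason sinfo -R reports has been dealt with, inspecting and returning nodes to service usually looks roughly like this (wn001 is just one node from your wn0[01-44] range, picked for illustration):

sinfo -R                                             # reason, time and originator for each down/drained node
scontrol show node wn001 | grep -iE 'state|reason'   # per-node detail
scontrol update NodeName=wn0[01-44] State=RESUME     # only once the reported reason has actually been fixed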
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
> Sent: Tuesday, April 6, 2021 7:54 AM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] [EXT] slurmctld error
>
> The other thing I notice for my slurmdbd.conf is that I have
>
> DbdAddr=localhost
> DbdHost=localhost
>
> You can try changing your slurmdbd.conf to set those 2 values as well, to see if that gets slurmdbd to listen on port 6819.
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Tue, 6 Apr 2021 at 14:31, Sean Crosby <scro...@unimelb.edu.au> wrote:
>
> Interesting. It looks like slurmdbd is not opening the 6819 port.
>
> What does
>
> ss -lntp | grep 6819
>
> show? Is something else using that port?
>
> You can also stop the slurmdbd service and run it in debug mode using
>
> slurmdbd -D -vvv
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Tue, 6 Apr 2021 at 14:02, <ibot...@isc.tuc.gr> wrote:
>
> Hi Sean,
>
> ss -lntp | grep $(pidof slurmdbd) returns nothing.
>
> systemctl status slurmdbd.service
>
> ● slurmdbd.service - Slurm DBD accounting daemon
>    Loaded: loaded (/lib/systemd/system/slurmdbd.service; enabled; vendor preset: enabled)
>    Active: active (running) since Mon 2021-04-05 13:52:35 EEST; 16h ago
>      Docs: man:slurmdbd(8)
>   Process: 1453365 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 1453375 (slurmdbd)
>     Tasks: 1
>    Memory: 5.0M
>    CGroup: /system.slice/slurmdbd.service
>            └─1453375 /usr/sbin/slurmdbd
>
> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Starting Slurm DBD accounting daemon...
> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: slurmdbd.service: Can't open PID file /run/slurmdbd.pid (yet?) after start: Operation not permitted
> Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Started Slurm DBD accounting daemon.
>
> The file /run/slurmdbd.pid exists and contains the pidof slurmdbd value.
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
> Sent: Tuesday, April 6, 2021 12:49 AM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] [EXT] slurmctld error
>
> What's the output of
>
> ss -lntp | grep $(pidof slurmdbd)
>
> on your dbd host?
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Tue, 6 Apr 2021 at 05:00, <ibot...@isc.tuc.gr> wrote:
>
> Hi Sean,
>
> 10.0.0.100 is the dbd and ctld host, with the name se01. The firewall is inactive.
>
> nc -nz 10.0.0.100 6819 || echo Connection not working
>
> gives me back:
>
> Connection not working
>
> jb
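Since the exchange above keeps coming back to slurmdbd not listening on 6819, it may help to sketch what the relevant part of slurmdbd.conf normally looks like. The values below are illustrative, not a copy of this cluster's file; in particular, StoragePass here is the MySQL password for the slurm database user, not a munge socket path:

# slurmdbd.conf (illustrative sketch)
DbdAddr=localhost
DbdHost=localhost
DbdPort=6819                   # the port slurmctld's AccountingStoragePort must match
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=changeme           # MySQL password, not /run/munge/munge.socket.2
StorageLoc=slurm_acct_db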
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
> Sent: Monday, April 5, 2021 2:52 PM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] [EXT] slurmctld error
>
> The error shows
>
> slurmctld: debug2: Error connecting slurm stream socket at 10.0.0.100:6819: Connection refused
> slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to se01:6819: Connection refused
>
> Is 10.0.0.100 the IP address of the host running slurmdbd?
>
> If so, check the iptables firewall running on that host, and make sure the ctld server can access port 6819 on the dbd host.
>
> You can check this by running the following from the ctld host (requires the package nmap-ncat installed):
>
> nc -nz 10.0.0.100 6819 || echo Connection not working
>
> This will try connecting to port 6819 on the host 10.0.0.100; it outputs nothing if the connection works, and "Connection not working" otherwise.
>
> I would also test this on the DBD server itself.
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Mon, 5 Apr 2021 at 21:00, Ioannis Botsis <ibot...@isc.tuc.gr> wrote:
>
> Hi Sean,
>
> Thank you for your prompt response. I made the changes you suggested, but slurmctld still refuses to run. Find attached the new slurmctld -Dvvvv output.
>
> jb
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Sean Crosby
> Sent: Monday, April 5, 2021 11:46 AM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] [EXT] slurmctld error
>
> Hi Jb,
>
> You have set AccountingStoragePort to 3306 in slurm.conf, which is the MySQL port running on the DBD host.
>
> AccountingStoragePort is the port for the slurmdbd service, not for MySQL.
>
> Change AccountingStoragePort to 6819 and it should fix your issues.
>
> I also think you should comment out the lines
>
> AccountingStorageUser=slurm
> AccountingStoragePass=/run/munge/munge.socket.2
>
> You shouldn't need those lines.
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Mon, 5 Apr 2021 at 18:03, Ioannis Botsis <ibot...@isc.tuc.gr> wrote:
>
> Hello everyone,
>
> I installed Slurm 19.05.5 from the Ubuntu repo, for the first time, on a cluster with 44 identical nodes, but I have a problem with slurmctld.service.
>
> When I try to activate slurmctld I get the following message:
>
> fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files
>
> - Ubuntu 20.04.2 runs on the server and the nodes, in exactly the same version.
> - munge 0.5.13, installed from the Ubuntu repo, running on the server and the nodes.
> - mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu)), installed from the Ubuntu repo, running on the server.
> slurm.conf is the same on all nodes and on the server.
>
> slurmd.service is active and running on all nodes without problems.
>
> mysql.service is active and running on the server.
>
> slurmdbd.service is active and running on the server (slurm_acct_db created).
>
> Find attached slurm.conf, slurmdbd.conf and the detailed output of the slurmctld -Dvvvv command.
>
> Any hint?
>
> Thanks in advance
>
> jb
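For completeness: if the AccountingStorage* advice from earlier in the thread is applied, the accounting-related part of slurm.conf would end up looking roughly like the sketch below. The host and cluster names are taken from this thread; treat the rest as illustrative rather than a copy of the real file:

# slurm.conf, accounting section (sketch)
ClusterName=tuc                         # must match the cluster registered with sacctmgr
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=se01              # the host running slurmdbd (10.0.0.100 here)
AccountingStoragePort=6819              # slurmdbd's port, not MySQL's 3306
#AccountingStorageUser=slurm            # not needed when going through slurmdbd
#AccountingStoragePass=/run/munge/munge.socket.2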