A few things to check here:
* Ensure that your firewall ports are open – ports 6817/6818/6819/3306
* Make sure that munge is working correctly: $ munge -n | unmunge
* Make sure you go through the accounting web-page as well - https://slurm.schedmd.com/accounting.html
* In particular, ensure that you can connect to the MySQL server, create the slurm user within the MySQL database, give it the required permissions, etc. Go through the “Live example” on the accounting web-page.
* Walk through your log files, especially the slurmdbd.log file, and clear up all errors.
* As a general comment, put as few configuration options as possible into your slurm.conf and slurmdbd.conf files – use the defaults when you can. Add items incrementally and carefully so you can back out easily when you make mistakes (and you will!)
* In my slurm.conf, I have also specified AccountingStorageHost, AccountingStorageUser and AccountingStoragePort – not sure if I need any of these though…

Mike

From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of slurm-users-requ...@lists.schedmd.com <slurm-users-requ...@lists.schedmd.com>
Date: Tuesday, February 2, 2021 at 8:16 AM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: slurm-users Digest, Vol 40, Issue 4

Today's Topics:

   1. Slurm - sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused (Zainul Abiddin)
   2.
      Re: Slurm - Munge configuration details (Benson Muite)

----------------------------------------------------------------------

Message: 1
Date: Tue, 2 Feb 2021 18:35:20 +0530
From: Zainul Abiddin <zainul1...@gmail.com>
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Slurm - sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
Message-ID: <caa9r82u0l7vdzdhvp_1kfwmvrll-cc5vhavr2sgtuwn_1ax...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi All,

I have configured slurmdbd, and when I try to run the account manager with *sacct* I get the error below.

[root@smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root@smaster ~]#

My slurmdbd configuration:

[root@smaster ~]# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=password
StorageUser=slurm
StorageLoc=slurm_acct_db
[root@smaster ~]# chown slurm: /etc/slurm/slurmdbd.conf
[root@smaster ~]# chmod 600 /etc/slurm/slurmdbd.conf
[root@smaster ~]# mkdir /var/log/slurm
[root@smaster ~]# touch /var/log/slurm/slurmdbd.log
[root@smaster ~]# chown slurm: /var/log/slurm/slurmdbd.log
[root@smaster ~]# scontrol show config | grep AccountingStorageHost
AccountingStorageHost = localhost

Note: I have edited /etc/slurm/slurm.conf and modified the line below:

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd

Then I restarted all the services:

[root@smaster ~]# for i in munge slurmd slurmctld slurmdbd; do service $i status; done
Redirecting to /bin/systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 36min ago
     Docs: man:munged(8)
 Main PID: 20613 (munged)
   CGroup: /system.slice/munge.service
           └─20613 /usr/sbin/munged

Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Stopped MUNGE authentication service.
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Starting MUNGE authentication service...
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started MUNGE authentication service.
Redirecting to /bin/systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 36min ago
 Main PID: 20637 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─20637 /usr/sbin/slurmd -D

Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started Slurm node daemon.
Feb 02 15:30:47 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 7 for UID 0
Feb 02 15:31:46 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 8 for UID 0
Feb 02 15:33:43 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 9 for UID 0
Redirecting to /bin/systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:11 IST; 3h 36min ago
 Main PID: 20660 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─20660 /usr/sbin/slurmctld -D

Feb 02 13:21:11 smaster.calligotech.com systemd[1]: Started Slurm controller daemon.
Redirecting to /bin/systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 16:29:11 IST; 28min ago
 Main PID: 24146 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─24146 /usr/sbin/slurmdbd -D

Feb 02 16:29:11 smaster.calligotech.com systemd[1]: Started Slurm DBD accounting daemon.
[root@smaster ~]# srun --ntasks=2 --label /bin/hostname
srun: job 22 queued and waiting for resources
srun: job 22 has been allocated resources
1: smaster.calligotech.com
0: smaster.calligotech.com
[root@smaster ~]#

However, when I run the command below:

[root@smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root@smaster ~]#

I have tried the troubleshooting steps below:

[root@smaster ~]# telnet localhost 6819
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
telnet: connect to address 127.0.0.1: Connection refused
[root@smaster ~]#
[root@smaster ~]# mysql -p -u slurm slurm_acct_db
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 9
Server version: 10.1.48-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [slurm_acct_db]> show tables;
Empty set (0.00 sec)

MariaDB [slurm_acct_db]>

Then I added DbdPort and restarted the services:

[root@smaster ~]# cat /etc/slurm/slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
*DbdPort=6819*
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=password
StorageUser=slurm
StorageLoc=slurm_acct_db
[root@smaster ~]#
[root@smaster ~]# for i in munge slurmd slurmctld slurmdbd; do service $i status; done
Redirecting to /bin/systemctl status munge.service
● munge.service - MUNGE authentication service
   Loaded: loaded (/usr/lib/systemd/system/munge.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 55min ago
     Docs: man:munged(8)
 Main PID: 20613 (munged)
   CGroup: /system.slice/munge.service
           └─20613 /usr/sbin/munged

Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Stopped MUNGE authentication service.
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Starting MUNGE authentication service...
Feb 02 13:21:10 smaster.calligotech.com systemd[1]: Started MUNGE authentication service.
Redirecting to /bin/systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:10 IST; 3h 55min ago
 Main PID: 20637 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─20637 /usr/sbin/slurmd -D

Feb 02 15:30:47 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 7 for UID 0
Feb 02 15:31:46 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 8 for UID 0
Feb 02 15:33:43 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 9 for UID 0
Feb 02 15:38:45 smaster.calligotech.com slurmd[20637]: slurmd: Launching batch job 12 for UID 0
Redirecting to /bin/systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 13:21:11 IST; 3h 55min ago
 Main PID: 20660 (slurmctld)
   CGroup: /system.slice/slurmctld.service
           └─20660 /usr/sbin/slurmctld -D

Feb 02 13:21:11 smaster.calligotech.com systemd[1]: Started Slurm controller daemon.
Redirecting to /bin/systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2021-02-02 16:29:11 IST; 47min ago
 Main PID: 24146 (slurmdbd)
   CGroup: /system.slice/slurmdbd.service
           └─24146 /usr/sbin/slurmdbd -D

Feb 02 16:29:11 smaster.calligotech.com systemd[1]: Started Slurm DBD accounting daemon.
[root@smaster ~]# ps -ef |grep slurm
root     20637     1  0 13:21 ?        00:00:00 /usr/sbin/slurmd -D
slurm    20660     1  0 13:21 ?        00:00:08 /usr/sbin/slurmctld -D
root     24146     1  0 16:29 ?        00:00:00 /usr/sbin/slurmdbd -D
root     25395 18378  0 17:17 pts/2    00:00:00 grep --color=auto slurm
[root@smaster ~]# sacct
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:localhost:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
[root@smaster ~]#
[root@smaster ~]# tail /var/log/slurm/slurmdbd.log
[2021-02-02T17:16:01.913] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:01.913] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:06.963] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:06.963] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:12.083] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:12.083] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:17.140] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:17.141] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[2021-02-02T17:16:22.804] error: mysql_real_connect failed: 2005 Unknown MySQL server host 'smater' (-2)
[2021-02-02T17:16:22.804] error: The database must be up when starting the MYSQL plugin.  Trying again in 5 seconds.
[root@smaster ~]#

The problem still remains the same. Please help me resolve this issue.

Regards,
Zain

------------------------------

Message: 2
Date: Tue, 2 Feb 2021 16:16:09 +0300
From: Benson Muite <benson_mu...@emailplus.org>
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Slurm - Munge configuration details
Message-ID: <bd36d545-4fd7-05ec-4a51-bb2743258...@emailplus.org>
Content-Type: text/plain; charset=utf-8; format=flowed

On 2/2/21 4:00 PM, Zainul Abiddin wrote:
> Hi Benson,
>
> I am not able to do passwordless ssh between the master and compute nodes
> using the Munge service.
> When I run the command below, it asks for a password for
> the compute node.
>
> /Am I configuring this properly or not? I need clarity on this./
>
> [root@smaster ~]# munge -n | ssh snode unmunge
> root@snode's password:
> STATUS:           Success (0)
> ENCODE_HOST:      smaster.calligotech.com (192.168.1.195)
> ENCODE_TIME:      2021-02-01 13:58:16 +0530 (1612168096)
> DECODE_TIME:      2021-02-01 13:58:21 +0530 (1612168101)
> TTL:              300
> CIPHER:           aes128 (4)
> MAC:              sha1 (3)
> ZIP:              none (0)
> UID:              root (0)
> GID:              root (0)
> LENGTH:           0
>
> [root@smaster ~]#
>
> Regards,
> Zain
>

Hi Zain,

Perhaps try using the IP address instead of the hostname? Also, are the clocks synchronized? See https://slurm.schedmd.com/quickstart_admin.html

Benson
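
One observation on Message 1, offered as a guess rather than a diagnosis: the slurmdbd.log excerpt complains about a MySQL host named 'smater', while the slurmdbd.conf shown has StorageHost=localhost. A small shell sketch like the one below can make that kind of mismatch easy to spot. The helper names conf_value and failed_hosts are hypothetical (they are not part of Slurm), and the key match is case-sensitive, which is a simplifying assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical helpers for comparing slurmdbd.conf with slurmdbd.log.
# They only read files; nothing is restarted or modified.

# conf_value FILE KEY
# Print the value assigned to KEY in a KEY=value style file.
# If KEY appears more than once, the last assignment is printed.
conf_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# failed_hosts LOG
# Print each distinct host name that appears in an
# "Unknown MySQL server host '<name>'" log message.
failed_hosts() {
    sed -n "s/.*Unknown MySQL server host '\([^']*\)'.*/\1/p" "$1" | sort -u
}
```

For example, run `conf_value /etc/slurm/slurmdbd.conf StorageHost` and `failed_hosts /var/log/slurm/slurmdbd.log` (paths as given in the message above) and compare the two results. If they disagree, one possible explanation is that slurmdbd is still running with an older copy of the configuration and has not actually been restarted since the file was edited; the status output above only shows the services being queried, so that would be worth ruling out.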