Step back from slurm and confirm that MariaDB is up and responsive.

    # mysql -uroot -p
    Enter password:
    Welcome to the MariaDB monitor.  Commands end with ; or \g.
    Your MariaDB connection id is 8
    Server version: 10.2.9-MariaDB MariaDB Server

    Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

    MariaDB [(none)]> select table_schema, table_name from information_schema.tables;

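A scripted version of the "is it up?" check can help when repeating it after each configuration change. The following is only a minimal sketch: `port_is_open` is a hypothetical helper name, 3306 is MariaDB's default TCP port (an assumption about this install, since the thread only shows a local `mysql` client login), and 6819 is the slurmdbd port that appears in the logs quoted in this thread.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # 3306: MariaDB default port (assumed); 6819: slurmdbd port from the logs.
    for name, port in (("MariaDB", 3306), ("slurmdbd", 6819)):
        state = "reachable" if port_is_open("localhost", port) else "NOT reachable"
        print(f"{name} on localhost:{port}: {state}")
```

An open port only shows that something is listening; it does not prove that the daemon accepts logins or answers queries, so the interactive `mysql` session above remains the more meaningful test.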
On Wednesday, November 29, 2017 10:17 AM, Bruno Santos <bacmsan...@gmail.com> wrote:

Hi Barbara,

This is a fresh install. I have installed slurm from source on Debian stretch and am now trying to set it up correctly.
MariaDB is running, but I am confused about the database configuration. I followed a tutorial (I can no longer find it) that showed me how to create the database and give it to the slurm user on mysql.
I haven't really done anything further than that, as running anything returns the same errors:

    root@plantae:~# sacctmgr show user -s
    sacctmgr: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
    sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
    sacctmgr: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
    sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
    sacctmgr: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
    sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
    sacctmgr: error: slurmdbd: DBD_GET_USERS failure: No error
     Problem with query.

On 29 November 2017 at 14:46, Barbara Krašovec <barbara.kraso...@ijs.si> wrote:

Did you upgrade SLURM or is it a fresh install?

Are there any associations set? For instance, did you create the cluster with sacctmgr?

    sacctmgr add cluster <name>

Is the mariadb/mysql server running, is slurmdbd running? Is it working? Try a simple test, such as:

    sacctmgr show user -s

If it was an upgrade, did you try to run slurmdbd and slurmctld manually first:

    slurmdbd -Dvvvvv

Then the controller:

    slurmctld -Dvvvvv

Which OS is that?
Is there a firewall/selinux/ACLs?

Cheers,
Barbara

On 29 Nov 2017, at 15:19, Bruno Santos <bacmsan...@gmail.com> wrote:

Thank you Barbara,

Unfortunately, it does not seem to be a munge problem. Munge can successfully authenticate with the nodes.

I have increased the verbosity level and restarted slurmctld, and now I am getting more information about this:

    Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port 6817 with slurmdbd.
    Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
    Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit msg: No error
    Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
    Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit msg: No error
    Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.

The problem seems to somehow be related to slurmdbd?
I am a bit lost at this point, to be honest.

Best,
Bruno

On 29 November 2017 at 14:06, Barbara Krašovec <barbara.kraso...@ijs.si> wrote:

Hello,

does munge work?

Try if decode works locally:

    munge -n | unmunge

Try if decode works remotely:

    munge -n | ssh <somehost_in_cluster> unmunge

It seems as if the munge keys do not match...

See comments inline..

On 29 Nov 2017, at 14:40, Bruno Santos <bacmsan...@gmail.com> wrote:

I actually just managed to figure that one out.
The problem was that I had set AccountingStoragePass=magic in the slurm.conf file, while after re-reading the documentation it seems this is only needed if I have a different munge instance controlling the logins to the database, which I don't.
So commenting that line out seems to have worked; however, I am now getting a different error:

    Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 with slurmdbd.
    Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
    Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
    Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed state.
    Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.

My slurm.conf looks like this:

    # LOGGING AND ACCOUNTING
    AccountingStorageHost=localhost
    AccountingStorageLoc=slurm_db
    #AccountingStoragePass=magic
    #AccountingStoragePort=
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageUser=slurm
    AccountingStoreJobComment=YES
    ClusterName=research
    JobCompType=jobcomp/none
    JobAcctGatherFrequency=30
    JobAcctGatherType=jobacct_gather/none
    SlurmctldDebug=3
    SlurmdDebug=3

You only need:

    AccountingStorageEnforce=associations,limits,qos
    AccountingStorageHost=<hostname>
    AccountingStorageType=accounting_storage/slurmdbd

You can remove AccountingStorageLoc and AccountingStorageUser.

And the slurmdbd.conf like this:

    ArchiveEvents=yes
    ArchiveJobs=yes
    ArchiveResvs=yes
    ArchiveSteps=no
    #ArchiveTXN=no
    #ArchiveUsage=no
    # Authentication info
    AuthType=auth/munge
    AuthInfo=/var/run/munge/munge.socket.2
    #Database info
    # slurmDBD info
    DbdAddr=plantae
    DbdHost=plantae
    # Database info
    StorageType=accounting_storage/mysql
    StorageHost=localhost
    SlurmUser=slurm
    StoragePass=magic
    StorageUser=slurm
    StorageLoc=slurm_db

Thank you very much in advance.

Best,
Bruno

Cheers,
Barbara

On 29 November 2017 at 13:28, Andy Riebs <andy.ri...@hpe.com> wrote:

It looks like you don't have the munged daemon running.

On 11/29/2017 08:01 AM, Bruno Santos wrote:

Hi everyone,

I have set up slurm to use slurm_db and all was working fine. However, I had to change the slurm.conf to play with user priority, and upon restarting, slurmctld fails with the messages below. It seems that it is somehow trying to use the mysql password as a munge socket?
Any idea how to solve it?

    Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817 with slurmdbd.
    Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
    Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
    Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket communication error
    Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
    Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
    Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
    Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
    Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket communication error
    Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
    Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
    Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
    Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
    Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket communication error
    Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
    Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
    Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.
    Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
    Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed state.
    Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
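
The first failure in this thread came from AccountingStoragePass=magic in slurm.conf: with AccountingStorageType=accounting_storage/slurmdbd the log shows the value being opened as a munge socket path ('Failed to access "magic"'). A small lint of slurm.conf could catch that before restarting slurmctld. This is only a sketch under stated assumptions: `parse_conf` and `check_accounting` are hypothetical helper names, the parser handles only simple Key=Value lines, and the "remove Loc/User" hint simply encodes the advice given in the reply above rather than any official rule.

```python
def parse_conf(text):
    """Parse simple Key=Value lines; '#' starts a comment. Keys are lowercased."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def check_accounting(settings):
    """Return a list of warnings about accounting settings used with slurmdbd."""
    warnings = []
    if not settings.get("accountingstoragetype", "").endswith("/slurmdbd"):
        return warnings
    storage_pass = settings.get("accountingstoragepass")
    if storage_pass and not storage_pass.startswith("/"):
        # In the thread, a non-path value here was opened as a munge socket.
        warnings.append(
            f"AccountingStoragePass={storage_pass!r} does not look like a "
            "munge socket path; comment it out unless a separate munge "
            "instance is used"
        )
    for key, label in (("accountingstorageloc", "AccountingStorageLoc"),
                       ("accountingstorageuser", "AccountingStorageUser")):
        if key in settings:
            warnings.append(f"{label} can be removed when using slurmdbd")
    return warnings


if __name__ == "__main__":
    sample = """
    AccountingStorageHost=localhost
    AccountingStoragePass=magic
    AccountingStorageType=accounting_storage/slurmdbd
    """
    for message in check_accounting(parse_conf(sample)):
        print("warning:", message)
```

To check a real file, pass its contents instead of the sample, for example `check_accounting(parse_conf(open("/etc/slurm/slurm.conf").read()))`; that path is an assumption and depends on how slurm was installed.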