Step back from slurm and confirm that MariaDB is up and responsive.
# mysql -uroot -p
Enter password:
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 8
Server version: 10.2.9-MariaDB MariaDB Server
Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> select table_schema, table_name from information_schema.tables;
On Wednesday, November 29, 2017 10:17 AM, Bruno Santos wrote:
Hi Barbara,
This is a fresh install. I have installed slurm from source on Debian stretch
and am now trying to set it up correctly. MariaDB is running, but I am confused
about the database configuration. I followed a tutorial (I can no longer find
it) that showed me how to create the database and give it to the slurm user on
mysql. I haven't really done anything further than that, as running anything
returns the same errors:
root@plantae:~# sacctmgr show user -s
sacctmgr: error: slurm_persist_conn_open: Something happened with the
receiving/processing of the persistent connection init message to
localhost:6819: Initial RPC not DBD_INIT
sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
sacctmgr: error: slurm_persist_conn_open: Something happened with the
receiving/processing of the persistent connection init message to
localhost:6819: Initial RPC not DBD_INIT
sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
sacctmgr: error: slurm_persist_conn_open: Something happened with the
receiving/processing of the persistent connection init message to
localhost:6819: Initial RPC not DBD_INIT
sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
sacctmgr: error: slurmdbd: DBD_GET_USERS failure: No error
Problem with query.
On 29 November 2017 at 14:46, Barbara Krašovec wrote:
Did you upgrade SLURM or is it a fresh install?
Are there any associations set? For instance, did you create the cluster with
sacctmgr?
sacctmgr add cluster <name>
Is mariadb/mysql server running, is slurmdbd running? Is it working? Try a
simple test, such as:
sacctmgr show user -s
If it was an upgrade, did you try to run the slurmdbd and slurmctld manually
first:
slurmdbd -Dvvvvv
Then controller:
slurmctld -Dvvvvv
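The logs in this thread show slurmctld registering with slurmdbd at localhost:6819, so it can help to confirm that something is listening on that port before starting the controller. The following is only a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available; the function name port_open is made up for illustration.

```shell
# port_open HOST PORT: succeed if a TCP connection can be opened.
# Relies on bash's /dev/tcp pseudo-device; `timeout` bounds the wait.
port_open() {
    local host=$1 port=$2
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example: only start slurmctld in the foreground once the port answers.
# if port_open localhost 6819; then slurmctld -Dvvvvv; fi
```

An open port only shows that something is listening; it does not verify that the listener is a working slurmdbd.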
Which OS is that?
Is there a firewall/selinux/ACLs?
Cheers,
Barbara
On 29 Nov 2017, at 15:19, Bruno Santos wrote:
Thank you Barbara,
Unfortunately, it does not seem to be a munge problem. Munge can successfully
authenticate with the nodes.
I have increased the verbosity level and restarted the slurmctld and now I am
getting more information about this:
Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port 6817
with slurmdbd.
Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open:
Something happened with the receiving/processing of the persistent connection
init message to localhost:6819: Initial RPC not DBD_INIT
Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit
msg: No error
Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open:
Something happened with the receiving/processing of the persistent connection
init message to localhost:6819: Initial RPC not DBD_INIT
Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit
msg: No error
Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have any
association data from your database. The priority/multifactor plugin requires
this information to run correctly. Please check your database connection and
try again.
The problem seems to somehow be related to slurmdbd? I am a bit lost at this
point, to be honest.
Best,
Bruno
On 29 November 2017 at 14:06, Barbara Krašovec wrote:
Hello,
does munge work?
Try if decode works locally:
munge -n | unmunge
Try if decode works remotely:
munge -n | ssh <somehost_in_cluster> unmunge
It seems as if the munge keys do not match...
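The two munge checks above can be wrapped in a small helper for scripting. This is a rough sketch, assuming unmunge prints a "STATUS:" line reading "Success (0)" when a credential decodes correctly; the function name unmunge_ok is hypothetical.

```shell
# unmunge_ok: read unmunge output on stdin and succeed only if the
# STATUS line reports Success. Assumes the "STATUS: Success (0)" format.
unmunge_ok() {
    grep -Eq '^STATUS:[[:space:]]+Success \(0\)'
}

# Usage (the host name is a placeholder, as in the message above):
#   munge -n | unmunge | unmunge_ok && echo "local decode ok"
#   munge -n | ssh <somehost_in_cluster> unmunge | unmunge_ok && echo "remote decode ok"
```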
See comments inline..
On 29 Nov 2017, at 14:40, Bruno Santos wrote:
I actually just managed to figure that one out.
The problem was that I had set AccountingStoragePass=magic in the slurm.conf
file. After re-reading the documentation, it seems this is only needed if I
have a different munge instance controlling the logins to the database, which I
don't. Commenting that line out seems to have worked; however, I am now
getting a different error:
Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817
with slurmdbd.
Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open:
Something happened with the receiving/processing of the persistent connection
init message to localhost:6819: Initial RPC not DBD_INIT
Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited,
code=exited, status=1/FAILURE
Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed
state.
Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result
'exit-code'.
My slurm.conf looks like this
# LOGGING AND ACCOUNTING
AccountingStorageHost=localhost
AccountingStorageLoc=slurm_db
#AccountingStoragePass=magic
#AccountingStoragePort=
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreJobComment=YES
ClusterName=research
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmdDebug=3
You only need:
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=<hostname>
AccountingStorageType=accounting_storage/slurmdbd
You can remove AccountingStorageLoc and AccountingStorageUser.
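To see which of the settings discussed here are still active in a slurm.conf, a quick check along these lines may help. It is a sketch only: the function name active_keys and the example path are assumptions, and the match is case-insensitive on the assumption that Slurm accepts parameter names in any case.

```shell
# active_keys FILE KEY...: print each KEY that is set on an
# uncommented line of FILE (lines starting with '#' do not match).
active_keys() {
    local file=$1 key
    shift
    for key in "$@"; do
        if grep -Eiq "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
            printf '%s\n' "$key"
        fi
    done
}

# Example (the path is an assumption; adjust to your install):
# active_keys /etc/slurm/slurm.conf AccountingStoragePass \
#     AccountingStorageLoc AccountingStorageUser
```

Any name it prints is still set and can be reviewed against the advice above.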
And the slurmdbd.conf like this:
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
#ArchiveTXN=no
#ArchiveUsage=no
# Authentication info
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
#Database info
# slurmDBD info
DbdAddr=plantae
DbdHost=plantae
# Database info
StorageType=accounting_storage/mysql
StorageHost=localhost
SlurmUser=slurm
StoragePass=magic
StorageUser=slurm
StorageLoc=slurm_db
Thank you very much in advance.
Best,
Bruno
Cheers,
Barbara
On 29 November 2017 at 13:28, Andy Riebs wrote:
It looks like you don't have the munged daemon running.
On 11/29/2017 08:01 AM, Bruno Santos wrote:
Hi everyone,
I have set up slurm to use slurm_db and all was working fine. However, I had
to change slurm.conf to play with user priority, and upon restarting,
slurmctld fails with the messages below. It seems that it is somehow
trying to use the mysql password as a munge socket? Any idea how to solve it?
Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817
with slurmdbd.
Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart with
--num-threads=10
Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed
to access "magic": No such file or directory
Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket
communication error
Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open:
failed to send persistent connection init message to localhost:6819
Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit
msg: Protocol authentication error
Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart with
--num-threads=10
Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: Failed
to access "magic": No such file or directory
Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket
communication error
Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open:
failed to send persistent connection init message to localhost:6819
Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit
msg: Protocol authentication error
Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart with
--num-threads=10
Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: Failed
to access "magic": No such file or directory
Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket
communication error
Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open:
failed to send persistent connection init message to localhost:6819
Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit
msg: Protocol authentication error
Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have any
association data from your database. The priority/multifactor plugin requires
this information to run correctly. Please check your database connection and
try again.
Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited,
code=exited, status=1/FAILURE
Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed
state.
Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result
'exit-code'.