Thank you Barbara,

Unfortunately, it does not seem to be a munge problem. Munge can successfully authenticate with the nodes.
I have increased the verbosity level and restarted the slurmctld and now I am getting more information about this:

> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port 6817 with slurmdbd.
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit msg: No error
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit msg: No error
> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.

The problem seems to somehow be related to slurmdbd? I am a bit lost at this point, to be honest.

Best,
Bruno

On 29 November 2017 at 14:06, Barbara Krašovec <barbara.kraso...@ijs.si> wrote:

> Hello,
>
> does munge work?
> Try if decode works locally:
> munge -n | unmunge
> Try if decode works remotely:
> munge -n | ssh <somehost_in_cluster> unmunge
>
> It seems as munge keys do not match...
>
> See comments inline..
>
> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsan...@gmail.com> wrote:
>
> I actually just managed to figure that one out.
>
> The problem was that I had setup AccountingStoragePass=magic in the
> slurm.conf file while after re-reading the documentation it seems this is
> only needed if I have a different munge instance controlling the logins to
> the database, which I don't.
> So commenting that line out seems to have worked; however, I am now getting
> a different error:
>
>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 with slurmdbd.
>> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed state.
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>
> My slurm.conf looks like this:
>
>> # LOGGING AND ACCOUNTING
>> AccountingStorageHost=localhost
>> AccountingStorageLoc=slurm_db
>> #AccountingStoragePass=magic
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageUser=slurm
>> AccountingStoreJobComment=YES
>> ClusterName=research
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=3
>> SlurmdDebug=3
>
> You only need:
> AccountingStorageEnforce=associations,limits,qos
> AccountingStorageHost=<hostname>
> AccountingStorageType=accounting_storage/slurmdbd
>
> You can remove AccountingStorageLoc and AccountingStorageUser.
>
> And the slurmdbd.conf like this:
>
>> ArchiveEvents=yes
>> ArchiveJobs=yes
>> ArchiveResvs=yes
>> ArchiveSteps=no
>> #ArchiveTXN=no
>> #ArchiveUsage=no
>> # Authentication info
>> AuthType=auth/munge
>> AuthInfo=/var/run/munge/munge.socket.2
>> #Database info
>> # slurmDBD info
>> DbdAddr=plantae
>> DbdHost=plantae
>> # Database info
>> StorageType=accounting_storage/mysql
>> StorageHost=localhost
>> SlurmUser=slurm
>> StoragePass=magic
>> StorageUser=slurm
>> StorageLoc=slurm_db
>
> Thank you very much in advance.
>
> Best,
> Bruno
>
> Cheers,
> Barbara
>
> On 29 November 2017 at 13:28, Andy Riebs <andy.ri...@hpe.com> wrote:
>
>> It looks like you don't have the munged daemon running.
>>
>> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>>
>> Hi everyone,
>>
>> I have set up Slurm to use slurm_db and all was working fine. However, I
>> had to change the slurm.conf to play with user priority, and upon restarting,
>> slurmctld fails with the messages below. It seems that
>> it is somehow trying to use the mysql password as a munge socket?
>> Any idea how to solve it?
>>
>>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817 with slurmdbd.
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket communication error
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket communication error
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket communication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed state.
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
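
The slurm.conf advice in this thread is to leave AccountingStoragePass unset unless a separate munge instance guards the database, and to drop AccountingStorageLoc and AccountingStorageUser when AccountingStorageType is accounting_storage/slurmdbd. That advice can be sketched as a small check. This is only an illustration in Python: the helper names `parse_conf` and `suspect_settings` are hypothetical and not part of Slurm, and the parsing is deliberately simplified (one Key=Value per line, exact-case key names). It does not address the later "Initial RPC not DBD_INIT" error, which the thread leaves unresolved.

```python
# Hypothetical helper, not part of Slurm. It flags slurm.conf keys that this
# thread suggests leaving out when slurmdbd is the accounting storage type.
SUSPECT_KEYS = (
    "AccountingStoragePass",
    "AccountingStorageLoc",
    "AccountingStorageUser",
)


def parse_conf(text):
    """Return {key: value} for active (non-comment) Key=Value lines.

    Simplification: assumes one setting per line and exact-case key names.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def suspect_settings(text):
    """List suspect keys that are set while slurmdbd is the storage type."""
    settings = parse_conf(text)
    if settings.get("AccountingStorageType") != "accounting_storage/slurmdbd":
        return []
    return [key for key in SUSPECT_KEYS if key in settings]


if __name__ == "__main__":
    # Fragment taken from the slurm.conf quoted in this thread.
    sample = """\
AccountingStorageHost=localhost
AccountingStorageLoc=slurm_db
#AccountingStoragePass=magic
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
"""
    print(suspect_settings(sample))
```

Run against the quoted fragment, the check reports AccountingStorageLoc and AccountingStorageUser; the commented-out AccountingStoragePass line is ignored because comments are stripped before parsing.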