Hello,

does munge work?

Try if decode works locally:

    munge -n | unmunge

Try if decode works remotely:

    munge -n | ssh <somehost_in_cluster> unmunge
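If the local decode works but the remote one fails, the usual cause is that /etc/munge/munge.key differs between nodes; the check is just a checksum comparison. A self-contained sketch of that comparison (scratch files stand in for the two nodes' keys; the real commands, with placeholder hostnames, are in the comments):

```shell
# On a real cluster you would run (hostname is a placeholder):
#   sha256sum /etc/munge/munge.key
#   ssh <somehost_in_cluster> sha256sum /etc/munge/munge.key
# Here two scratch files stand in for the local and remote key copies.
local_key=$(mktemp); remote_key=$(mktemp)
head -c 1024 /dev/urandom > "$local_key"
cp "$local_key" "$remote_key"                  # same key on both "nodes"
local_sum=$(sha256sum < "$local_key" | awk '{print $1}')
remote_sum=$(sha256sum < "$remote_key" | awk '{print $1}')
if [ "$local_sum" = "$remote_sum" ]; then
    echo "munge keys match"
else
    echo "munge keys differ -- redistribute /etc/munge/munge.key"
fi
rm -f "$local_key" "$remote_key"
```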
It seems the munge keys do not match... See comments inline.

> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsan...@gmail.com> wrote:
>
> I actually just managed to figure that one out.
>
> The problem was that I had set up AccountingStoragePass=magic in the
> slurm.conf file, while after re-reading the documentation it seems this is
> only needed if I have a different munge instance controlling the logins to
> the database, which I don't.
> So commenting that line out seems to have worked; however, I am now getting
> a different error:
>
> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 with slurmdbd.
> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed state.
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>
> My slurm.conf looks like this:
>
> # LOGGING AND ACCOUNTING
> AccountingStorageHost=localhost
> AccountingStorageLoc=slurm_db
> #AccountingStoragePass=magic
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageUser=slurm
> AccountingStoreJobComment=YES
> ClusterName=research
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> SlurmdDebug=3

You only need:

    AccountingStorageEnforce=associations,limits,qos
    AccountingStorageHost=<hostname>
    AccountingStorageType=accounting_storage/slurmdbd

You can remove AccountingStorageLoc and AccountingStorageUser.
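Put together, the accounting section suggested above reduces to a short fragment like the following (<hostname> is the machine running slurmdbd; the comment about the shared munge key restates the explanation from the thread):

```
# slurm.conf -- accounting via slurmdbd; no AccountingStoragePass is
# needed when slurmctld and slurmdbd authenticate via the same munge key
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=<hostname>
AccountingStorageEnforce=associations,limits,qos
```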
> And the slurmdbd.conf like this:
>
> ArchiveEvents=yes
> ArchiveJobs=yes
> ArchiveResvs=yes
> ArchiveSteps=no
> #ArchiveTXN=no
> #ArchiveUsage=no
> # Authentication info
> AuthType=auth/munge
> AuthInfo=/var/run/munge/munge.socket.2
> # slurmDBD info
> DbdAddr=plantae
> DbdHost=plantae
> # Database info
> StorageType=accounting_storage/mysql
> StorageHost=localhost
> SlurmUser=slurm
> StoragePass=magic
> StorageUser=slurm
> StorageLoc=slurm_db
>
> Thank you very much in advance.
>
> Best,
> Bruno

Cheers,
Barbara

> On 29 November 2017 at 13:28, Andy Riebs <andy.ri...@hpe.com> wrote:
>
> It looks like you don't have the munged daemon running.
>
> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>> Hi everyone,
>>
>> I have set up slurm to use slurm_db and all was working fine. However, I
>> had to change slurm.conf to play with user priority, and upon restarting,
>> slurmctld fails with the messages below. It seems that somehow it is
>> trying to use the mysql password as a munge socket?
>> Any idea how to solve it?
>>
>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817 with slurmdbd.
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket communication error
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket communication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket communication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed state.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
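One more point worth checking, since the slurmdbd.conf above stores StoragePass in clear text: the slurmdbd.conf man page says the file must be readable only by SlurmUser, and recent Slurm releases complain if it is world-readable. A sketch of the fix on a scratch copy (the real path, typically /etc/slurm/slurmdbd.conf, and the chown to the slurm user are shown in the comments and would be run as root):

```shell
# On a real node:  chown slurm: /etc/slurm/slurmdbd.conf
#                  chmod 600   /etc/slurm/slurmdbd.conf
# A scratch file stands in for slurmdbd.conf here.
conf=$(mktemp)
printf 'StoragePass=magic\n' > "$conf"
chmod 600 "$conf"                 # readable/writable by the owner only
mode=$(stat -c '%a' "$conf")
echo "slurmdbd.conf mode: $mode"
rm -f "$conf"
```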