Hello,

does munge work?
Check whether a credential decodes locally:
munge -n | unmunge
and whether it decodes remotely:
munge -n | ssh <somehost_in_cluster> unmunge

It looks as though your munge keys do not match across hosts...
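If the local decode works but the remote one fails, you can compare key fingerprints directly (assuming the default key location /etc/munge/munge.key and root access on both hosts -- adjust paths for your installation):

# Fingerprint of the local key:
sudo md5sum /etc/munge/munge.key
# Fingerprint on a remote node -- must be identical on every node:
ssh <somehost_in_cluster> sudo md5sum /etc/munge/munge.key

If the checksums differ, copy one key to all nodes (owned by the munge user, mode 0400) and restart munged everywhere.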

See comments inline.

> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsan...@gmail.com> wrote:
> 
> I actually just managed to figure that one out.
> 
> The problem was that I had set AccountingStoragePass=magic in the 
> slurm.conf file; after re-reading the documentation, it seems this is 
> only needed if a separate munge instance controls the logins to 
> the database, which I don't have.
> Commenting that line out worked, but I am now getting a 
> different error:
> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 
> with slurmdbd.
> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: 
> Something happened with the receiving/processing of the persistent connection 
> init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, 
> code=exited, status=1/FAILURE
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed 
> state.
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 
> 'exit-code'.
> 
> My slurm.conf looks like this
> # LOGGING AND ACCOUNTING
> AccountingStorageHost=localhost
> AccountingStorageLoc=slurm_db
> #AccountingStoragePass=magic
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageUser=slurm
> AccountingStoreJobComment=YES
> ClusterName=research
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> SlurmdDebug=3

You only need:
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=<hostname>
AccountingStorageType=accounting_storage/slurmdbd

You can remove AccountingStorageLoc and AccountingStorageUser. Note that 
AccountingStoragePass is interpreted as the path to an alternate MUNGE 
socket, not a database password -- which is why your earlier log showed 
munge failing to access "magic" as if it were a socket.
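Once slurmctld and slurmdbd are both running, you can verify the registration (assuming Slurm's client tools are in your PATH):

# Should list your cluster ("research") with a control host and port:
sacctmgr show cluster
# Confirm which accounting settings slurmctld actually loaded:
scontrol show config | grep -i accountingstorage

If the cluster row shows an empty control host, slurmctld has not registered with slurmdbd yet.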


> 
> And the slurmdbd.conf looks like this:
> ArchiveEvents=yes
> ArchiveJobs=yes
> ArchiveResvs=yes
> ArchiveSteps=no
> #ArchiveTXN=no
> #ArchiveUsage=no
> # Authentication info
> AuthType=auth/munge
> AuthInfo=/var/run/munge/munge.socket.2
> #Database info
> # slurmDBD info
> DbdAddr=plantae
> DbdHost=plantae
> # Database info
> StorageType=accounting_storage/mysql
> StorageHost=localhost
> SlurmUser=slurm
> StoragePass=magic
> StorageUser=slurm
> StorageLoc=slurm_db
> 
> 
> Thank you very much in advance.
> 
> Best,
> Bruno

Cheers,
Barbara

> 
> 
> On 29 November 2017 at 13:28, Andy Riebs <andy.ri...@hpe.com 
> <mailto:andy.ri...@hpe.com>> wrote:
> It looks like you don't have the munged daemon running.
> 
> 
> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>> Hi everyone,
>> 
>> I have set up slurm to use slurm_db and all was working fine. However I had 
>> to change slurm.conf to play with user priority, and upon restarting, 
>> slurmctld fails with the messages below. It seems that somehow it is 
>> trying to use the mysql password as a munge socket?
>> Any idea how to solve it?
>> 
>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817 
>> with slurmdbd.
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart 
>> with --num-threads=10
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed 
>> to access "magic": No such file or directory
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket 
>> communication error
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending 
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart 
>> with --num-threads=10
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: Failed 
>> to access "magic": No such file or directory
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket 
>> communication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending 
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart 
>> with --num-threads=10
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: Failed 
>> to access "magic": No such file or directory
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket 
>> communication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending 
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have 
>> any association data from your database.  The priority/multifactor plugin 
>> requires this information to run correctly.  Please check your database 
>> connection and try again.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited, 
>> code=exited, status=1/FAILURE
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed 
>> state.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result 
>> 'exit-code'.
>> 
>> 
> 
> 
