Did you upgrade SLURM or is it a fresh install? Are there any associations set? For instance, did you create the cluster with sacctmgr?

sacctmgr add cluster <name>
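If there are no associations yet, here is a minimal sketch of creating one. The cluster name "research" comes from the slurm.conf quoted further down; the account name "playground" and user "bruno" are placeholders, substitute your own:

sacctmgr add cluster research
# an association is cluster + account + user, so create one of each
# ("playground" and "bruno" are placeholder names):
sacctmgr add account playground Description="test account" Organization="lab"
sacctmgr add user bruno Account=playground
# then confirm that association data is visible:
sacctmgr show associations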
Is the mariadb/mysql server running? Is slurmdbd running, and is it working? Try a simple test, such as:

sacctmgr show user -s

If it was an upgrade, did you try to run slurmdbd and slurmctld manually first?

slurmdbd -Dvvvvv

Then the controller:

slurmctld -Dvvvvv

Which OS is that? Is there a firewall/selinux/ACLs?

Cheers,
Barbara
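A quick way to run those daemon checks, assuming systemd units named mariadb and slurmdbd and the default DbdPort of 6819 (which matches the localhost:6819 in the logs quoted below):

systemctl status mariadb slurmdbd    # are both daemons up?
ss -tlnp | grep 6819                 # is slurmdbd actually listening on its port?
sacctmgr show cluster                # can sacctmgr reach slurmdbd at all?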
> On 29 Nov 2017, at 15:19, Bruno Santos <bacmsan...@gmail.com> wrote:
>
> Thank you, Barbara.
>
> Unfortunately, it does not seem to be a munge problem. Munge can successfully authenticate with the nodes.
>
> I have increased the verbosity level and restarted slurmctld, and now I am getting more information about this:
>
> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port 6817 with slurmdbd.
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit msg: No error
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit msg: No error
> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.
>
> The problem somehow seems to be related to slurmdbd?
> I am a bit lost at this point, to be honest.
>
> Best,
> Bruno
>
> On 29 November 2017 at 14:06, Barbara Krašovec <barbara.kraso...@ijs.si> wrote:
> Hello,
>
> Does munge work?
> Check whether decoding works locally:
>
> munge -n | unmunge
>
> And remotely:
>
> munge -n | ssh <somehost_in_cluster> unmunge
>
> It seems as if the munge keys do not match...
>
> See comments inline.
>
>> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsan...@gmail.com> wrote:
>>
>> I actually just managed to figure that one out.
>>
>> The problem was that I had set AccountingStoragePass=magic in the slurm.conf file; after re-reading the documentation, it seems this is only needed if I have a separate munge instance controlling logins to the database, which I don't.
>> Commenting that line out seems to have worked; however, I am now getting a different error:
>>
>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 with slurmdbd.
>> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed state.
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>>
>> My slurm.conf looks like this:
>>
>> # LOGGING AND ACCOUNTING
>> AccountingStorageHost=localhost
>> AccountingStorageLoc=slurm_db
>> #AccountingStoragePass=magic
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageUser=slurm
>> AccountingStoreJobComment=YES
>> ClusterName=research
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=3
>> SlurmdDebug=3
>
> You only need:
>
> AccountingStorageEnforce=associations,limits,qos
> AccountingStorageHost=<hostname>
> AccountingStorageType=accounting_storage/slurmdbd
>
> You can remove AccountingStorageLoc and AccountingStorageUser.
>
>> And the slurmdbd.conf like this:
>>
>> ArchiveEvents=yes
>> ArchiveJobs=yes
>> ArchiveResvs=yes
>> ArchiveSteps=no
>> #ArchiveTXN=no
>> #ArchiveUsage=no
>> # Authentication info
>> AuthType=auth/munge
>> AuthInfo=/var/run/munge/munge.socket.2
>> # slurmDBD info
>> DbdAddr=plantae
>> DbdHost=plantae
>> # Database info
>> StorageType=accounting_storage/mysql
>> StorageHost=localhost
>> SlurmUser=slurm
>> StoragePass=magic
>> StorageUser=slurm
>> StorageLoc=slurm_db
>>
>> Thank you very much in advance.
>>
>> Best,
>> Bruno
>
> Cheers,
> Barbara
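Given the slurmdbd.conf quoted above, the database connection itself can be tested outside of SLURM with the mysql client, using the StorageUser/StoragePass/StorageLoc values from that config (adjust if yours differ):

mysql -u slurm -p'magic' -h localhost slurm_db -e 'show tables;'

If that login fails, slurmdbd will fail in the same way, independently of munge.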
>> On 29 November 2017 at 13:28, Andy Riebs <andy.ri...@hpe.com> wrote:
>> It looks like you don't have the munged daemon running.
>>
>> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>>> Hi everyone,
>>>
>>> I have set up slurm to use slurm_db and everything was working fine. However, I had to change slurm.conf to play with user priority, and upon restarting, slurmctld fails with the messages below. It seems that it is somehow trying to use the mysql password as a munge socket?
>>> Any idea how to solve it?
>>>
>>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817 with slurmdbd.
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket communication error
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>>> [the same five errors repeat at 12:56:34 and again at 12:56:36]
>>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed state.
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.