Did you upgrade SLURM or is it a fresh install?

Are there any associations set? For instance, did you create the cluster with 
sacctmgr?
sacctmgr add cluster <name>
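If the cluster exists but has no associations, slurmctld will fail with 
exactly the "no association data" fatal you posted. A minimal sketch 
(account and user names below are just placeholders, adjust to your setup):

sacctmgr show cluster
sacctmgr add account myaccount cluster=<name>
sacctmgr add user <username> account=myaccount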

Is the mariadb/mysql server running? Is slurmdbd running and working? Try a 
simple test, such as:
sacctmgr show user -s
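You can also check that the daemons are up and that slurmdbd is actually 
listening on its port, for example (assuming systemd and the default 
DbdPort of 6819; the mariadb unit may be called mysql/mysqld on your 
distribution):

systemctl status mariadb slurmdbd
ss -tlnp | grep 6819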
If it was an upgrade, did you try running slurmdbd and slurmctld manually 
first:

slurmdbd -Dvvvvv

Then controller:

slurmctld -Dvvvvv
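
With the daemons in the foreground you should see on which side the 
persistent connection init fails. It is also worth confirming that both 
daemons are the same version, since a slurmctld/slurmdbd version mismatch 
can cause persistent-connection errors like "Initial RPC not DBD_INIT":

slurmdbd -V
slurmctld -V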

Which OS is that?
Are there any firewall/SELinux/ACL restrictions?
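For example, to rule those out quickly (assuming SELinux and iptables are 
in use; adapt to your distribution):

getenforce
iptables -L -n | grep 6819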

Cheers,
Barbara


> On 29 Nov 2017, at 15:19, Bruno Santos <bacmsan...@gmail.com> wrote:
> 
> Thank you Barbara,
> 
> Unfortunately, it does not seem to be a munge problem. Munge can successfully 
> authenticate with the nodes.
> 
> I have increased the verbosity level and restarted slurmctld, and I am now 
> getting more information about this:
> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port 6817 
> with slurmdbd.
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: 
> Something happened with the receiving/processing of the persistent connection 
> init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending 
> PersistInit msg: No error
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: 
> Something happened with the receiving/processing of the persistent connection 
> init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending 
> PersistInit msg: No error
> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have 
> any association data from your database.  The priority/multifactor plugin 
> requires this information to run correctly.  Please check your database 
> connection and try again.
> 
> The problem seems to somehow be related to slurmdbd?
> I am a bit lost at this point, to be honest.
> 
> Best,
> Bruno
> 
> On 29 November 2017 at 14:06, Barbara Krašovec <barbara.kraso...@ijs.si> wrote:
> Hello,
> 
> does munge work?
> Check whether decoding works locally:
> munge -n | unmunge
> Check whether decoding works remotely:
> munge -n | ssh <somehost_in_cluster> unmunge
> 
> It seems as if the munge keys do not match...
> 
> See comments inline...
> 
>> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsan...@gmail.com> wrote:
>> 
>> I actually just managed to figure that one out.
>> 
>> The problem was that I had set AccountingStoragePass=magic in the 
>> slurm.conf file; after re-reading the documentation, it seems this is only 
>> needed if a separate munge instance controls logins to the database, which 
>> I don't have.
>> Commenting that line out seems to have worked; however, I am now getting a 
>> different error:
>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 
>> with slurmdbd.
>> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: 
>> Something happened with the receiving/processing of the persistent 
>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, 
>> code=exited, status=1/FAILURE
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed 
>> state.
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 
>> 'exit-code'.
>> 
>> My slurm.conf looks like this:
>> # LOGGING AND ACCOUNTING
>> AccountingStorageHost=localhost
>> AccountingStorageLoc=slurm_db
>> #AccountingStoragePass=magic
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageUser=slurm
>> AccountingStoreJobComment=YES
>> ClusterName=research
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=3
>> SlurmdDebug=3
> 
> You only need:
> AccountingStorageEnforce=associations,limits,qos
> AccountingStorageHost=<hostname>
> AccountingStorageType=accounting_storage/slurmdbd
> 
> You can remove AccountingStorageLoc and AccountingStorageUser; with 
> slurmdbd, the database credentials belong in slurmdbd.conf 
> (StorageUser/StoragePass), not in slurm.conf.
> 
> 
>> 
>> And the slurmdbd.conf looks like this:
>> ArchiveEvents=yes
>> ArchiveJobs=yes
>> ArchiveResvs=yes
>> ArchiveSteps=no
>> #ArchiveTXN=no
>> #ArchiveUsage=no
>> # Authentication info
>> AuthType=auth/munge
>> AuthInfo=/var/run/munge/munge.socket.2
>> #Database info
>> # slurmDBD info
>> DbdAddr=plantae
>> DbdHost=plantae
>> # Database info
>> StorageType=accounting_storage/mysql
>> StorageHost=localhost
>> SlurmUser=slurm
>> StoragePass=magic
>> StorageUser=slurm
>> StorageLoc=slurm_db
>> 
>> 
>> Thank you very much in advance.
>> 
>> Best,
>> Bruno
> 
> Cheers,
> Barbara
> 
>> 
>> 
>> On 29 November 2017 at 13:28, Andy Riebs <andy.ri...@hpe.com> wrote:
>> It looks like you don't have the munged daemon running.
>> 
>> 
>> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>>> Hi everyone,
>>> 
>>> I have set up slurm to use slurm_db and all was working fine. However, I 
>>> had to change slurm.conf to play with user priority, and upon restarting, 
>>> slurmctld fails with the messages below. It seems that it is somehow 
>>> trying to use the mysql password as a munge socket?
>>> Any idea how to solve it?
>>> 
>>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 
>>> 6817 with slurmdbd.
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart 
>>> with --num-threads=10
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: 
>>> Failed to access "magic": No such file or directory
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket 
>>> communication error
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>>> failed to send persistent connection init message to localhost:6819
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending 
>>> PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart 
>>> with --num-threads=10
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: 
>>> Failed to access "magic": No such file or directory
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket 
>>> communication error
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>>> failed to send persistent connection init message to localhost:6819
>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending 
>>> PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart 
>>> with --num-threads=10
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: 
>>> Failed to access "magic": No such file or directory
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket 
>>> communication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>>> failed to send persistent connection init message to localhost:6819
>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending 
>>> PersistInit msg: Protocol authentication error
>>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have 
>>> any association data from your database.  The priority/multifactor plugin 
>>> requires this information to run correctly.  Please check your database 
>>> connection and try again.
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited, 
>>> code=exited, status=1/FAILURE
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed 
>>> state.
>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result 
>>> 'exit-code'.
>>> 
>>> 
>> 
>> 
> 
> 
