I managed to make some more progress on this. The problem seems to be that the service was still pointing at an older version of slurmdbd I had installed with apt. I have now (hopefully) fully removed the old version, but when I try to start the service it gets killed somehow. Any suggestions?
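One quick way to confirm the old apt install is really gone and that the unit now launches the source-built binary (a sketch, assuming a systemd-managed service and a /usr/local install, which is what the PluginDir in the log below points at):

    # Show the unit file systemd is actually using, including its ExecStart path
    systemctl cat slurmdbd.service

    # Which slurmdbd is first in PATH, and what version is it?
    command -v slurmdbd
    slurmdbd -V

    # Any leftover Debian slurm packages?
    dpkg -l | grep -i slurm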
> [2017-11-29T16:15:16.778] debug3: Trying to load plugin /usr/local/lib/slurm/auth_munge.so
> [2017-11-29T16:15:16.778] debug: Munge authentication plugin loaded
> [2017-11-29T16:15:16.778] debug3: Success.
> [2017-11-29T16:15:16.778] debug3: Trying to load plugin /usr/local/lib/slurm/accounting_storage_mysql.so
> [2017-11-29T16:15:16.780] debug2: mysql_connect() called for db slurm_db
> [2017-11-29T16:15:16.786] adding column federation after flags in table cluster_table
> [2017-11-29T16:15:16.786] adding column fed_id after federation in table cluster_table
> [2017-11-29T16:15:16.786] adding column fed_state after fed_id in table cluster_table
> [2017-11-29T16:15:16.786] adding column fed_weight after fed_state in table cluster_table
> [2017-11-29T16:15:16.786] debug: Table cluster_table has changed. Updating...
> [2017-11-29T16:15:17.259] debug: Table txn_table has changed. Updating...
> [2017-11-29T16:15:17.781] debug: Table tres_table has changed. Updating...
> [2017-11-29T16:15:18.325] debug: Table acct_coord_table has changed. Updating...
> [2017-11-29T16:15:18.783] debug: Table acct_table has changed. Updating...
> [2017-11-29T16:15:19.252] debug: Table res_table has changed. Updating...
> [2017-11-29T16:15:20.267] debug: Table clus_res_table has changed. Updating...
> [2017-11-29T16:15:20.762] debug: Table qos_table has changed. Updating...
> [2017-11-29T16:15:21.272] debug: Table user_table has changed. Updating...
> [2017-11-29T16:15:22.079] Accounting storage MYSQL plugin loaded
> [2017-11-29T16:15:22.080] debug3: Success.
> [2017-11-29T16:15:22.083] debug2: ArchiveDir = /tmp
> [2017-11-29T16:15:22.083] debug2: ArchiveScript = (null)
> [2017-11-29T16:15:22.083] debug2: AuthInfo = (null)
> [2017-11-29T16:15:22.083] debug2: AuthType = auth/munge
> [2017-11-29T16:15:22.083] debug2: CommitDelay = 0
> [2017-11-29T16:15:22.083] debug2: DbdAddr = 10.1.10.37
> [2017-11-29T16:15:22.083] debug2: DbdBackupHost = (null)
> [2017-11-29T16:15:22.083] debug2: DbdHost = plantae
> [2017-11-29T16:15:22.083] debug2: DbdPort = 6819
> [2017-11-29T16:15:22.083] debug2: DebugFlags = (null)
> [2017-11-29T16:15:22.083] debug2: DebugLevel = 7
> [2017-11-29T16:15:22.083] debug2: DefaultQOS = (null)
> [2017-11-29T16:15:22.083] debug2: LogFile = /slurm/log/slurmdbd.log
> [2017-11-29T16:15:22.083] debug2: MessageTimeout = 10
> [2017-11-29T16:15:22.083] debug2: PidFile = /slurm/run/slurmdbd.pid
> [2017-11-29T16:15:22.083] debug2: PluginDir = /usr/local/lib/slurm
> [2017-11-29T16:15:22.083] debug2: PrivateData = none
> [2017-11-29T16:15:22.083] debug2: PurgeEventAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeJobAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeResvAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeStepAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeSuspendAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeTXNAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeUsageAfter = NONE
> [2017-11-29T16:15:22.083] debug2: SlurmUser = slurm(64030)
> [2017-11-29T16:15:22.083] debug2: StorageBackupHost = (null)
> [2017-11-29T16:15:22.083] debug2: StorageHost = localhost
> [2017-11-29T16:15:22.083] debug2: StorageLoc = slurm_db
> [2017-11-29T16:15:22.083] debug2: StoragePort = 3306
> [2017-11-29T16:15:22.083] debug2: StorageType = accounting_storage/mysql
> [2017-11-29T16:15:22.083] debug2: StorageUser = slurm
> [2017-11-29T16:15:22.083] debug2: TCPTimeout = 2
> [2017-11-29T16:15:22.083] debug2: TrackWCKey = 0
> [2017-11-29T16:15:22.083] debug2: TrackSlurmctldDown= 0
> [2017-11-29T16:15:22.083] debug2: acct_storage_p_get_connection: request new connection 1
> [2017-11-29T16:15:22.086] slurmdbd version 17.02.9 started
> [2017-11-29T16:15:22.086] debug2: running rollup at Wed Nov 29 16:15:22 2017
> [2017-11-29T16:15:22.086] debug2: Everything rolled up
> [2017-11-29T16:16:46.798] Terminate signal (SIGINT or SIGTERM) received
> [2017-11-29T16:16:46.798] debug: rpc_mgr shutting down
> [2017-11-29T16:16:46.799] debug3: starting mysql cleaning up
> [2017-11-29T16:16:46.799] debug3: finished mysql cleaning up
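The log shows slurmdbd starting cleanly and then receiving a terminate signal about 85 seconds later, which is the pattern you can get when systemd gives up waiting for the unit to report a successful start, for example because the PIDFile in the unit file (possibly left behind by the apt package) does not match the PidFile in slurmdbd.conf, here /slurm/run/slurmdbd.pid. A sketch of what to compare; the /usr/local/etc config path is an assumption about where the source build keeps slurmdbd.conf:

    # What did systemd log around the kill?
    journalctl -u slurmdbd.service -n 50

    # PID file the unit expects vs. the one slurmdbd actually writes
    systemctl cat slurmdbd.service | grep -i pidfile
    grep -i pidfile /usr/local/etc/slurmdbd.conf    # path assumed for a /usr/local build

    # If they differ, override the unit rather than editing the packaged file
    systemctl edit slurmdbd.service    # add:  [Service]  PIDFile=/slurm/run/slurmdbd.pid
    systemctl restart slurmdbd.service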

On 29 November 2017 at 15:13, Bruno Santos <bacmsan...@gmail.com> wrote:

> Hi Barbara,
>
> This is a fresh install. I have installed Slurm from source on Debian stretch and am now trying to set it up correctly.
> MariaDB is running, but I am confused about the database configuration. I followed a tutorial (which I can no longer find) that showed me how to create the database and grant it to the slurm user in MySQL. I haven't really done anything further than that, as running anything returns the same errors:
>
> root@plantae:~# sacctmgr show user -s
>> sacctmgr: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
>> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
>> sacctmgr: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
>> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
>> sacctmgr: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
>> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
>> sacctmgr: error: slurmdbd: DBD_GET_USERS failure: No error
>> Problem with query.
>
> On 29 November 2017 at 14:46, Barbara Krašovec <barbara.kraso...@ijs.si> wrote:
>
>> Did you upgrade SLURM or is it a fresh install?
>>
>> Are there any associations set? For instance, did you create the cluster with sacctmgr?
>> sacctmgr add cluster <name>
>>
>> Is the mariadb/mysql server running, and is slurmdbd running? Is it working? Try a simple test, such as:
>>
>> sacctmgr show user -s
>>
>> If it was an upgrade, did you try to run slurmdbd and slurmctld manually first:
>>
>> slurmdbd -Dvvvvv
>>
>> Then the controller:
>>
>> slurmctld -Dvvvvv
>>
>> Which OS is that?
>> Is there a firewall/selinux/ACLs?
>>
>> Cheers,
>> Barbara
>>
>> On 29 Nov 2017, at 15:19, Bruno Santos <bacmsan...@gmail.com> wrote:
>>
>> Thank you Barbara,
>>
>> Unfortunately, it does not seem to be a munge problem. Munge can successfully authenticate with the nodes.
>>
>> I have increased the verbosity level and restarted slurmctld, and now I am getting more information about this:
>>
>>> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port 6817 with slurmdbd.
>>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
>>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit msg: No error
>>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
>>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit msg: No error
>>> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.
>>
>> The problem seems to somehow be related to slurmdbd?
>> I am a bit lost at this point, to be honest.
>>
>> Best,
>> Bruno
>>
>> On 29 November 2017 at 14:06, Barbara Krašovec <barbara.kraso...@ijs.si> wrote:
>>
>>> Hello,
>>>
>>> Does munge work?
>>> Check that decode works locally:
>>> munge -n | unmunge
>>> Check that decode works remotely:
>>> munge -n | ssh <somehost_in_cluster> unmunge
>>>
>>> It seems as if the munge keys do not match...
>>>
>>> See comments inline.
>>>
>>> On 29 Nov 2017, at 14:40, Bruno Santos <bacmsan...@gmail.com> wrote:
>>>
>>> I actually just managed to figure that one out.
>>>
>>> The problem was that I had set AccountingStoragePass=magic in the slurm.conf file, while after re-reading the documentation it seems this is only needed if I have a separate munge instance controlling the logins to the database, which I don't.
>>> Commenting that line out seems to have worked; however, I am now getting a different error:
>>>
>>>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 with slurmdbd.
>>>> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: Something happened with the receiving/processing of the persistent connection init message to localhost:6819: Initial RPC not DBD_INIT
>>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed state.
>>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>>>
>>> My slurm.conf looks like this:
>>>
>>>> # LOGGING AND ACCOUNTING
>>>> AccountingStorageHost=localhost
>>>> AccountingStorageLoc=slurm_db
>>>> #AccountingStoragePass=magic
>>>> #AccountingStoragePort=
>>>> AccountingStorageType=accounting_storage/slurmdbd
>>>> AccountingStorageUser=slurm
>>>> AccountingStoreJobComment=YES
>>>> ClusterName=research
>>>> JobCompType=jobcomp/none
>>>> JobAcctGatherFrequency=30
>>>> JobAcctGatherType=jobacct_gather/none
>>>> SlurmctldDebug=3
>>>> SlurmdDebug=3
>>>
>>> You only need:
>>> AccountingStorageEnforce=associations,limits,qos
>>> AccountingStorageHost=<hostname>
>>> AccountingStorageType=accounting_storage/slurmdbd
>>>
>>> You can remove AccountingStorageLoc and AccountingStorageUser.
>>>
>>> And the slurmdbd.conf looks like this:
>>>
>>>> ArchiveEvents=yes
>>>> ArchiveJobs=yes
>>>> ArchiveResvs=yes
>>>> ArchiveSteps=no
>>>> #ArchiveTXN=no
>>>> #ArchiveUsage=no
>>>> # Authentication info
>>>> AuthType=auth/munge
>>>> AuthInfo=/var/run/munge/munge.socket.2
>>>> #Database info
>>>> # slurmDBD info
>>>> DbdAddr=plantae
>>>> DbdHost=plantae
>>>> # Database info
>>>> StorageType=accounting_storage/mysql
>>>> StorageHost=localhost
>>>> SlurmUser=slurm
>>>> StoragePass=magic
>>>> StorageUser=slurm
>>>> StorageLoc=slurm_db
>>>
>>> Thank you very much in advance.
>>>
>>> Best,
>>> Bruno
>>>
>>> Cheers,
>>> Barbara
>>>
>>> On 29 November 2017 at 13:28, Andy Riebs <andy.ri...@hpe.com> wrote:
>>>
>>>> It looks like you don't have the munged daemon running.
>>>>
>>>> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> I have set up Slurm to use slurm_db and everything was working fine. However, I had to change slurm.conf to play with user priority, and upon restarting, slurmctld fails with the messages below. It seems that somehow it is trying to use the MySQL password as a munge socket?
>>>> Any idea how to solve it?
>>>>
>>>>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817 with slurmdbd.
>>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket communication error
>>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>>>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket communication error
>>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>>>>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart with --num-threads=10
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: Failed to access "magic": No such file or directory
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket communication error
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open: failed to send persistent connection init message to localhost:6819
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending PersistInit msg: Protocol authentication error
>>>>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have any association data from your database. The priority/multifactor plugin requires this information to run correctly. Please check your database connection and try again.
>>>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
>>>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed state.
>>>>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result 'exit-code'.
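The fatal "no association data" error in the quoted logs is typically what Barbara's sacctmgr suggestion addresses: slurmctld will not start until the cluster has been registered in the accounting database. A minimal sequence, assuming the cluster name from the quoted slurm.conf (ClusterName=research) and that slurmdbd itself now stays up:

    # Run the daemon in the foreground first so any error is visible on the console
    slurmdbd -Dvvv

    # In another shell, register the cluster named in slurm.conf (ClusterName=research)
    sacctmgr add cluster research
    sacctmgr show cluster

    # Only once the association exists, start the controller
    slurmctld -Dvvv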