Re: [slurm-users] Upgrade woes

Ole Holm Nielsen Thu, 31 May 2018 00:03:19 -0700

Hi Lachlan,

Slurm upgrades on CentOS 7.5 should run without problems. It seems to me that your problems are unrelated to the Slurm RPMs. FWIW, I documented the Munge and Slurm installation as well as the upgrade process in my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_installation


Hope this helps.
/Ole

On 05/31/2018 07:39 AM, Lachlan Musicman wrote:

After last night's announcement, I decided to start the upgrade process.
Build went fine - once I worked out where munge went - and installation also seemed fine.
slurmctld won't restart though.

In the logs I'm seeing:
[2018-05-31T15:20:50.810] debug: Munge encode failed: Failed to access "xxxxxxxx": No such file or directory (retrying ...)
[2018-05-31T15:20:50.824] debug:  Recovered 4 tres
[2018-05-31T15:20:50.825] debug:  Recovered 3 users
[2018-05-31T15:20:50.825] debug:  Recovered 0 resources
[2018-05-31T15:20:50.825] debug:  Recovered 1 qos
[2018-05-31T15:20:50.825] debug:  Recovered 8 associations
[2018-05-31T15:20:50.872] fatal: You are running with a database but for some reason we have less TRES than should be here (4 < 5) and/or the "billing" TRES is missing. This should only happen if the database is down after an upgrade.
The first issue is that
debug: Munge encode failed: Failed to access "xxxxxx": No such file or directory (retrying ...)
contains the password in clear text ("xxxxx"). This is doubly confusing - "failed to access" would indicate it meant to have the database name (StorageLoc) rather than the database password (StoragePass). If it is meant to be using the password, I don't think it should be clear text and (in my mind) the language should be clearer.
The second issue is that slurmctld.service won't start. The last error shown above
fatal: You are running with a database but for some reason we have less TRES than should be here (4 < 5) and/or the "billing" TRES is missing. This should only happen if the database is down after an upgrade.
has a couple of hits in Google - an unanswered email from January
https://groups.google.com/d/msg/slurm-users/iZsSVlqQAyE/rKiSWihyEQAJ

and a bug report
https://bugs.schedmd.com/show_bug.cgi?id=4579
which seems to have solved a slightly different but similar problem. The fix suggested in that bug report doesn't work: using MariaDB_server 5.2.x my tres_table didn't have gres in it anyway.
+---------------+---------+------+----------------+------+
| creation_time | deleted | id   | type           | name |
+---------------+---------+------+----------------+------+
|    1527744028 |       0 |    1 | cpu            |      |
|    1527744028 |       0 |    2 | mem            |      |
|    1527744028 |       0 |    3 | energy         |      |
|    1527744028 |       0 |    4 | node           |      |
|    1527744028 |       0 |    5 | billing        |      |
|    1527744028 |       1 | 1000 | dynamic_offset |      |
+---------------+---------+------+----------------+------+
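For anyone wanting to sanity-check a dump like the one above against the fatal message, here is a minimal sketch in Python. It only parses the pasted table text; the helper names (`parse_rows`, `live_tres_types`) are made up for illustration and are not part of Slurm. It counts the rows with deleted = 0 and checks whether "billing" is among them.

```python
# Hypothetical helpers (not part of Slurm): parse the ASCII table pasted
# in this message and list the TRES types that are not marked deleted.
TABLE = """\
+---------------+---------+------+----------------+------+
| creation_time | deleted | id   | type           | name |
+---------------+---------+------+----------------+------+
|    1527744028 |       0 |    1 | cpu            |      |
|    1527744028 |       0 |    2 | mem            |      |
|    1527744028 |       0 |    3 | energy         |      |
|    1527744028 |       0 |    4 | node           |      |
|    1527744028 |       0 |    5 | billing        |      |
|    1527744028 |       1 | 1000 | dynamic_offset |      |
+---------------+---------+------+----------------+------+
"""


def parse_rows(text):
    """Turn the '|'-delimited table into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ separator lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first '|' line holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def live_tres_types(rows):
    """Return the 'type' of every row whose 'deleted' column is 0."""
    return [row["type"] for row in rows if row["deleted"] == "0"]


rows = parse_rows(TABLE)
live = live_tres_types(rows)
print("live TRES:", live)
print("count:", len(live), "billing present:", "billing" in live)
```

Read this way, the table itself has five live TRES including billing, while the log above says "Recovered 4 tres". That may mean the mismatch is on the recovered side rather than in tres_table, but that is only an inference from the pasted output, not something confirmed here.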


No idea what to try next. Any hints would be appreciated.
Running on CentOS 7.5, upgrading from 17.02.8 (and I dropped the slurmdbd db and restarted it from empty when the bug report didn't work)
