Hi Lachlan,
Slurm upgrades on CentOS 7.5 should run without problems. It seems to
me that your problems are unrelated to the Slurm RPMs. FWIW, I
documented the Munge and Slurm installation, as well as the upgrade
process, on my Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation
Hope this helps.
/Ole
On 05/31/2018 07:39 AM, Lachlan Musicman wrote:
After last night's announcement, I decided to start the upgrade process.
Build went fine - once I worked out where munge went - and installation
also seemed fine.
slurmctld won't restart though.
In the logs I'm seeing:
[2018-05-31T15:20:50.810] debug: Munge encode failed: Failed to access
"xxxxxxxx": No such file or directory (retrying ...)
[2018-05-31T15:20:50.824] debug: Recovered 4 tres
[2018-05-31T15:20:50.825] debug: Recovered 3 users
[2018-05-31T15:20:50.825] debug: Recovered 0 resources
[2018-05-31T15:20:50.825] debug: Recovered 1 qos
[2018-05-31T15:20:50.825] debug: Recovered 8 associations
[2018-05-31T15:20:50.872] fatal: You are running with a database but for
some reason we have less TRES than should be here (4 < 5) and/or the
"billing" TRES is missing. This should only happen if the database is
down after an upgrade.
The first issue is that
debug: Munge encode failed: Failed to access "xxxxxx": No such file or
directory (retrying ...)
contains the password in clear text ("xxxxx"). This is doubly confusing:
"Failed to access" suggests the message was meant to show the database
name (StorageLoc) rather than the database password (StoragePass). If it
really is meant to use the password, I don't think it should be logged in
clear text, and (in my mind) the wording should be clearer.
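Until that is sorted out, one way to avoid leaking the value when pasting logs to a list is to mask it first. This is only a minimal sketch in Python: the helper name `redact` and the pattern are my own assumptions, built from the single log line above, and are not part of Slurm or Munge.

```python
import re

# Hypothetical pattern: matches the quoted value in the
# "Munge encode failed: Failed to access" message seen above.
MUNGE_LINE = re.compile(r'(Munge encode failed: Failed to access ")[^"]*(")')


def redact(line: str) -> str:
    """Replace the quoted value in a matching log line with a placeholder."""
    return MUNGE_LINE.sub(r'\1<redacted>\2', line)


sample = (
    '[2018-05-31T15:20:50.810] debug: Munge encode failed: Failed to access '
    '"xxxxxxxx": No such file or directory (retrying ...)'
)
print(redact(sample))
```

Lines that do not match the pattern pass through unchanged, so the helper can be run over a whole log excerpt.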
The second issue is that slurmctld.service won't start. The last error
shown above
fatal: You are running with a database but for some reason we have less
TRES than should be here (4 < 5) and/or the "billing" TRES is missing.
This should only happen if the database is down after an upgrade.
has a couple of hits on Google - an unanswered email from January
https://groups.google.com/d/msg/slurm-users/iZsSVlqQAyE/rKiSWihyEQAJ
and a bug report
https://bugs.schedmd.com/show_bug.cgi?id=4579
which seems to have solved a similar but slightly different problem. The
fix suggested in that bug report doesn't work for me: with MariaDB_server
5.2.x, my tres_table didn't have gres in it anyway.
+---------------+---------+------+----------------+------+
| creation_time | deleted | id   | type           | name |
+---------------+---------+------+----------------+------+
|    1527744028 |       0 |    1 | cpu            |      |
|    1527744028 |       0 |    2 | mem            |      |
|    1527744028 |       0 |    3 | energy         |      |
|    1527744028 |       0 |    4 | node           |      |
|    1527744028 |       0 |    5 | billing        |      |
|    1527744028 |       1 | 1000 | dynamic_offset |      |
+---------------+---------+------+----------------+------+
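As a sanity check, the table can be compared against what the fatal message complains about. The Python sketch below only mirrors the wording of that message; it is not how slurmctld performs the check. The set of five expected types is an assumption taken from the "(4 < 5)" in the error and the non-deleted rows above, and `missing_tres` is a hypothetical helper name.

```python
# Assumed set of base TRES types, inferred from the "(4 < 5)" in the
# fatal message and the rows of the table above.
EXPECTED_TRES = ("cpu", "mem", "energy", "node", "billing")

# (deleted, id, type) copied from the tres_table output above.
ROWS = [
    (0, 1, "cpu"),
    (0, 2, "mem"),
    (0, 3, "energy"),
    (0, 4, "node"),
    (0, 5, "billing"),
    (1, 1000, "dynamic_offset"),
]


def missing_tres(rows, expected=EXPECTED_TRES):
    """Return the expected TRES types with no non-deleted row in the table."""
    present = {tres_type for deleted, _id, tres_type in rows if not deleted}
    return [t for t in expected if t not in present]


print(missing_tres(ROWS))  # prints []
```

An empty list means all five types, including "billing", are present and not deleted in the table as shown.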
No idea what to try next. Any hints would be appreciated.
I'm running CentOS 7.5 and upgrading from 17.02.8. (I also dropped the
slurmdbd db and restarted it from empty when the fix from the bug report
didn't work.)