I am currently facing an issue with an old install of Slurm (17.02.11).
I cannot upgrade this version because I had trouble with the database
migration in the past (when upgrading to 17.11), and this install is set
to be replaced in the coming months. For the time being I have to keep
it running because some of our services still rely on it.
This issue occurred after a power outage.
slurmctld is up and running; however, when I run "sinfo", I get this
message after a few minutes:
slurm_load_partitions: Unable to contact slurm controller (connect failure)
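
A "connect failure" can be narrowed down by testing whether anything accepts TCP connections on the controller port. Below is a minimal sketch of such a check, not a Slurm tool. The host name `slurm-master` is hypothetical, and the port assumes the default SlurmctldPort of 6817 (adjust it if slurm.conf sets another value).

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # "slurm-master" is a hypothetical host name; 6817 is the default
    # SlurmctldPort and is an assumption about this install.
    print(can_connect("slurm-master", 6817))
```

If this returns False while the slurmctld process is running, the daemon is probably not listening yet (or a firewall is in the way). If it returns True, the connection is accepted and the problem is more likely in how the request is handled.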
I set SlurmctldDebug=7 in slurm.conf and DebugLevel=7 in slurmdbd.conf,
but the logs do not show any specific error that would prevent the
slurm controller from working.
Any help would be greatly appreciated.
/var/log/slurm-llnl/slurmctld.log:
[2022-07-19T15:17:58.342] debug3: Version in assoc_usage header is 7936
[2022-07-19T15:17:58.345] debug3: Version in qos_usage header is 7936
[2022-07-19T15:17:58.345] debug: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
[2022-07-19T15:17:58.347] debug: Ignoring obsolete SchedulerPort option.
[2022-07-19T15:17:58.347] debug3: layouts: layouts_init()...
[2022-07-19T15:17:58.347] layouts: no layout to initialize
[2022-07-19T15:17:58.347] debug3: Trying to load plugin /usr/local/lib/slurm/topology_none.so
[2022-07-19T15:17:58.347] topology NONE plugin loaded
[2022-07-19T15:17:58.347] debug3: Success.
[2022-07-19T15:17:58.348] debug: No DownNodes
[2022-07-19T15:17:58.348] debug3: Version in last_conf_lite header is 7936
[2022-07-19T15:17:58.349] debug3: Trying to load plugin /usr/local/lib/slurm/jobcomp_none.so
[2022-07-19T15:17:58.349] debug3: Success.
[2022-07-19T15:17:58.349] debug3: Trying to load plugin /usr/local/lib/slurm/sched_backfill.so
[2022-07-19T15:17:58.349] sched: Backfill scheduler plugin loaded
[2022-07-19T15:17:58.349] debug3: Success.
[2022-07-19T15:17:58.350] debug3: Trying to load plugin /usr/local/lib/slurm/route_default.so
[2022-07-19T15:17:58.350] route default plugin loaded
[2022-07-19T15:17:58.350] debug3: Success.
[2022-07-19T15:17:58.355] layouts: loading entities/relations information
[2022-07-19T15:17:58.355] debug3: layouts: loading node node0
[2022-07-19T15:17:58.356] debug3: layouts: loading node node1
[2022-07-19T15:17:58.356] debug3: layouts: loading node node2
[2022-07-19T15:17:58.356] debug3: layouts: loading node node3
[2022-07-19T15:17:58.356] debug3: layouts: loading node node4
[2022-07-19T15:17:58.356] debug3: layouts: loading node node5
[2022-07-19T15:17:58.356] debug3: layouts: loading node node6
[2022-07-19T15:17:58.356] debug3: layouts: loading node node7
[2022-07-19T15:17:58.356] debug3: layouts: loading node node8
[2022-07-19T15:17:58.356] debug3: layouts: loading node node9
[2022-07-19T15:17:58.356] debug3: layouts: loading node node10
[2022-07-19T15:17:58.356] debug3: layouts: loading node node11
[2022-07-19T15:17:58.356] debug3: layouts: loading node node12
[2022-07-19T15:17:58.356] debug3: layouts: loading node node13
[2022-07-19T15:17:58.356] debug3: layouts: loading node node14
[2022-07-19T15:17:58.356] debug3: layouts: loading node node15
[2022-07-19T15:17:58.356] debug3: layouts: loading node node16
[2022-07-19T15:17:58.356] debug3: layouts: loading node node17
[2022-07-19T15:17:58.356] debug3: layouts: loading node node18
[2022-07-19T15:17:58.356] debug3: layouts: loading node node19
[2022-07-19T15:17:58.356] debug3: layouts: loading node node20
[2022-07-19T15:17:58.356] debug3: layouts: loading node node21
[2022-07-19T15:17:58.356] debug3: layouts: loading node node22
[2022-07-19T15:17:58.356] debug3: layouts: loading node node23
[2022-07-19T15:17:58.356] debug3: layouts: loading node node24
[2022-07-19T15:17:58.356] debug3: layouts: loading node node25
[2022-07-19T15:17:58.356] debug3: layouts: loading node node26
[2022-07-19T15:17:58.356] debug3: layouts: loading node node27
[2022-07-19T15:17:58.356] debug3: layouts: loading node node28
[2022-07-19T15:17:58.356] debug3: layouts: loading node node29
[2022-07-19T15:17:58.356] debug3: layouts: loading node node30
[2022-07-19T15:17:58.356] debug3: layouts: loading node node31
[2022-07-19T15:17:58.356] debug3: layouts: loading node node42
[2022-07-19T15:17:58.356] debug3: layouts: loading node node43
[2022-07-19T15:17:58.356] debug3: layouts: loading node node44
[2022-07-19T15:17:58.356] debug3: layouts: loading node node45
[2022-07-19T15:17:58.356] debug3: layouts: loading node node46
[2022-07-19T15:17:58.356] debug3: layouts: loading node node47
[2022-07-19T15:17:58.356] debug3: layouts: loading node node49
[2022-07-19T15:17:58.356] debug3: layouts: loading node node50
[2022-07-19T15:17:58.356] debug3: layouts: loading node node51
[2022-07-19T15:17:58.356] debug3: layouts: loading node node52
[2022-07-19T15:17:58.356] debug3: layouts: loading node node53
[2022-07-19T15:17:58.356] debug3: layouts: loading node node54
[2022-07-19T15:17:58.356] debug3: layouts: loading node node55
[2022-07-19T15:17:58.356] debug3: layouts: loading node node56
[2022-07-19T15:17:58.356] debug3: layouts: loading node node60
[2022-07-19T15:17:58.356] debug3: layouts: loading node node61
[2022-07-19T15:17:58.356] debug3: layouts: loading node node62
[2022-07-19T15:17:58.356] debug3: layouts: loading node node63
[2022-07-19T15:17:58.356] debug3: layouts: loading node node64
[2022-07-19T15:17:58.356] debug3: layouts: loading node node65
[2022-07-19T15:17:58.356] debug3: layouts: loading node node66
[2022-07-19T15:17:58.356] debug3: layouts: loading node node67
[2022-07-19T15:17:58.356] debug3: layouts: loading node node68
[2022-07-19T15:17:58.356] debug3: layouts: loading node node73
[2022-07-19T15:17:58.356] debug3: layouts: loading node node74
[2022-07-19T15:17:58.356] debug3: layouts: loading node node75
[2022-07-19T15:17:58.356] debug3: layouts: loading node node76
[2022-07-19T15:17:58.356] debug3: layouts: loading node node77
[2022-07-19T15:17:58.356] debug3: layouts: loading node node78
[2022-07-19T15:17:58.356] debug3: layouts: loading node node100
[2022-07-19T15:17:58.356] debug3: layouts: loading node node101
[2022-07-19T15:17:58.356] debug3: layouts: loading node node102
[2022-07-19T15:17:58.356] debug3: layouts: loading node node103
[2022-07-19T15:17:58.356] debug3: layouts: loading node node104
[2022-07-19T15:17:58.356] debug3: layouts: loading node node105
[2022-07-19T15:17:58.356] debug3: layouts: loading node node106
[2022-07-19T15:17:58.356] debug3: layouts: loading node node107
[2022-07-19T15:17:58.356] debug3: layouts: loading node node108
[2022-07-19T15:17:58.356] debug3: layouts: loading node node109
[2022-07-19T15:17:58.356] debug: layouts: 71/71 nodes in hash table, rc=0
[2022-07-19T15:17:58.356] debug: layouts: loading stage 1
[2022-07-19T15:17:58.356] debug: layouts: loading stage 1.1 (restore state)
[2022-07-19T15:17:58.356] debug: layouts: loading stage 2
[2022-07-19T15:17:58.356] debug: layouts: loading stage 3
[2022-07-19T15:17:58.356] error: Node state file /var/lib/slurm-llnl/slurmctld/node_state too small
[2022-07-19T15:17:58.356] error: NOTE: Trying backup state save file. Information may be lost!
[2022-07-19T15:17:58.356] debug3: Version string in node_state header is PROTOCOL_VERSION
[2022-07-19T15:17:58.357] Recovered state of 71 nodes
[2022-07-19T15:17:58.357] error: Job state file /var/lib/slurm-llnl/slurmctld/job_state too small
[2022-07-19T15:17:58.357] error: NOTE: Trying backup state save file. Jobs may be lost!
[2022-07-19T15:17:58.357] error: Incomplete job state save file
[2022-07-19T15:17:58.357] Recovered information about 0 jobs
[2022-07-19T15:17:58.357] cons_res: select_p_node_init
[2022-07-19T15:17:58.357] cons_res: preparing for 7 partitions
[2022-07-19T15:17:58.357] debug: Ports available for reservation 10000-30000
[2022-07-19T15:17:58.359] debug2: init_requeue_policy: kill_invalid_depend is set to 0
[2022-07-19T15:17:58.359] debug: Updating partition uid access list
[2022-07-19T15:17:58.359] debug3: Version string in resv_state header is PROTOCOL_VERSION
[2022-07-19T15:17:58.359] Recovered state of 0 reservations
[2022-07-19T15:17:58.359] State of 0 triggers recovered
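
At debug3 verbosity the few error-level entries are easy to miss among the plugin and layout messages. The sketch below is a small filter, not part of Slurm, that keeps only `error:` and `fatal:` entries. It assumes each entry sits on a single line in the form `[timestamp] level: message`, as in the log file itself (the mail wrapping above splits some of them). The helper name `error_lines` is made up for illustration.

```python
import re

# Matches "[timestamp] error: message" or "[timestamp] fatal: message".
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>error|fatal):\s+(?P<msg>.*)$"
)


def error_lines(lines):
    """Return (timestamp, level, message) tuples for error/fatal entries."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            found.append(
                (match.group("ts"), match.group("level"), match.group("msg"))
            )
    return found


if __name__ == "__main__":
    # Path taken from the post; reading it usually requires suitable permissions.
    with open("/var/log/slurm-llnl/slurmctld.log", encoding="utf-8", errors="replace") as fh:
        for ts, level, msg in error_lines(fh):
            print(ts, level, msg)
```

Applied to the excerpt above, this would leave only the entries about the node_state and job_state files being too small, the fallback to the backup state save files, and the incomplete job state save file.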