Hello all.
I'm speechless.
I suspended testing config changes to update another machine. In the
last test I had added "CPUs=192" to the node definition and restarted
slurmctld, and nothing changed.
When I returned, I checked again and Slurm reported 192 CPUs! Magic?
I have now removed CPUs=192 and restarted slurmctld, and it still sees all the CPUs...
What should I think?
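
For anyone wanting to sanity-check a node definition by hand, here is a minimal sketch. It assumes the documented slurm.conf behaviour that, when CPUs is omitted, the count defaults to the product of Boards, Sockets, CoresPerSocket and ThreadsPerCore. The node line below is purely hypothetical (it is not the real definition from this cluster), and the helper names are made up for illustration. Key matching is case-sensitive here, which is a simplification.

```python
def parse_node_line(line):
    """Split a slurm.conf-style 'Key=Value ...' node line into a dict."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not Key=Value pairs
            fields[key] = value
    return fields


def expected_cpus(fields):
    """Product of the topology values; a missing value counts as 1."""
    total = 1
    for key in ("Boards", "Sockets", "CoresPerSocket", "ThreadsPerCore"):
        total *= int(fields.get(key, 1))
    return total


# Hypothetical node line, NOT taken from the real cluster configuration.
sample = "NodeName=examplenode Sockets=2 CoresPerSocket=48 ThreadsPerCore=2 CPUs=192"

fields = parse_node_line(sample)
print("topology product:", expected_cpus(fields))
print("explicit CPUs   :", fields.get("CPUs"))
```

If the two numbers disagree for a real node line, that mismatch is the first thing to look at. Comparing against the line printed by `slurmd -C` on the node itself shows what the hardware detection actually found.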
But another problem surfaces: slurmtop does not seem to handle so many
CPUs gracefully and throws a lot of errors, but that should be
manageable...
Thanks for the help.
BYtE,
Diego
On 21/07/2021 11:01, Diego Zuccato wrote:
Uff... A bit mangled... Correcting and resending.
On 21/07/2021 08:18, Diego Zuccato wrote:
On 20/07/2021 18:02, mercan wrote:
Hi Ahmet.
Did you check the slurmctld log for a complaint about the host line? If
slurmctld cannot recognize a parameter, maybe it gives up processing
the whole host line.
Yup. Nothing there :(
[2021-07-21T08:13:14.984] slurmctld version 18.08.5-2 started on cluster oph
[2021-07-21T08:13:16.990] error: _shutdown_bu_thread:send/recv str957-cluster2: Connection timed out
[2021-07-21T08:13:17.809] layouts: no layout to initialize
[2021-07-21T08:13:17.828] error: read_slurm_conf: default partition not set.
[2021-07-21T08:13:17.829] layouts: loading entities/relations information
[2021-07-21T08:13:17.829] Recovered state of 34 nodes
[2021-07-21T08:13:17.829] Down nodes: str957-mtx-[21-22]
[2021-07-21T08:13:17.829] Recovered JobId=33656 Assoc=377
[...cut...]
[2021-07-21T08:13:17.831] Recovered information about 45 jobs
[2021-07-21T08:13:17.831] cons_res: select_p_node_init
[2021-07-21T08:13:17.831] cons_res: preparing for 8 partitions
[2021-07-21T08:13:17.832] Recovered state of 0 reservations
[2021-07-21T08:13:17.833] cons_res: select_p_reconfigure
[2021-07-21T08:13:17.833] cons_res: select_p_node_init
[2021-07-21T08:13:17.833] cons_res: preparing for 8 partitions
[2021-07-21T08:13:17.833] Running as primary controller
[2021-07-21T08:13:17.833] Registering slurmctld at port 6817 with slurmdbd.
[2021-07-21T08:13:18.220] No parameter for mcs plugin, default values set
[2021-07-21T08:13:18.220] mcs: MCSParameters = (null). ondemand set.
[2021-07-21T08:13:23.226] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2021-07-21T08:13:23.226] _build_node_list: No nodes satisfy JobId=33762 requirements in partition b6
[2021-07-21T08:13:23.227] _build_node_list: No nodes satisfy JobId=33808 requirements in partition b4
(str957-cluster2 is the second frontend/login node that I've had to
take offline for an unrelated problem).
And str957-mtx-[21-22] are not yet installed.
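
In case it helps when scanning a longer slurmctld log for complaints, here is a rough sketch that pulls out only the "error:" entries. It assumes every entry starts with a bracketed timestamp, as in the excerpt above, and treats any line without one as a continuation of the previous entry (which is how the excerpt looks after a mail client has wrapped it). The function name is made up for illustration, and the sample text is copied from the excerpt above.

```python
import re

# One log entry starts with "[timestamp]" followed by the message.
LOG_LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s*(?P<msg>.*)$")


def error_entries(log_text):
    """Return (timestamp, message) pairs for entries starting with 'error:'."""
    entries = []
    current = None
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw)
        if match:
            current = [match.group("ts"), match.group("msg")]
            entries.append(current)
        elif current is not None and raw.strip():
            # No timestamp: treat as a wrapped continuation of the last entry.
            current[1] = (current[1] + " " + raw.strip()).strip()
    return [(ts, msg) for ts, msg in entries if msg.startswith("error:")]


sample_log = """\
[2021-07-21T08:13:16.990] error: _shutdown_bu_thread:send/recv
str957-cluster2: Connection timed out
[2021-07-21T08:13:17.809] layouts: no layout to initialize
[2021-07-21T08:13:17.828] error: read_slurm_conf: default partition
not set.
"""

for ts, msg in error_entries(sample_log):
    print(ts, msg)
```

On the excerpt above this leaves only the backup-controller timeout and the missing default partition, neither of which mentions the node line.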
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786