Hello,

We are running a SLURM-managed cluster with one control node (g-vm03) and 26 worker nodes (ouga[03-28]) on Rocky 8. We recently upgraded from 20.11.9 through 23.02.8 to 24.11.0 and then 24.11.5. Since then we have been experiencing performance issues: squeue and scontrol ping respond slowly and sometimes return "timeout on send/recv" messages, even with only very few parallel requests. We did not see these issues with SLURM 20.11.9; we did not examine the intermediate version 23.02.8 in detail. In the slurmctld log we also find messages like

slurmctld: error: slurm_send_node_msg: [socket:[1272743]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
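
For reference, "very few parallel requests" means something on the order of the following (an illustrative sketch of how the slowness can be reproduced, not a script we run in production):

    # launch a handful of concurrent status queries and time them
    for i in $(seq 1 5); do
        ( time squeue > /dev/null ) &
        ( time scontrol ping > /dev/null ) &
    done
    wait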

We therefore implemented all recommendations from the high-throughput documentation and did see improvements (most notably from increasing the maximum number of open files and raising MessageTimeout and TCPTimeout).
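
Concretely, the changes were along these lines; the MessageTimeout/TCPTimeout values are the ones in the attached slurm.conf, while the systemd drop-in path and the LimitNOFILE value shown here are only illustrative of how we raised the open-files limit:

    # /etc/systemd/system/slurmctld.service.d/limits.conf (illustrative drop-in)
    [Service]
    LimitNOFILE=262144

    # slurm.conf (values from the attached config)
    MessageTimeout=60
    TCPTimeout=30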

For debugging, I have attached our slurm.conf, the sdiag output (the server thread count is almost always 1 and occasionally rises to 2), and the slurmctld and slurmdbd logs from a period of high load.
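
The server thread count observation comes from repeatedly sampling sdiag while clients were slow, roughly along these lines (sketch only):

    # poll the controller every 10 seconds during a slow period
    while true; do
        date
        sdiag | grep -E 'Server thread count|Agent queue size'
        sleep 10
    done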

We would be very thankful for any input on how to restore the old performance.

Kind Regards,
Tilman Hoffbauer

#
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
# cluster master and backup 
#
ControlMachine=g-vm03
ControlAddr=g-vm03.cmm.in.tum.de

MailProg="/bin/mail"
MpiDefault="none"
MpiParams=ports=55000-60000
ProctrackType="proctrack/cgroup"

##
# Rebooting nodes
##
# return after reboot
ReturnToService=2
# reboot a node once it becomes idle
RebootProgram="/sbin/reboot"
# raise the timeout for rebooting to 10 mins
ResumeTimeout=600

##
# Health check
##
HealthCheckProgram=/etc/slurm/healthcheck.sh
# check every 10 mins
HealthCheckInterval=600

# authentication
AuthType=auth/munge
CryptoType=crypto/munge

# job control

##
# slurm run files and communication ports
##
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd/
SlurmUser=slurm
SrunPortRange=60001-63000

##
# slurm task switches
###
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup,task/affinity
#TaskPluginParam=

# send group IDs as listed on the job-spawning node
LaunchParameters=send_gids

#
#
# TIMERS
# increase timeout until 'Kill Task Failed' is thrown
UnkillableStepTimeout=300

# slurmctld connection management
MessageTimeout=60
TCPTimeout=30
SlurmctldParameters=cloud_dns,conmgr_max_connections=64,conmgr_threads=8

#
#
# SCHEDULING
EnforcePartLimits="yes"
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
DefMemPerCPU=4000

#
#
# LOGGING AND ACCOUNTING
ClusterName=ag_gagneur
AccountingStorageType="accounting_storage/slurmdbd"
AccountingStorageHost="g-vm03"
AccountingStoragePort="6819"
AccountingStorageTRES=gres/gpu
JobAcctGatherType=jobacct_gather/cgroup

###
# Submitting jobs
###
MinJobAge=300

###
# Slurm debugging info
###
SlurmctldDebug="error"
SlurmctldLogFile=/var/log/slurmd/slurmctl.log
SlurmdDebug="info"
SlurmdLogFile=/var/log/slurmd/slurmd.log
#DebugFlags="CPU_Bind,gres"

StateSaveLocation="/var/spool/slurmd/"

####
# Slurm Prolog and Epilog scripts
####
Epilog="/etc/slurm/epilog.d/*.sh"
TaskProlog="/etc/slurm/TaskProlog.sh"

####
# Types of Generic Resources
####
GresTypes=gpu,mps,tmp

PriorityType = priority/multifactor
PreemptMode = REQUEUE
PreemptType = preempt/partition_prio

PriorityWeightAge=100
PriorityWeightTRES=GRES/gpu=1000
PriorityWeightPartition=1000

# COMPUTE NODES
NodeName="ouga03"                       CPUs=64  RealMemory=256000      
CoresPerSocket=8        ThreadsPerCore=2 Weight=10          State=UNKNOWN       
Feature=sse2,sse4_1,sse4_2,avx
NodeName="ouga04"                       CPUs=80  RealMemory=512000      
CoresPerSocket=10       ThreadsPerCore=2 Weight=20          State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2
NodeName="ouga05"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=100         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:titanrtx:1
NodeName="ouga06"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=400         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:a6000:4
NodeName="ouga07"                       CPUs=256 RealMemory=1024000     
CoresPerSocket=64       ThreadsPerCore=2 Weight=200         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:a40:2
NodeName="ouga08"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=200         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:rtx3090:1
NodeName="ouga09"                       CPUs=32  RealMemory=61440       
CoresPerSocket=8        ThreadsPerCore=2 Weight=10          State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2
NodeName="ouga10"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=400         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:a40:4
NodeName="ouga11"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=400         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:a40:4
NodeName="ouga12"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=800         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:a40:8
NodeName="ouga13"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=800         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:a40:8
NodeName="ouga14"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=800         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:a40:8
NodeName="ouga15"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=400         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:a40:4
NodeName="ouga16"                       CPUs=256 RealMemory=1024000     
CoresPerSocket=64       ThreadsPerCore=2 Weight=400         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:a40:4
NodeName="ouga17"                       CPUs=256 RealMemory=1024000     
CoresPerSocket=64       ThreadsPerCore=2 Weight=400         State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:a40:4
NodeName="ouga18"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=30          State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2
NodeName="ouga19"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=30          State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2
NodeName="ouga20"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=30          State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2
NodeName="ouga21"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=30          State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2
NodeName="ouga22"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=30          State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2
NodeName="ouga23"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=30          State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2
NodeName="ouga24"                       CPUs=128 RealMemory=512000      
CoresPerSocket=64       ThreadsPerCore=2 Weight=30          State=UNKNOWN       
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2
NodeName="ouga25"                       CPUs=128 RealMemory=512000      
CoresPerSocket=32       ThreadsPerCore=2 Weight=2000    State=UNKNOWN   
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:l40s:4
NodeName="ouga26"                       CPUs=128 RealMemory=512000      
CoresPerSocket=32       ThreadsPerCore=2 Weight=2000    State=UNKNOWN   
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:l40s:4
NodeName="ouga27"                       CPUs=128 RealMemory=512000      
CoresPerSocket=32       ThreadsPerCore=2 Weight=2000    State=UNKNOWN   
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:l40s:4
NodeName="ouga28"                       CPUs=240 RealMemory=1536000     
CoresPerSocket=60       ThreadsPerCore=2 Weight=5000    State=UNKNOWN   
Feature=fma,sse2,sse4_1,sse4_2,avx,avx2 Gres=gpu:h200:8

# PARTITIONS 
PartitionName=lowprio Nodes="ouga[03-27]" Default="NO" PriorityTier=5 PreemptMode="REQUEUE" MaxTime="INFINITE" State="UP" OverSubscribe="NO" AllowGroups="cluster_access"
PartitionName=noninterruptive Nodes="ouga[03-10],ouga[12-14],ouga[24-26]" Default="NO" PriorityTier=10 PreemptMode="off" MaxTime="INFINITE" State="UP" OverSubscribe="NO" AllowGroups="cluster_access"
PartitionName=standard Nodes="ouga[03-27]" Default="YES" PriorityTier=10 PreemptMode="REQUEUE" MaxTime="INFINITE" State="UP" OverSubscribe="NO" AllowGroups="cluster_access"
PartitionName=urgent Nodes="ouga[03-28]" Default="NO" PriorityTier=20 PreemptMode="REQUEUE" MaxTime="1-0" State="UP" OverSubscribe="NO" AllowGroups="cluster_access"
PartitionName=highperformance Nodes="ouga[25-28]" Default="NO" PriorityTier=15 PreemptMode="REQUEUE" MaxTime="INFINITE" State="UP" OverSubscribe="NO" AllowGroups="cluster_access"
PartitionName=jupyterhub Nodes="ouga[05-08],ouga[10-11],ouga[14-22],ouga24" Default="NO" PriorityTier=30 PreemptMode="off" MaxTime="0-12" State="UP" OverSubscribe="NO" AllowGroups="cluster_access"

[2025-05-13T14:25:29.845] error: Unable to open pidfile `/run/slurmdbd.pid': Permission denied
[2025-05-13T14:25:29.852] accounting_storage/as_mysql: _check_mysql_concat_is_sane: MySQL server version is: 5.5.5-10.3.39-MariaDB
[2025-05-13T14:25:29.852] error: Database settings not recommended values: innodb_buffer_pool_size
[2025-05-13T14:25:29.891] slurmdbd version 24.11.5 started
[2025-05-13T14:25:32.558] Stack size set to 134217728
[2025-05-13T14:25:32.562] slurmctld version 24.11.5 started on cluster ag_gagneur(2175)
[2025-05-13T14:25:32.565] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
[2025-05-13T14:25:32.577] Recovered state of 26 nodes
[2025-05-13T14:25:32.583] Recovered JobId=17115039 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115039 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115050 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115050 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115051 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115051 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115052 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115052 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115054 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115054 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115057 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115057 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115059 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115059 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115062 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115062 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115070 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115070 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115073 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115073 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115078 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115078 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115082 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115082 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115083 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115083 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115084 StepId=batch
[2025-05-13T14:25:32.583] Recovered JobId=17115084 Assoc=0
[2025-05-13T14:25:32.583] Recovered JobId=17115086 StepId=batch
[2025-05-13T14:25:32.584] Recovered JobId=17115086 Assoc=0
[2025-05-13T14:25:32.584] Recovered JobId=17115088 StepId=batch
[2025-05-13T14:25:32.584] Recovered JobId=17115088 Assoc=0
[2025-05-13T14:25:32.584] Recovered JobId=17115100 StepId=batch
[2025-05-13T14:25:32.584] Recovered JobId=17115100 Assoc=0
[2025-05-13T14:25:32.584] Recovered JobId=17115101 StepId=batch
[2025-05-13T14:25:32.584] Recovered JobId=17115101 Assoc=0
[2025-05-13T14:25:32.584] Recovered JobId=17115103 StepId=batch
[2025-05-13T14:25:32.584] Recovered JobId=17115103 Assoc=0
[2025-05-13T14:25:32.584] Recovered JobId=17115104 StepId=batch
[2025-05-13T14:25:32.584] Recovered JobId=17115104 Assoc=0
[2025-05-13T14:25:32.584] Recovered JobId=17115105 StepId=batch
[2025-05-13T14:25:32.584] Recovered JobId=17115105 Assoc=0
[2025-05-13T14:25:32.584] Recovered information about 21 jobs
[2025-05-13T14:25:32.584] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 8 partitions
[2025-05-13T14:25:32.591] Recovered state of 0 reservations
[2025-05-13T14:25:32.591] read_slurm_conf: backup_controller not specified
[2025-05-13T14:25:32.591] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2025-05-13T14:25:32.591] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 8 partitions
[2025-05-13T14:25:32.591] Running as primary controller
[2025-05-13T14:31:24.114] _slurm_rpc_submit_batch_job: JobId=17115127 InitPrio=1000 usec=738
[2025-05-13T14:31:25.000] sched: Allocate JobId=17115127 NodeList=ouga03 #CPUs=32 Partition=urgent
[2025-05-13T14:39:11.263] _slurm_rpc_submit_batch_job: JobId=17115128 InitPrio=1000 usec=399
[2025-05-13T14:39:12.000] sched: Allocate JobId=17115128 NodeList=ouga04 #CPUs=50 Partition=urgent
[2025-05-13T14:41:34.763] Batch JobId=17115127 missing from batch node ouga03 (not found BatchStartTime after startup), Requeuing job
[2025-05-13T14:41:34.763] _job_complete: JobId=17115127 WTERMSIG 1
[2025-05-13T14:41:34.763] _job_complete: requeue JobId=17115127 due to node failure
[2025-05-13T14:41:34.765] _job_complete: JobId=17115127 done
[2025-05-13T14:41:36.249] Batch JobId=17115128 missing from batch node ouga04 (not found BatchStartTime after startup), Requeuing job
[2025-05-13T14:41:36.249] _job_complete: JobId=17115128 WTERMSIG 1
[2025-05-13T14:41:36.249] _job_complete: requeue JobId=17115128 due to node failure
[2025-05-13T14:41:36.249] _job_complete: JobId=17115128 done
[2025-05-13T14:41:38.197] error: slurm_send_node_msg: [socket:[1323098]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2025-05-13T14:41:38.763] error: slurm_send_node_msg: [socket:[1323107]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:41:40.513] error: slurm_send_node_msg: [socket:[1321736]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:41:46.407] error: slurm_send_node_msg: [socket:[1321748]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:41:48.450] error: slurm_send_node_msg: [socket:[1322421]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:41:49.143] error: slurm_send_node_msg: [socket:[1322430]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:41:54.463] error: slurm_send_node_msg: [socket:[1324106]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:41:56.225] error: slurm_send_node_msg: [socket:[1323180]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:41:56.370] error: slurm_send_node_msg: [socket:[1321791]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:41:56.694] error: slurm_send_node_msg: [socket:[1321796]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:02.768] error: slurm_send_node_msg: [socket:[1324165]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:03.512] error: slurm_send_node_msg: [socket:[1321817]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:04.012] error: slurm_send_node_msg: [socket:[1321823]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:06.012] Requeuing JobId=17115127
[2025-05-13T14:42:06.513] error: slurm_send_node_msg: [socket:[1323248]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:11.716] error: slurm_send_node_msg: [socket:[1323259]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:12.932] error: slurm_send_node_msg: [socket:[1322526]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:13.440] error: slurm_send_node_msg: [socket:[1324261]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:14.430] Requeuing JobId=17115128
[2025-05-13T14:42:15.621] error: slurm_send_node_msg: [socket:[1324298]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:20.767] error: slurm_send_node_msg: [socket:[1321860]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:22.012] error: slurm_send_node_msg: [socket:[1324338]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:22.512] error: slurm_send_node_msg: [socket:[1324432]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:23.899] error: slurm_send_node_msg: [socket:[1322695]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:31.922] error: slurm_send_node_msg: [socket:[1323485]] slurm_bufs_sendto(msg_type=RESPONSE_PARTITION_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:32.142] error: slurm_send_node_msg: [socket:[1324520]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:32.413] error: slurm_send_node_msg: [socket:[1325063]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:33.013] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=17115052 uid 9934
[2025-05-13T14:42:38.637] error: slurm_send_node_msg: [socket:[1325138]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:38.637] error: slurm_send_node_msg: [socket:[1325153]] slurm_bufs_sendto(msg_type=RESPONSE_PARTITION_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:39.695] error: slurm_send_node_msg: [socket:[1323592]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:40.704] error: slurm_send_node_msg: [socket:[1324594]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:40.957] error: slurm_send_node_msg: [socket:[1323614]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:46.265] _slurm_rpc_submit_batch_job: JobId=17115129 InitPrio=1000 usec=368
[2025-05-13T14:42:47.000] sched: Allocate JobId=17115129 NodeList=ouga22 #CPUs=32 Partition=jupyterhub
[2025-05-13T14:42:47.735] error: slurm_send_node_msg: [socket:[1323652]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:48.698] error: slurm_send_node_msg: [socket:[1324635]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:49.262] error: slurm_send_node_msg: [socket:[1323668]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:50.012] error: slurm_send_node_msg: [socket:[1322859]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:56.393] error: slurm_send_node_msg: [socket:[1322886]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:57.186] error: slurm_send_node_msg: [socket:[1322896]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:57.385] error: slurm_send_node_msg: [socket:[1323704]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:42:58.689] _update_job: setting comment to 59081 for JobId=17115129
[2025-05-13T14:42:58.689] _slurm_rpc_update_job: complete JobId=17115129 uid=9934 usec=146
[2025-05-13T14:42:59.406] error: slurm_send_node_msg: [socket:[1325301]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2025-05-13T14:43:05.669] error: slurm_send_node_msg: [socket:[1324704]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:43:05.929] error: slurm_send_node_msg: [socket:[1323767]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:43:06.365] error: slurm_send_node_msg: [socket:[1322985]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:43:13.768] error: slurm_send_node_msg: [socket:[1323835]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2025-05-13T14:43:14.513] error: slurm_send_node_msg: [socket:[1326106]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:43:15.012] error: slurm_send_node_msg: [socket:[1324744]] slurm_bufs_sendto(msg_type=RESPONSE_FED_INFO) failed: Unexpected missing socket error
[2025-05-13T14:43:17.263] error: slurm_send_node_msg: [socket:[1325575]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
*******************************************************
sdiag output at Tue May 13 14:26:28 2025 (1747139188)
Data since      Tue May 13 14:25:54 2025 (1747139154)
*******************************************************
Server thread count:  1
RPC queue enabled:    0
Agent queue size:     0
Agent count:          0
Agent thread count:   0
DBD Agent queue size: 0

Jobs submitted: 0
Jobs started:   0
Jobs completed: 0
Jobs canceled:  0
Jobs failed:    0

Job states ts:  Tue May 13 14:26:02 2025 (1747139162)
Jobs pending:   0
Jobs running:   21

Main schedule statistics (microseconds):
        Last cycle:   10
        Max cycle:    0
        Total cycles: 0
        Last queue length: 0

Main scheduler exit:
        End of job queue: 0
        Hit default_queue_depth: 0
        Hit sched_max_job_start: 0
        Blocked on licenses: 0
        Hit max_rpc_cnt: 0
        Timeout (max_sched_time): 0

Backfilling stats
        Total backfilled jobs (since last slurm start): 0
        Total backfilled jobs (since last stats cycle start): 0
        Total backfilled heterogeneous job components: 0
        Total cycles: 0
        Last cycle when: Tue May 13 13:53:57 2025 (1747137237)
        Last cycle: 0
        Max cycle:  0
        Last depth cycle: 0
        Last depth cycle (try sched): 0
        Last queue length: 0
        Last table size: 0

Backfill exit
        End of job queue: 0
        Hit bf_max_job_start: 0
        Hit bf_max_job_test: 0
        System state changed: 0
        Hit table size limit (bf_node_space_size): 0
        Timeout (bf_max_time): 0

Latency for 1000 calls to gettimeofday(): 22 microseconds

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO                  ( 2009) count:28     ave_time:130    total_time:3649
        REQUEST_JOB_INFO                        ( 2003) count:28     ave_time:211    total_time:5927
        REQUEST_STATS_INFO                      ( 2035) count:7      ave_time:52     total_time:364

Remote Procedure Call statistics by user
        hoti            (   31288) count:44     ave_time:182    total_time:8050
        root            (       0) count:11     ave_time:86     total_time:946
        slurm           (     440) count:6      ave_time:94     total_time:569
        hingerl         (   18390) count:2      ave_time:187    total_time:375

Pending RPC statistics
        No pending RPCs