Hi all,

I am hoping someone can help with our problem. Every hour after restarting 
slurmctld, the controller becomes unresponsive to commands for about 1 second 
and reports errors such as:

[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
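
In case it helps anyone trying to confirm the timing, here is a minimal 
sketch (not something we run in production) that counts these errors per 
minute and per message type. It assumes each entry is on a single line in 
slurmctld.log, in the format shown above; the function name and the sample 
list are only illustrative.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the slurm_send_node_msg errors shown above. Assumption: each log
# entry is on one line in slurmctld.log (the mail client wrapped them).
ERROR_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+\] error: "
    r"slurm_send_node_msg: .*?msg_type=(?P<msg_type>\w+)\) failed"
)


def summarize_errors(lines):
    """Count matching errors per (minute, msg_type); other lines are skipped."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.match(line)
        if match is None:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
        # Truncate to the minute so a burst of errors groups together.
        counts[(ts.replace(second=0), match.group("msg_type"))] += 1
    return counts


# Sample input copied from the errors quoted in this message.
SAMPLE = [
    "[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] "
    "slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error",
    "[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] "
    "slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error",
    "[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] "
    "slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error",
    "[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] "
    "slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error",
    "[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] "
    "slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error",
]

counts = summarize_errors(SAMPLE)
for (minute, msg_type), n in sorted(counts.items()):
    print(minute.strftime("%Y-%m-%d %H:%M"), msg_type, n)
```

To run it against a real log, pass an open file handle instead of SAMPLE, 
e.g. summarize_errors(open(path)) with path pointing at your own 
slurmctld.log. Comparing the resulting minutes with the controller restart 
time should show whether the bursts line up with the hour mark.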

It occurs consistently at around the hour mark, but generally not at other 
times, unless we run a reconfigure or restart the controller. We don't see any 
issues in slurmdbd.log, and the errors always involve RESPONSE message types. 
We have tried building a new server on different infrastructure, but the 
problem persisted. Yesterday we even updated Slurm to v24.05.1 in the hope 
that it would provide a fix. During our troubleshooting we have set:

  * SchedulerParameters = max_rpc_cnt=400,sched_min_interval=50000,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600
  * SlurmctldPort = 6808-6817

Although the statistics in sdiag have improved, we still see the errors.

Our monitoring software also shows a drop in network and disk activity during 
this 1 second, always approximately 1 hour after restarting the controller.

Many thanks in advance,

Jason

Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Centre
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com