Hello,

I have a node stuck in the "drain" state after a job that was running on it
finished. The slurmd log on the node reports this information:

[...]
[2025-09-07T11:09:26.980] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 59238
[2025-09-07T11:09:26.980] task/affinity: batch_bind: job 59238 CPU input mask for node: 0xFFF
[2025-09-07T11:09:26.980] task/affinity: batch_bind: job 59238 CPU final HW mask for node: 0xFFF
[2025-09-07T11:09:26.980] Launching batch job 59238 for UID 21310
[2025-09-07T11:09:27.006] cred/munge: init: Munge credential signature plugin loaded
[2025-09-07T11:09:27.007] [59238.batch] debug:  auth/munge: init: loaded
[2025-09-07T11:09:27.009] [59238.batch] debug:  Reading cgroup.conf file /soft/slurm-23.11.0/etc/cgroup.conf
[2025-09-07T11:09:27.025] [59238.batch] debug:  cgroup/v1: init: Cgroup v1 plugin loaded
[2025-09-07T11:09:27.025] [59238.batch] debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded
[2025-09-07T11:09:27.026] [59238.batch] debug:  task/cgroup: init: core enforcement enabled
[2025-09-07T11:09:27.026] [59238.batch] debug:  task/cgroup: init: device enforcement enabled
[2025-09-07T11:09:27.026] [59238.batch] debug:  task/cgroup: init: Tasks containment cgroup plugin loaded
[2025-09-07T11:09:27.026] [59238.batch] task/affinity: init: task affinity plugin loaded with CPU mask 0xfff
[2025-09-07T11:09:27.027] [59238.batch] debug:  jobacct_gather/cgroup: init: Job accounting gather cgroup plugin loaded
[2025-09-07T11:09:27.027] [59238.batch] topology/default: init: topology Default plugin loaded
[2025-09-07T11:09:27.030] [59238.batch] debug:  gpu/generic: init: init: GPU Generic plugin loaded
[2025-09-07T11:09:27.031] [59238.batch] debug:  laying out the 12 tasks on 1 hosts clus09 dist 2
[2025-09-07T11:09:27.031] [59238.batch] debug:  close_slurmd_conn: sending 0: No error
[2025-09-07T11:09:27.031] [59238.batch] debug:  Message thread started pid = 910040
[2025-09-07T11:09:27.031] [59238.batch] debug:  Setting slurmstepd(910040) oom_score_adj to -1000
[2025-09-07T11:09:27.031] [59238.batch] debug:  spank: opening plugin stack /soft/slurm-23.11.0/etc/plugstack.conf
[2025-09-07T11:09:27.031] [59238.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job abstract cores are '0-11'
[2025-09-07T11:09:27.031] [59238.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step abstract cores are '0-11'
[2025-09-07T11:09:27.031] [59238.batch] debug:  task/cgroup: task_cgroup_cpuset_create: job physical CPUs are '0-11'
[2025-09-07T11:09:27.031] [59238.batch] debug:  task/cgroup: task_cgroup_cpuset_create: step physical CPUs are '0-11'
[2025-09-07T11:09:27.090] [59238.batch] debug levels are stderr='error', logfile='debug', syslog='fatal'
[2025-09-07T11:09:27.090] [59238.batch] starting 1 tasks
[2025-09-07T11:09:27.090] [59238.batch] task 0 (910044) started 2025-09-07T11:09:27
[2025-09-07T11:09:27.098] [59238.batch] debug:  task/affinity: task_p_pre_launch: affinity StepId=59238.batch, task:0 bind:mask_cpu
[2025-09-07T11:09:27.098] [59238.batch] _set_limit: RLIMIT_NPROC  : reducing req:255366 to max:159631
[2025-09-07T11:09:27.398] [59238.batch] task 0 (910044) exited with exit code 2.
[2025-09-07T11:09:27.399] [59238.batch] debug:  task/affinity: task_p_post_term: affinity StepId=59238.batch, task 0
[2025-09-07T11:09:27.399] [59238.batch] debug:  signaling condition
[2025-09-07T11:09:27.399] [59238.batch] debug:  jobacct_gather/cgroup: fini: Job accounting gather cgroup plugin unloaded
[2025-09-07T11:09:27.400] [59238.batch] debug:  task/cgroup: fini: Tasks containment cgroup plugin unloaded
[2025-09-07T11:09:27.400] [59238.batch] job 59238 completed with slurm_rc = 0, job_rc = 512
[2025-09-07T11:09:27.410] [59238.batch] debug:  Message thread exited
[2025-09-07T11:09:27.410] [59238.batch] stepd_cleanup: done with step (rc[0x200]:Unknown error 512, cleanup_rc[0x0]:No error)
[2025-09-07T11:09:27.411] debug:  _rpc_terminate_job: uid = 1000 JobId=59238
[2025-09-07T11:09:27.411] debug:  credential for job 59238 revoked
[...]
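The batch task itself exited with code 2, which I assume is what produces the job_rc = 512 above. In case it is useful, I think the recorded exit code can also be checked from the accounting database with something like (hypothetical query, field names from sacct's --format option):

sacct -j 59238 --format=JobID,JobName,State,ExitCode,DerivedExitCode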



"sinfo" shows:

[root@login-node ~]# sinfo
PARTITION  TIMELIMIT  AVAIL  STATE      NODELIST        CPU_LOAD    NODES(A/I)  NODES(A/I/O/T)  CPUS  CPUS(A/I/O/T)  REASON
node.q*    4:00:00    up     drained    clus09          0.00        0/0         0/0/1/1         12    0/0/12/12      Kill task faile
node.q*    4:00:00    up     allocated  clus[10-11]     13.82-15.8  2/0         2/0/0/2         12    24/0/0/24      none
node.q*    4:00:00    up     idle       clus[01-06,12]  0.00        0/7         0/7/0/7         12    0/84/0/84      none
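
The REASON column is truncated by sinfo; I assume the full drain reason (and who/when set it) can be read with something like:

scontrol show node clus09 | grep -E 'State|Reason'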


But there does not seem to be any actual error on the node... and slurmctld.log
on the server looks fine, too.

Is there any way to reset the node to "state=idle" after errors like this?
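So far the only thing I can think of is clearing it by hand with something like the command below (clus09 being the affected node), but I am not sure whether that is the recommended way or whether something else should be checked first:

scontrol update NodeName=clus09 State=RESUME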

Thanks.
-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
