In case anyone saw my first post below, here's an update. I finally got around 
this by clearing the execute bit on the health check script (nhc) at boot. A 
pre-start check makes the slurmd service wait until all of the GPFS mounts are 
fully present, and only then is the execute bit re-added to nhc. This adds 
about 20 seconds to the boot process, but it works.
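
For anyone who wants the details, the setup looks roughly like this (the nhc 
path, unit drop-in, and mount list here are illustrative, not our exact files):

    # /etc/systemd/system/slurmd.service.d/wait-gpfs.conf -- drop-in for slurmd
    [Service]
    ExecStartPre=/usr/local/sbin/wait-for-gpfs.sh

    # /usr/local/sbin/wait-for-gpfs.sh
    #!/bin/bash
    # nhc had its execute bit cleared earlier in boot, so it can't drain the
    # node yet. Block slurmd until every GPFS mount is actually present.
    for m in /blues/gpfs/proj/0 /blues/gpfs/group/2; do
        until mountpoint -q "$m"; do
            sleep 5
        done
    done
    # GPFS is up; let the health check run again.
    chmod +x /usr/sbin/nhc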

The root of the issue is that nhc runs at boot before GPFS has finished 
mounting everything and drains the node (expected, since GPFS is a little slow 
to get everything in place). The drain requeues the job (also expected, 
apparently, since Slurm v17.11.4). When our jobs are requeued, though, they get 
stuck in the PD (Cleaning) state, and the only way to clear them is to kill the 
job. If I can figure out why a job can't properly requeue itself, I can drop 
this small startup hack.
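
For anyone who wants to point me at a knob I've missed, this is roughly how 
I've been checking the requeue side of things (commands are illustrative):

    # Requeue-related scheduler settings (JobRequeue, RequeueExit, etc.)
    scontrol show config | grep -i requeue
    # State and reason of a job stuck after the automatic requeue
    squeue -j <jobid> -o "%i %T %r"    # shows PENDING with reason Cleaning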

Thanks!
John 

On 10/10/18, 4:08 PM, "Roberts, John E." <jerobe...@anl.gov> wrote:

    Hi,
    
    Hopefully this isn't an obvious fix I'm missing. We have a large number of 
    KNL nodes that can get rebooted when users change their memory or cluster 
    modes. I never heard any complaints when running Slurm v16.05.10, but I've 
    seen a number of issues since our upgrade to v17.11.7 a couple of months 
    ago. Even a mode change on a single KNL node that requires a reboot will 
    almost surely send the job from a CF state to a PD state with reason 
    (Cleaning). None of our configuration changed, so I'm not sure where this 
    is coming from.
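
    For context, a mode change is requested through node features on the job 
    submission, along these lines (the feature combination and script name are 
    just an example):

        sbatch -p knlall -N 1 --constraint=knl,flat,quad job.sh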
    
    In the logs below, the order of operations seems to be: Slurm allocates the 
    node and passes along the requested state changes, and the node reboots 
    into the correct mode. The node is then put into a "failed state", likely 
    because of our nhc health checks? The node eventually reaches an idle state 
    once it is fully up, but the job remains in Cleaning. I can't even 
    forcefully resume the job; it has to be killed. If I then delete the job 
    and resubmit it immediately, it runs. So why is Slurm having trouble 
    getting from the PD Cleaning state to Running? Again, this wasn't 
    previously an issue.
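
    For what it's worth, the obvious nudges don't get it going again; roughly 
    (illustrative, using the job from the logs below):

        scontrol release 680790      # no effect, job stays PD (Cleaning)
        scontrol requeue 680790      # likewise
        scancel 680790               # only this clears it; an immediate resubmit then runs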
    
    Here is what I described above:
    
    The node is allocated:
    [2018-10-10T07:04:05.759] sched: Allocate JobID=680790 NodeList=knl-0008 
#CPUs=64 Partition=knlall
    
    The node reboots into the new configuration and then fails, presumably 
    because of the nhc health check? This is a normal failure, since GPFS 
    mounts can take some time to appear, especially if a large number of nodes 
    were just reconfigured. Our health check interval is 30 seconds, fyi (rough 
    slurm.conf lines are sketched after this excerpt):
    [2018-10-10T07:11:13.590] update_node: node knl-0008 reason set to: NHC: 
check_fs_mount:  /blues/gpfs/proj/0 not mounted; directory /blues/gpfs/proj/0 
missing (auto-fixed)
    [2018-10-10T07:11:13.590] update_node: DRAIN/FAIL request for node knl-0008 
which is allocated and being powered up. Requeueing jobs
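
    For reference, the health check is wired into slurm.conf roughly like this 
    (the nhc path is illustrative):

        HealthCheckProgram=/usr/sbin/nhc
        HealthCheckInterval=30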
    
    The job is requeued here; this is when it gets set to Cleaning and never 
    leaves that state:
    [2018-10-10T07:11:13.590] requeue job 680790 due to failure of node knl-0008
    [2018-10-10T07:11:13.590] Requeuing JobID=680790 State=0x0 NodeCnt=0
    [2018-10-10T07:11:13.590] update_node: node knl-0008 state set to DRAINED*
    [2018-10-10T07:11:14.274] Node knl-0008 rebooted 85 secs ago
    [2018-10-10T07:11:14.275] _update_node_avail_features: nodes knl-0008 
available features set to: knl,cache,hybrid,flat,auto,a2a,snc2,snc4,hemi,quad
    [2018-10-10T07:11:14.275] _update_node_active_features: nodes knl-0008 
active features set to: knl,cache,quad
    
    Node starts getting back to normal:
    [2018-10-10T07:11:14.275] Node knl-0008 now responding
    [2018-10-10T07:11:15.768] Job 680790 no longer waiting for node boot
    [2018-10-10T07:11:29.842] update_node: node knl-0008 reason set to: NHC: 
check_fs_mount:  /blues/gpfs/proj/0 not mounted
    [2018-10-10T07:11:29.842] update_node: node knl-0008 state set to DRAINED
    [2018-10-10T07:12:10.033] error: Nodes knl-0008 not responding
    [2018-10-10T07:12:29.137] update_node: node knl-0008 reason set to: NHC: 
check_fs_mount:  /blues/gpfs/group/2 not mounted
    [2018-10-10T07:12:29.137] update_node: node knl-0008 state set to DRAINED
    
    Node finally becomes idle now that it passed all of its checks:
    [2018-10-10T07:13:19.622] update_node: node knl-0008 state set to IDLE
    
    I have to kill and resubmit the job because it's stuck in Cleaning:
    [2018-10-10T07:14:05.702] _slurm_rpc_kill_job: REQUEST_KILL_JOB job 680790 
uid 4688
    [2018-10-10T07:14:05.703] _job_signal: of pending JobID=680790 State=0x4 
NodeCnt=0 successful
    
    New job runs successfully:
    [2018-10-10T07:14:10.259] sched: Allocate JobID=680795 NodeList=knl-0008 
#CPUs=64 Partition=knlall
    
    I appreciate any feedback. 
    
    Thanks!
    --
    John Roberts
    HPC Systems Administrator
    
    
