Sure, I've seen that in some of the posts here, e.g., with a NAS. But in this case it's an NFS share on local RAID10 storage. Aren't there any other settings that deal with this so the node isn't drained?
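For reference, this is the sort of change I'm considering in slurm.conf. The 180-second value is just a guess on my part, and the script path is a placeholder for a diagnostic script we'd still have to write:

UnkillableStepTimeout=180
UnkillableStepProgram=/usr/local/sbin/unkillable_step_diag.sh

followed by restarting slurmd on the nodes (or scontrol reconfigure, if that's enough to pick these up).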
On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon <ped...@cfa.harvard.edu> wrote:
> That can help. Usually this happens due to laggy storage the job is
> using taking time flushing the job's data. So making sure that your
> storage is up, responsive, and stable will also cut these down.
>
> -Paul Edmon-
>
> On 11/30/2020 12:52 PM, Robert Kudyba wrote:
> > I've seen where this was a bug that was fixed in
> > https://bugs.schedmd.com/show_bug.cgi?id=3941 but this happens
> > occasionally still. A user cancels his/her job and a node gets
> > drained. UnkillableStepTimeout=120 is set in slurm.conf.
> >
> > Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2
> >
> > Slurm Job_id=6908 Name=run.sh Ended, Run time 7-17:50:36, CANCELLED,
> > ExitCode 0
> > Resending TERMINATE_JOB request JobId=6908 Nodelist=node001
> > update_node: node node001 reason set to: Kill task failed
> > update_node: node node001 state set to DRAINING
> > error: slurmd error running JobId=6908 on node(s)=node001: Kill task
> > failed
> >
> > update_node: node node001 reason set to: hung
> > update_node: node node001 state set to DOWN
> > update_node: node node001 state set to IDLE
> > error: Nodes node001 not responding
> >
> > scontrol show config | grep kill
> > UnkillableStepProgram = (null)
> > UnkillableStepTimeout = 120 sec
> >
> > Do we just increase the timeout value?
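P.S. Since UnkillableStepProgram is (null) on our setup, I'm also thinking of pointing it at something like the sketch below, just so we capture what's hung when this fires. This is untested, and the log path and the assumption that the culprit is D-state processes stuck on the NFS mount are mine:

#!/bin/bash
# Sketch of a diagnostic script for UnkillableStepProgram (untested).
# Intent: when slurmd decides a step is unkillable, record which processes
# are stuck in uninterruptible sleep and the state of the NFS client.
LOG=/var/log/slurm/unkillable-$(hostname)-$(date +%s).log
{
  date
  echo "== processes in uninterruptible sleep (D state) =="
  ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'
  echo "== NFS mounts and client RPC stats =="
  grep nfs /proc/mounts
  nfsstat -c 2>/dev/null
} >> "$LOG" 2>&1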