The roadmap presentation from the SC20 Birds-of-a-Feather session is
online now:
https://slurm.schedmd.com/SC20/BoF.pdf
There is also a recording of the BoF, including the Q+A session with Tim
and Danny, that will remain available through the SC20 virtual platform
for the next few months.
This may be more "cargo cult", but I've advised users to add a "sleep 60" to
the end of their job scripts if they are I/O intensive. Sometimes they
are somehow able to generate I/O in a way that makes Slurm think the job is
finished while the OS is still catching up on the I/O, and then Slurm tries
to clean up processes that are still flushing, which can leave the node drained.
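A minimal sketch of that workaround, assuming a generic batch script (the job name, time limit, and application are placeholders, not from this thread):

#!/bin/bash
#SBATCH --job-name=io_heavy_job
#SBATCH --time=01:00:00

srun ./my_io_heavy_app   # placeholder for the actual workload
sleep 60                 # give the storage time to finish flushing before Slurm ends the job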
Sure, I've seen that in some of the posts here, e.g., a NAS. But in this
case it's an NFS share backed by local RAID10 storage. Aren't there any other
settings that deal with this so the node doesn't get drained?
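The knobs most directly related to this are the unkillable-step settings in slurm.conf; a hedged sketch, where the timeout value and script path are assumptions rather than anything from this thread:

UnkillableStepTimeout=180                                    # seconds to wait on stuck processes before the node is drained (example value)
UnkillableStepProgram=/usr/local/sbin/unkillable_notify.sh   # optional hook run when a step cannot be killed (hypothetical path)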
On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon wrote:
That can help. Usually this happens due to laggy storage that the job is
using taking time to flush the job's data. So making sure that your
storage is up, responsive, and stable will also cut these down.
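One rough way to check for that kind of lag from a node (the mount point and sizes are placeholders):

time dd if=/dev/zero of=/path/to/nfs/mount/.latency_probe bs=1M count=64 oflag=sync   # time a small synchronous write
rm -f /path/to/nfs/mount/.latency_probe                                               # clean up the probe file

If that regularly takes many seconds, the storage itself is the thing to fix rather than any Slurm setting.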
-Paul Edmon-
On 11/30/2020 12:52 PM, Robert Kudyba wrote:
I've seen where this was a bug that was fixed
(https://bugs.schedmd.com/show_bug.cgi?id=3941), but this happens occasionally
still. A user cancels his/her job and a node gets drained.
UnkillableStepTimeout=120 is set in slurm.conf.
Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2.
Hi;
Did you test the munge connection? If not, would you test it like this:
munge -n | ssh SRVGRIDSLURM02 unmunge
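A small sketch extending that check to every node at once (the host names here are just the ones mentioned in this thread; adjust to your node list):

for host in SRVGRIDSLURM01 SRVGRIDSLURM02; do      # loop over the nodes to test
    echo "== $host =="
    munge -n | ssh "$host" unmunge | grep STATUS   # anything other than Success suggests a key or clock problem
done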
Ahmet M.
On 30.11.2020 14:43, Steve Bland wrote:
You showed that firewalld is off, but that doesn't really prove on
CentOS 7/RHEL 7 that there is no firewall.
What is the output of
iptables -S
I'd also try doing
# scontrol show config | grep -i SlurmdPort
SlurmdPort = 6818
And whatever port is shown, from the compute nodes, try connecting to it.
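A hedged sketch of that connectivity test, using the port shown above and a host name from this thread (run it in whichever direction matches your setup):

nc -zv SRVGRIDSLURM01 6818   # -z only probes the port, -v prints whether the connection succeeded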
Although, in testing, even with ReturnToService set to '1', on a restart the
system sees the node has come back in the logs, but it is still classified as
down, so it will not take jobs until manually told otherwise.
[2020-11-30T10:33:05.402] debug2: node_did_resp SRVGRIDSLURM01
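For reference, the usual way to tell it otherwise by hand is scontrol; a sketch using the node name from the log line above:

scontrol update NodeName=SRVGRIDSLURM01 State=RESUME   # clear the DOWN/DRAIN state so the node can take jobs again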
Thanks Chris.
When I did that, they all came back.
Also found that in slurm.conf, ReturnToService was set to 0, so I modified that
for now. I may turn it back to 0 to see if any nodes are lost, but I assume that
will be in the log.
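For completeness, the slurm.conf change described above is a single line; a sketch:

ReturnToService=1   # nodes set DOWN for being non-responsive come back automatically once slurmd registers again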
Thanks Diego.
Actually, nothing at all in the hosts file; it did not seem to need modifying
for the nodes to see each other.
The different case on one of the nodes was an experiment to see if the names
were in fact case-sensitive,
but all networking functions between the nodes, with, say, munge, seem to work.