The roadmap presentation from the SC20 Birds-of-a-Feather session is
online now:
https://slurm.schedmd.com/SC20/BoF.pdf
There is also a recording of the BoF, including the Q+A session with Tim
and Danny, that will remain available through the SC20 virtual platform
for the next few months.
This may be more "cargo cult", but I've advised users to add a "sleep 60" to
the end of their job scripts if they are I/O intensive. Sometimes they
are somehow able to generate I/O in a way that makes Slurm think the job is
finished while the OS is still catching up on the I/O, and then Slurm tries
to clean up processes that are still flushing, which can leave the node drained.
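A minimal sketch of that workaround, assuming a generic batch script (the job name, time limit, and application are placeholders, not from this thread):

#!/bin/bash
#SBATCH --job-name=io_heavy_job
#SBATCH --time=01:00:00

srun ./my_io_heavy_app   # placeholder for the actual workload
sleep 60                 # give the storage time to finish flushing before Slurm ends the job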
Sure, I've seen that in some of the posts here, e.g., a NAS. But in this
case it's an NFS share backed by local RAID10 storage. Aren't there any other
settings that deal with this so the node doesn't get drained?
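The knobs most directly related to this are the unkillable-step settings in slurm.conf; a hedged sketch, where the timeout value and script path are assumptions rather than anything from this thread:

UnkillableStepTimeout=180                                    # seconds to wait on stuck processes before the node is drained (example value)
UnkillableStepProgram=/usr/local/sbin/unkillable_notify.sh   # optional hook run when a step cannot be killed (hypothetical path)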
On Mon, Nov 30, 2020 at 1:02 PM Paul Edmon wrote:
That can help. Usually this happens due to laggy storage that the job is
using taking time to flush the job's data. So making sure that your
storage is up, responsive, and stable will also cut these down.
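One rough way to check for that kind of lag from a node (the mount point and sizes are placeholders):

time dd if=/dev/zero of=/path/to/nfs/mount/.latency_probe bs=1M count=64 oflag=sync   # time a small synchronous write
rm -f /path/to/nfs/mount/.latency_probe                                               # clean up the probe file

If that regularly takes many seconds, the storage itself is the thing to fix rather than any Slurm setting.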
-Paul Edmon-
On 11/30/2020 12:52 PM, Robert Kudyba wrote:
I've seen where this was a bug that was fixed
(https://bugs.schedmd.com/show_bug.cgi?id=3941), but this happens occasionally
still. A user cancels his/her job and a node gets drained.
UnkillableStepTimeout=120 is set in slurm.conf.
Slurm 20.02.3 on CentOS 7.9 running on Bright Cluster 8.2.
Hi;
Did you test the munge connection? If not, would you test it like this:
munge -n | ssh SRVGRIDSLURM02 unmunge
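A small sketch extending that check to every node at once (the host names here are just the ones mentioned in this thread; adjust to your node list):

for host in SRVGRIDSLURM01 SRVGRIDSLURM02; do      # loop over the nodes to test
    echo "== $host =="
    munge -n | ssh "$host" unmunge | grep STATUS   # anything other than Success suggests a key or clock problem
done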
Ahmet M.
On 30.11.2020 14:43, Steve Bland wrote:
You showed that firewalld is off, but that doesn't really prove on
CentOS 7/RHEL 7 that there is no firewall.
What is the output of
iptables -S
I'd also try doing
# scontrol show config | grep -i SlurmdPort
SlurmdPort = 6818
And whatever port is shown, from the compute nodes, try connecting to it.
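A hedged sketch of that connectivity test, using the port shown above and a host name from this thread (run it in whichever direction matches your setup):

nc -zv SRVGRIDSLURM01 6818   # -z only probes the port, -v prints whether the connection succeeded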
Although, in testing, even with ReturnToService set to '1', on a restart the
system sees the node has come back in the logs, but it is still classified as
down, so it will not take jobs until manually told otherwise.
[2020-11-30T10:33:05.402] debug2: node_did_resp SRVGRIDSLURM01
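For reference, the usual way to tell it otherwise by hand is scontrol; a sketch using the node name from the log line above:

scontrol update NodeName=SRVGRIDSLURM01 State=RESUME   # clear the DOWN/DRAIN state so the node can take jobs again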
Thanks Chris.
When I did that, they all came back.
Also found that in slurm.conf, ReturnToService was set to 0, so I modified that
for now. I may turn it back to 0 to see if any nodes are lost, but I assume that
will be in the log.
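For completeness, the slurm.conf change described above is a single line; a sketch:

ReturnToService=1   # nodes set DOWN for being non-responsive come back automatically once slurmd registers again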
Thanks Diego.
Actually, nothing at all in the hosts file; it did not seem to need modifying
for the nodes to see each other.
The different case on one of the nodes was an experiment to see if the names
were in fact case-sensitive,
but all networking functions between the nodes, with, say, munge, seem to work.