Hi Byron,
byron writes:
> Hi Loris - about a second
What is the use-case for that? Are these individual jobs or is it a job
array? Either way it sounds to me like a very bad idea. On our system,
jobs which can start immediately because resources are available, still
take a few seconds to start
On 7/28/22 18:49, Djamil Lakhdar-Hamina wrote:
I am helping set up a 16-node cluster computing system. I am not a
system-admin, but I work for a small firm and unfortunately have to pick
up needed skills fast in things I have little experience in. I am
running Rocky Linux 8 on Intel Xeon Knights
I have no experience with this, but based on my understanding of the docs, the
shutdown command should be something like "ssh ${node} systemctl poweroff", and
the resume command "ipmitool -I lan -H ${node}-bmc -U <user> -f password_file.txt
chassis power on".
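
To make the two commands above concrete, here is a minimal Python sketch that only builds the command lines for a given node, without running anything. It uses `systemctl poweroff` as the power-down verb, since systemctl has no `shutdown` verb. The function names, the `-bmc` hostname suffix default, and the `admin` BMC user are assumptions for illustration. Expanding the hostlist that Slurm passes to the suspend/resume programs (for example with `scontrol show hostnames`) is left out.

```python
import shlex


def suspend_command(node):
    """Build the command that powers a node down over SSH."""
    return ["ssh", node, "systemctl", "poweroff"]


def resume_command(node, bmc_user, password_file, bmc_suffix="-bmc"):
    """Build the ipmitool command that powers a node on through its BMC.

    bmc_user and bmc_suffix are site-specific assumptions; password_file
    is read by ipmitool via its -f option.
    """
    return [
        "ipmitool", "-I", "lan",
        "-H", f"{node}{bmc_suffix}",
        "-U", bmc_user,
        "-f", password_file,
        "chassis", "power", "on",
    ]


def render(cmd):
    """Render an argument list as a shell-safe string for logging."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    # Hypothetical node names; print the commands instead of running them.
    for node in ["node01", "node02"]:
        print(render(suspend_command(node)))
        print(render(resume_command(node, "admin", "password_file.txt")))
```

Printing the commands first is a cheap way to check the hostnames and options before handing the argument lists to `subprocess.run`.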
If you use libvirt for your virtual cluster, you c
I am helping set up a 16-node cluster computing system. I am not a
system-admin, but I work for a small firm and unfortunately have to pick up
needed skills fast in things I have little experience in. I am running
Rocky Linux 8 on Intel Xeon Knights Landing nodes donated by the TAAC
center. We are
Hi Loris - about a second
On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett
wrote:
> Hi Byron,
>
> byron writes:
>
> > Hi
> >
> > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
> > occasionally (3 times in 2 months) have slurmctld hanging so we get the
> > following message when running
Hi Byron,
We ran into this with 20.02, and mitigated it with some kernel tuning. From
our sysctl.conf:
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
# prevent neighbour (aka ARP) table overflow...
net.ipv4.neigh.default.gc_thresh1 = 3
net.ipv4.neigh.default.gc_thresh2 = 320
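
Before changing anything, it may help to compare the wanted values with what the kernel currently uses. The following is a rough Python sketch, not a tested admin tool: it parses sysctl.conf-style text and reads the matching files under /proc/sys. Only the two socket-queue settings from above are used as sample input, and the helper names are made up for illustration. Mapping dots to slashes is a simplification that breaks for keys containing interface names with dots.

```python
from pathlib import Path

# Sample input: the two socket-queue settings quoted above.
WANTED = """
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
"""


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and # or ; comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def proc_path(key, root="/proc/sys"):
    """Map a sysctl key to its file under /proc/sys (simplified)."""
    return Path(root) / key.replace(".", "/")


def read_live(key, root="/proc/sys"):
    """Return the current value as a string, or None if unreadable."""
    try:
        return proc_path(key, root).read_text().strip()
    except OSError:
        return None


def report(text, root="/proc/sys"):
    """Return (key, wanted, live) tuples for every setting in text."""
    return [(k, v, read_live(k, root)) for k, v in parse_sysctl(text).items()]


if __name__ == "__main__":
    for key, wanted, live in report(WANTED):
        print(f"{key}: wanted={wanted} live={live}")
```

A `live` value of `None` means the file could not be read, for instance on a non-Linux host.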
Hi Byron,
byron writes:
> Hi
>
> We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3
> times in 2 months) have slurmctld hanging so we get the following message
> when running sinfo
>
> “slurm_load_jobs error: Socket timed out on send/recv operation”
>
> It only seem
Hi
We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally
(3 times in 2 months) have slurmctld hanging so we get the following
message when running sinfo
“slurm_load_jobs error: Socket timed out on send/recv operation”
It only seems to happen when one of our users runs a job