Re: [slurm-users] slurmctld hanging

2022-07-28 Thread Loris Bennett
Hi Byron,

byron writes:
> Hi Loris - about a second

What is the use case for that? Are these individual jobs or is it a job array? Either way, it sounds to me like a very bad idea. On our system, jobs which can start immediately because resources are available still take a few seconds to start

Re: [slurm-users] Power saving

2022-07-28 Thread Benson Muite
On 7/28/22 18:49, Djamil Lakhdar-Hamina wrote: I am helping set up a 16 node cluster computing system, I am not a system-admin but I work for a small firm and unfortunately have to pick up needed skills fast in things I have little experience in. I am running Rocky Linux 8 on Intel Xeon Knights

Re: [slurm-users] (no subject)

2022-07-28 Thread GRANGER Nicolas
I have no experience with this, but based on my understanding of the doc, the shutdown command should be something like "ssh ${node} systemctl poweroff", and the resume "ipmitool -I lan -H ${node}-bmc -U -f password_file.txt chassis power on". If you use libvirt for your virtual cluster, you c
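The suggestion above can be sketched as a small helper that builds the two commands per node. This is only an illustration under stated assumptions: the function names and the `user` argument are hypothetical (the user name is missing from the message as archived), the `-bmc` host suffix and `password_file.txt` come from the message, and `systemctl poweroff` is used as the systemd verb for powering a machine off. Slurm hands its suspend and resume programs a hostlist expression, which `scontrol show hostnames` expands into individual names.

```python
import subprocess
import sys


def expand_hostlist(hostlist):
    """Expand a Slurm hostlist such as 'node[01-04]' into individual node names."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def suspend_command(node):
    """Command that powers a node off over ssh."""
    return ["ssh", node, "systemctl", "poweroff"]


def resume_command(node, user, password_file):
    """Command that powers a node on through its BMC.

    Assumes the BMC is reachable as '<node>-bmc'; 'user' is a placeholder.
    """
    return [
        "ipmitool", "-I", "lan", "-H", f"{node}-bmc",
        "-U", user, "-f", password_file,
        "chassis", "power", "on",
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # Usage (hypothetical): script.py suspend|resume 'node[01-04]'
    action, hostlist = sys.argv[1], sys.argv[2]
    for node in expand_hostlist(hostlist):
        if action == "suspend":
            cmd = suspend_command(node)
        else:
            cmd = resume_command(node, "admin", "password_file.txt")
        print(" ".join(cmd))
```

Printing the commands first makes it easy to check them by hand before wiring anything into SuspendProgram and ResumeProgram in slurm.conf.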

Re: [slurm-users] (no subject)

2022-07-28 Thread Benson Muite
On 7/28/22 18:49, Djamil Lakhdar-Hamina wrote: I am helping set up a 16 node cluster computing system, I am not a system-admin but I work for a small firm and unfortunately have to pick up needed skills fast in things I have little experience in. I am running Rocky Linux 8 on Intel Xeon Knights

[slurm-users] (no subject)

2022-07-28 Thread Djamil Lakhdar-Hamina
I am helping set up a 16-node cluster computing system. I am not a system admin, but I work for a small firm and unfortunately have to pick up needed skills fast in things I have little experience in. I am running Rocky Linux 8 on Intel Xeon Knights Landing nodes donated by the TAAC center. We are

Re: [slurm-users] slurmctld hanging

2022-07-28 Thread byron
Hi Loris - about a second

On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett wrote:
> Hi Byron,
>
> byron writes:
>
> > Hi
> >
> > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3 times in 2 months) have slurmctld hanging so we get the following message when running

Re: [slurm-users] slurmctld hanging

2022-07-28 Thread Fulcomer, Samuel
Hi Byron,

We ran into this with 20.02, and mitigated it with some kernel tuning. From our sysctl.conf:

net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
# prevent neighbour (aka ARP) table overflow...
net.ipv4.neigh.default.gc_thresh1 = 3
net.ipv4.neigh.default.gc_thresh2 = 320
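Before changing anything, it can help to see how the running kernel compares with the values quoted above. The following is a rough sketch, not part of the original message: the function names are made up for illustration, and only the two settings whose values appear complete in the archive are included (the gc_thresh lines are cut off and are left out). It relies on the usual mapping of a sysctl key to a file under /proc/sys, which holds for these keys but not for keys that embed dotted interface names.

```python
def parse_sysctl(text):
    """Parse sysctl.conf-style text into {key: value}, skipping comments and blanks."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a 'key = value' line
        settings[key.strip()] = value.strip()
    return settings


def proc_path(key):
    """Map a sysctl key to its /proc/sys file (simple dot-to-slash mapping)."""
    return "/proc/sys/" + key.replace(".", "/")


def compare(settings):
    """Return {key: (current_value_or_None, wanted_value)} for each setting."""
    report = {}
    for key, wanted in settings.items():
        try:
            with open(proc_path(key)) as handle:
                current = handle.read().strip()
        except OSError:
            current = None  # key not present or not readable on this system
        report[key] = (current, wanted)
    return report


# Only the settings quoted with complete values in the message.
WANTED = """
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
"""

if __name__ == "__main__":
    for key, (current, wanted) in compare(parse_sysctl(WANTED)).items():
        print(f"{key}: current={current} wanted={wanted}")
```

The script only reads values; applying them is still done by editing sysctl.conf and reloading it as usual.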

Re: [slurm-users] slurmctld hanging

2022-07-28 Thread Loris Bennett
Hi Byron,

byron writes:
> Hi
>
> We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3
> times in 2 months) have slurmctld hanging so we get the following message
> when running sinfo
>
> “slurm_load_jobs error: Socket timed out on send/recv operation”
>
> It only seem

[slurm-users] slurmctld hanging

2022-07-28 Thread byron
Hi

We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3 times in 2 months) have slurmctld hanging so we get the following message when running sinfo

“slurm_load_jobs error: Socket timed out on send/recv operation”

It only seems to happen when one of our users runs a job