Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Paul Raines
This is definitely an NVML thing crashing slurmstepd. Here is what I find doing an strace of the slurmstepd: [3681401.0] process at the point the crash happens:

[pid 1132920] fcntl(10, F_SETFD, FD_CLOEXEC) = 0
[pid 1132920] read(10, "1132950 (bash) S 1132919 1132950"..., 511) = 339
[pid 11329

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Heckes, Frank
1.1. I never had the chance to do this before :-0 As we have a support contract I would open a ticket. > -Original Message- > From: slurm-users On Behalf Of > Ole Holm Nielsen > Sent: Tuesday, 30 January 2024 10:04 > To: slurm-users@lists.schedmd.com > Subject: Re: [slurm-users

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Paul Raines
I built 23.02.7 and tried that and had the same problems. BTW, I am using the slurm.spec rpm build method (built on Rocky 8 boxes with NVIDIA 535.54.03 proprietary drivers installed). The behavior I was seeing was one would start a GPU job. It was fine at first but at some point the slurmste

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Ole Holm Nielsen
On 1/30/24 09:36, Fokke Dijkstra wrote: We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in a completing state and slurmd daemons can't be killed because they are left in a CLOSE-WAIT state. See my previous mail to the mailing list for the details. And also https://bugs.sc

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Fokke Dijkstra
We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in a completing state and slurmd daemons can't be killed because they are left in a CLOSE-WAIT state. See my previous mail to the mailing list for the details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 for another
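For anyone wanting to check a node for the CLOSE-WAIT symptom described above, here is a minimal sketch, not anything from the bug report itself. It assumes a Linux node and parses text in the /proc/net/tcp layout, where the `st` column holds the kernel TCP state in hex and 08 means CLOSE_WAIT. The function names and the sample rows are hypothetical; port 6818 (hex 1AA2) is used only because it is the default SlurmdPort.

```python
# Sketch: list IPv4 TCP sockets in CLOSE_WAIT from /proc/net/tcp-style text.
CLOSE_WAIT = 0x08  # kernel TCP state code for CLOSE_WAIT


def parse_addr(hex_addr):
    """Convert '0A00000A:1AA2' (little-endian IPv4 hex plus hex port) to ('10.0.0.10', 6818)."""
    ip_hex, port_hex = hex_addr.split(":")
    # The address is printed as a little-endian 32-bit word, so read the bytes backwards.
    octets = [str(int(ip_hex[i:i + 2], 16)) for i in (6, 4, 2, 0)]
    return ".".join(octets), int(port_hex, 16)


def close_wait_sockets(proc_net_tcp_text):
    """Return (local, remote) address pairs for every socket whose state is CLOSE_WAIT."""
    results = []
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with an index such as '0:'; this skips the header and blank lines.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        if int(fields[3], 16) == CLOSE_WAIT:
            results.append((parse_addr(fields[1]), parse_addr(fields[2])))
    return results


# Hypothetical sample rows, shortened to the first few columns.
SAMPLE = """\
  sl  local_address rem_address   st
   0: 0100007F:1AA2 00000000:0000 0A
   1: 0A00000A:1AA2 0B00000A:C350 08
"""

if __name__ == "__main__":
    # On a real node one would pass open("/proc/net/tcp").read() instead of SAMPLE.
    for local, remote in close_wait_sockets(SAMPLE):
        print(local, "<-", remote)
```

A socket that sits in CLOSE_WAIT means the remote side has closed the connection but the local process has not yet closed its end, which fits a daemon that is hung rather than one that is busy.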

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-28 Thread Paul Raines
Some more info on what I am seeing after the 23.11.3 upgrade. Here is a case where a job is cancelled but seems permanently stuck in 'CG' state in squeue:

[2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
[2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06 #CPUs
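To see which nodes are holding jobs in 'CG' as in the case above, a small sketch along these lines may help. It is an illustration, not the poster's method: it assumes `squeue` is on the PATH and uses its `-h` (no header), `-t CG` (state filter) and `-o "%i|%T|%N"` (job ID, state, node list) options. The function names are hypothetical, and the sample line just reuses the job ID and node from the log excerpt.

```python
import subprocess
from collections import defaultdict


def parse_completing(squeue_text):
    """Group COMPLETING job IDs by node list from 'jobid|state|nodelist' lines."""
    by_node = defaultdict(list)
    for line in squeue_text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        job_id, state, nodelist = parts
        if state == "COMPLETING":
            # The node list is kept as printed, so a compressed range stays one key.
            by_node[nodelist].append(job_id)
    return dict(by_node)


def query_squeue():
    """Run squeue for jobs in the CG state (assumes a working Slurm client setup)."""
    out = subprocess.run(
        ["squeue", "-h", "-t", "CG", "-o", "%i|%T|%N"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_completing(out)


# Illustrative input only, built from the job ID and node in the log above.
SAMPLE = "3679903|COMPLETING|rtx-06\n"

if __name__ == "__main__":
    print(parse_completing(SAMPLE))
```

Running this periodically and comparing the results would show whether the same jobs stay in 'CG' indefinitely or eventually clear.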

[slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-28 Thread Paul Raines
I finally had downtime on our cluster running 20.11.3 and decided to upgrade SLURM. All daemons were stopped on nodes and master. Rocky 8 Linux OS was updated but not changed configuration-wise in any way. On the master, when I first installed 23.11.1 and tried to run slurmdbd -D -vvv at the co