This is definitely an NVML thing crashing slurmstepd. Here is what I find
when doing an strace of the slurmstepd: [3681401.0] process at the point the
crash happens:
[pid 1132920] fcntl(10, F_SETFD, FD_CLOEXEC) = 0
[pid 1132920] read(10, "1132950 (bash) S 1132919 1132950"..., 511) = 339
[pid 11329
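For reference, a trace like the one above can be captured by attaching strace
to the running step daemon; a minimal sketch (the pgrep pattern uses the job
id from this run, the PID placeholder and output file name are just examples):

  # find the slurmstepd process serving this job step
  pgrep -af 'slurmstepd: \[3681401'
  # attach with child-following and log the syscalls to a file
  strace -f -p <slurmstepd_pid> -s 128 -o /tmp/slurmstepd-3681401.trace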
I never had the chance to do this before :-O
As we have a support contract, I would open a ticket.
> -----Original Message-----
> From: slurm-users On Behalf Of Ole Holm Nielsen
> Sent: Tuesday, 30 January 2024 10:04
> To: slurm-users@lists.schedmd.com
> Subject: Re: [slurm-users
I built 23.02.7 and tried that, and had the same problems.
BTW, I am using the slurm.spec rpm build method (built on Rocky 8 boxes
with the NVIDIA 535.54.03 proprietary drivers installed), roughly as sketched below.
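The build step itself is nothing special; a sketch of what I mean by the
slurm.spec route (the tarball name matches the version above, and the note on
NVML detection is my understanding rather than something verified here):

  # build binary RPMs straight from the release tarball's bundled slurm.spec
  rpmbuild -ta slurm-23.02.7.tar.bz2
  # NVML/gres autodetection is only compiled in if the NVIDIA development
  # libraries are present on the build host at configure time;
  # the resulting RPMs land under ~/rpmbuild/RPMS/x86_64/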
The behavior I was seeing was that one would start a GPU job. It was fine at
first, but at some point the slurmste
On 1/30/24 09:36, Fokke Dijkstra wrote:
We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in
a completing state and slurmd daemons can't be killed because they are left
in a CLOSE-WAIT state. See my previous mail to the mailing list for the
details. And also https://bugs.schedmd.com/show_bug.cgi?id=18561 for
another
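In case it is useful for comparing notes, the stuck sockets are easy to spot
on an affected node; a small sketch (run as root, the slurm filter is just an
example):

  # list TCP sockets sitting in CLOSE-WAIT along with the owning process
  ss -tnp state close-wait
  # narrow it down to the slurm daemons
  ss -tnp state close-wait | grep -i slurm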
Some more info on what I am seeing after the 23.11.3 upgrade.
Here is a case where a job was cancelled but seems permanently
stuck in the 'CG' (completing) state in squeue:
[2024-01-28T17:34:11.002] debug3: sched: JobId=3679903 initiated
[2024-01-28T17:34:11.002] sched: Allocate JobId=3679903 NodeList=rtx-06
#CPUs
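For completeness, these are the sort of checks and the usual workaround for a
job stuck in CG (a sketch only; the job id and node name are taken from the
log lines above, and downing the node kills anything still running on it):

  # confirm the job is stuck completing and see which node it is on
  squeue -t CG -j 3679903
  scontrol show job 3679903
  # on the node, see whether step daemons are still hanging around
  pgrep -af slurmstepd
  # last resort: down and resume the node to clear the completing job
  scontrol update nodename=rtx-06 state=down reason="stuck CG job"
  scontrol update nodename=rtx-06 state=resume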
I finally had downtime on our cluster running 20.11.3 and decided to
upgrade SLURM. All daemons were stopped on nodes and master.
The Rocky 8 Linux OS was updated but not changed configuration-wise
in any way.
On the master, when I first installed 23.11.1 and tried to run
slurmdbd -D -vvv at the co
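For context, the foreground database conversion step I am referring to looks
roughly like this (a sketch, assuming a standard systemd/RPM install; service
names may differ on other setups):

  # stop the daemons before touching the database
  systemctl stop slurmctld slurmdbd
  # run slurmdbd in the foreground with verbose logging and let the schema
  # conversion finish before starting the service again
  slurmdbd -D -vvv
  systemctl start slurmdbd slurmctld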