Re: [slurm-users] Database cluster

2024-01-23 Thread Diego Zuccato
IIUC the database is not "critical": if it goes down, you lose access to 
some statistics. But job data gets cached anyway and the db will be 
updated when it comes back online.
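
A quick way to see that caching in action (just a sketch; sdiag is standard,
but output wording can vary by version) is to watch the DBD agent queue on the
controller, which grows while slurmdbd is unreachable and drains once it comes
back:

  # on the slurmctld host: accounting records queued for the dbd
  sdiag | grep -i "DBD Agent queue size"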


Diego

On 22/01/2024 18:23, Daniel L'Hommedieu wrote:

Community:

What do you do to ensure database reliability in your SLURM environment?  We 
can have multiple controllers and multiple slurmdbds, but my understanding is 
that slurmdbd can be configured with a single MySQL server, so what do you do?  
Do you have that “single MySQL server” be a cluster, such as Percona XtraDB?  
Do you use MySQL replication, then manually switch slurmdbd to a replication 
slave if the master goes down?  Do you do something else?

Thanks.

Daniel


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

2024-01-23 Thread Diego Zuccato
Also, remember to specify the memory used by the job, treating it as a TRES, 
if you're using CR_*Memory to select resources.


Diego

On 18/01/2024 15:44, Ümit Seren wrote:

This line also has to be changed:


#SBATCH --gpus-per-node=4#SBATCH --gpus-per-node=1

--gpus-per-node seems to be the new parameter that is replacing the 
--gres= one, so you can remove the --gres line completely.
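
For illustration, a minimal sketch of the request lines after that change
(assuming the same job name, partition and single task as in the original
script) would be:

  #!/bin/bash
  #SBATCH --job-name=gpu-job
  #SBATCH --partition=gpu
  #SBATCH --nodes=1
  #SBATCH --gpus-per-node=1      # one GPU per job; no --gres line needed
  #SBATCH --tasks-per-node=1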


Best

Ümit

From: slurm-users  on behalf of Kherfani, Hafedh (Professional Services, TC) 
Date: Thursday, 18. January 2024 at 15:40
To: Slurm User Community List 
Subject: Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)


Hi Noam and Matthias,

Thanks both for your answers.

I changed the “#SBATCH --gres=gpu:4“ directive (in the batch script) to 
“#SBATCH --gres=gpu:1“ as you suggested, but it didn’t make a difference: 
running this batch script 3 times still results in the first job running 
while the second and third jobs remain pending …


[slurmtest@c-a100-master test-batch-scripts]$ cat gpu-job.sh
#!/bin/bash
#SBATCH --job-name=gpu-job
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
#SBATCH --gres=gpu:1                 # Changed from ‘4’ to ‘1’
#SBATCH --tasks-per-node=1
#SBATCH --output=gpu_job_output.%j
#SBATCH --error=gpu_job_error.%j
hostname
date
sleep 40
pwd

[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 217

[slurmtest@c-a100-master test-batch-scripts]$ squeue
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    217       gpu  gpu-job slurmtes  R   0:02      1 c-a100-cn01

[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 218

[slurmtest@c-a100-master test-batch-scripts]$ sbatch gpu-job.sh
Submitted batch job 219

[slurmtest@c-a100-master test-batch-scripts]$ squeue
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    219       gpu  gpu-job slurmtes PD   0:00      1 (Priority)
    218       gpu  gpu-job slurmtes PD   0:00      1 (Resources)
    217       gpu  gpu-job slurmtes  R   0:07      1 c-a100-cn01


Basically I’m looking for help/hints on how to tell Slurm, from the batch 
script, “I want only 1 or 2 GPUs to be used by this job”, so that I can run 
the batch job a couple of times with sbatch and confirm that multiple jobs 
can each use a GPU and run in parallel at the same time.

Makes sense?

Best regards,

Hafedh

From: slurm-users  On Behalf Of Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Sent: Thursday, 18 January 2024 2:30 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)


On Jan 18, 2024, at 7:31 AM, Matthias Loose  wrote:

Hi Hafedh,

I'm no expert in the GPU side of SLURM, but looking at your current
configuration, to me it's working as intended at the moment. You have
defined 4 GPUs and start multiple jobs, each consuming 4 GPUs.
So the jobs wait for the resource to be free again.

I think what you need to look into is the MPS plugin, which seems to
do what you are trying to achieve:
https://slurm.schedmd.com/gres.html#MPS_Management


I agree with the first paragraph.  How many GPUs are you expecting each 
job to use? I'd have assumed, based on the original text, that each job 
is supposed to use 1 GPU, and the 4 jobs were supposed to be running 
side-by-side on the one node you have (with 4 GPUs).  If so, you need to 
tell each job to request only 1 GPU, and currently each one is requesting 4.
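
One way to double-check the mapping (a sketch, assuming Slurm exports the
usual GPU environment for jobs that request GPUs) is to have each job print
what it was actually given:

  # inside the batch script
  echo "Job $SLURM_JOB_ID got CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
  nvidia-smi -L

With --gres=gpu:1 (or --gpus-per-node=1), up to four such jobs should be able
to run side by side on the node.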


If your jobs are actually supposed to be using 4 GPUs each, I still 
don't see any advantage to MPS (at least for my usual GPU usage 
pattern): all the jobs will take longer to finish, because they are 
sharing a fixed resource. If they take turns, at least the first ones 
finish as fast as they can, and the last one will finish no later than 
it would have if they were all time-sharing the GPUs.  NVIDIA presumably 
had something in mind when they developed MPS, so our pattern may not be 
typical (or at least not universal), and in that case the MPS 
plugin may well be what you need.




--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



[slurm-users] Issues with Slurm 23.11.1

2024-01-23 Thread Fokke Dijkstra
Dear all,

Since the upgrade from Slurm 22.05 to 23.11.1 we are having problems with
the communication between the slurmctld and slurmd processes.
We are running a cluster with 183 nodes and almost 19000 cores.
Unfortunately some nodes are in a different network, preventing full
internode communication. A network topology file and the TopologyParam
RouteTree setting have been used to make sure no slurmd communication happens
between nodes on different networks.
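
(For context, that combination is configured roughly as sketched below; the
switch and node names here are placeholders, not our real ones.)

  # slurm.conf
  TopologyPlugin=topology/tree
  TopologyParam=RouteTree

  # topology.conf
  SwitchName=net1 Nodes=node[001-100]
  SwitchName=net2 Nodes=node[101-183]
  SwitchName=root Switches=net1,net2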

In the new Slurm version we see the following issues, which did not appear
in 22.05:

1. slurmd processes acquire many network connections in CLOSE-WAIT (or
CLOSE_WAIT, depending on the tool used), causing the processes to hang when
trying to restart slurmd.

When checking for CLOSE-WAIT processes we see the following behaviour:
Recv-Q Send-Q Local Address:Port  Peer Address:Port Process

1  0  10.5.2.40:6818 10.5.0.43:58572
users:(("slurmd",pid=1930095,fd=72),("slurmd",pid=1930067,fd=72))
1  0  10.5.2.40:6818 10.5.0.43:58284
users:(("slurmd",pid=1930095,fd=8),("slurmd",pid=1930067,fd=8))
1  0  10.5.2.40:6818 10.5.0.43:58186
users:(("slurmd",pid=1930095,fd=22),("slurmd",pid=1930067,fd=22))
1  0  10.5.2.40:6818 10.5.0.43:58592
users:(("slurmd",pid=1930095,fd=76),("slurmd",pid=1930067,fd=76))
1  0  10.5.2.40:6818 10.5.0.43:58338
users:(("slurmd",pid=1930095,fd=19),("slurmd",pid=1930067,fd=19))
1  0  10.5.2.40:6818 10.5.0.43:58568
users:(("slurmd",pid=1930095,fd=68),("slurmd",pid=1930067,fd=68))
1  0  10.5.2.40:6818 10.5.0.43:58472
users:(("slurmd",pid=1930095,fd=69),("slurmd",pid=1930067,fd=69))
1  0  10.5.2.40:6818 10.5.0.43:58486
users:(("slurmd",pid=1930095,fd=38),("slurmd",pid=1930067,fd=38))
1  0  10.5.2.40:6818 10.5.0.43:58316
users:(("slurmd",pid=1930095,fd=29),("slurmd",pid=1930067,fd=29))

The first IP address is that of the compute node, the second that of the
node running slurmctld. The nodes can communicate using these IP addresses
just fine.
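
(The listing above looks like ss output; something along these lines
reproduces it and gives a quick count, assuming the default slurmd port 6818:)

  ss -tnp state close-wait '( sport = :6818 )'
  ss -tn state close-wait '( sport = :6818 )' | wc -l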

2. slurmd cannot be properly restarted
[2024-01-18T10:45:26.589] slurmd version 23.11.1 started
[2024-01-18T10:45:26.593] error: Error binding slurm stream socket: Address
already in use
[2024-01-18T10:45:26.593] fatal: Unable to bind listen port (6818): Address
already in use

This is probably because of the processes being in CLOSE-WAIT, which can
only be killed using signal -9.
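
(In practice that means a forced restart along these lines, sketched here for
the default service name:)

  systemctl stop slurmd
  pkill -9 -x slurmd      # clear the hung processes still holding port 6818
  systemctl start slurmd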

3. We see jobs stuck in completing CG state, probably due to communication
issues between slurmctld and slurmd. The slurmctld sends repeated kill
requests but those do not seem to be acknowledged by the client. This
happens more often in large job arrays, or generally when many jobs start
at the same time. However, this could be just a biased observation (i.e.,
it is more noticeable on large job arrays because there are more jobs to
fail in the first place).

4. Since the new version we also see messages like:
[2024-01-17T09:58:48.589] error: Failed to kill program loading user
environment
[2024-01-17T09:58:48.590] error: Failed to load current user environment
variables
[2024-01-17T09:58:48.590] error: _get_user_env: Unable to get user's local
environment, running only with passed environment
The effect of this is that the users run with the wrong environment and
can’t load the modules for the software that is needed by their jobs. This
leads to many job failures.

The issue appears to be somewhat similar to the one described at:
https://bugs.schedmd.com/show_bug.cgi?id=18561
In that case the site downgraded the slurmd clients to 22.05 which got rid
of the problems.
We’ve now downgraded the slurmd on the compute nodes to 23.02.7 which also
seems to be a workaround for the issue.

Does anyone know of a better solution?

Kind regards,

Fokke Dijkstra

-- 
Fokke Dijkstra  
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA  Groningen, The Netherlands


Re: [slurm-users] Database cluster

2024-01-23 Thread Daniel L'Hommedieu
Hi Diego.

In our setup, the database is critical.  We have some wrapper scripts that 
consult the database for information, and we also set environment variables on 
login, based on user/partition associations.  If the database is down, none of 
those things work.

I doubt there is appetite in the organization to change the way our setup 
works, but if we can improve database reliability, that would be a good 
solution.  Mostly I am interested in protecting from hardware failure, and 
that’s why I’m interested in a cluster solution such as XtraDB.

Thanks.

Daniel

> On Jan 23, 2024, at 03:23, Diego Zuccato  wrote:
> 
> IIUC the database is not "critical": if it goes down, you lose access to some 
> statistics. But job data gets cached anyway and the db will be updated when 
> it comes back online.
> 
> Diego
> 
> On 22/01/2024 18:23, Daniel L'Hommedieu wrote:
>> Community:
>> What do you do to ensure database reliability in your SLURM environment?  We 
>> can have multiple controllers and multiple slurmdbds, but my understanding 
>> is that slurmdbd can be configured with a single MySQL server, so what do 
>> you do?  Do you have that “single MySQL server” be a cluster, such as 
>> Percona XtraDB?  Do you use MySQL replication, then manually switch to 
>> slurmdbd to a replication slave if the master goes down?  Do you do 
>> something else?
>> Thanks.
>> Daniel
> 
> -- 
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
> 




Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-23 Thread Tim Schneider

Hi,

I have filed a bug report with SchedMD 
(https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support told 
me they cannot invest time in this issue since I don't have a support 
contract. Maybe they will look into it once it affects more people or 
someone important enough.


So far, I have resorted to using 5.15.0-89-generic, but I am also a bit 
concerned about the security aspect of this choice.


Best,

Tim

On 23.01.24 14:59, Stefan Fleischmann wrote:

Hi!

I'm seeing the same in our environment. My conclusion is that it is a
regression in the Ubuntu 5.15 kernel, introduced with 5.15.0-90-generic.
Last working kernel version is 5.15.0-89-generic. I have filed a bug
report here: https://bugs.launchpad.net/bugs/2050098

Please add yourself to the affected users in the bug report so it
hopefully gets more attention.

I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem does
not exist there. 6.5 is the latest hwe kernel for 22.04 and would be an
option for now. Reverting back to 5.15.0-89 would work as well, but I
haven't looked into the security aspects of that.
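
(On Ubuntu 22.04 the two options translate roughly to the commands below;
package names are from memory, so double-check them before relying on this:)

  # move to the 6.5 HWE kernel series
  apt-get install --install-recommends linux-generic-hwe-22.04

  # or pin 5.15.0-89 and keep apt from pulling in -90 and later
  apt-get install linux-image-5.15.0-89-generic linux-headers-5.15.0-89-generic
  apt-mark hold linux-image-5.15.0-89-generic linux-headers-5.15.0-89-generic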

Cheers,
Stefan

On Mon, 22 Jan 2024 13:31:15 -0300
cristobal.navarro.g at gmail.com wrote:


Hi Tim and community,
We started having the same issue (cgroups not working, it seems: jobs see
all GPUs) on a GPU-compute node (DGX A100) a couple of days ago, after a
full update (apt upgrade). Now whenever we launch a job on that partition,
we get the error message mentioned by Tim.
As a note, we have another custom GPU-compute node with L40s, on a
different partition, and that one works fine.
Before this error, we always had small differences in kernel version
between nodes, so I am not sure if this can be the problem.
Nevertheless, here is the info of our nodes as well.

*[Problem node]* The DGX A100 node has this kernel
cnavarro at nodeGPU01:~$ uname -a
Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30
UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

*[Functioning node]* The Custom GPU node (L40s) has this kernel
cnavarro at nodeGPU02:~$ uname -a
Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08
UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

*And the login node *(slurmctld)
?  ~ uname -a
Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Any ideas what we should check?

On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider  wrote:


Hi,

I am using SLURM 22.05.9 on a small compute cluster. Since I
reinstalled two of our nodes, I get the following error when
launching a job:

slurmstepd: error: load_ebpf_prog: BPF load error (No space left on
device). Please check your system limits (MEMLOCK).
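
(Since the message points at MEMLOCK, one cheap sanity check, sketched here,
is to compare the locked-memory limit slurmd actually runs with on a working
node and a failing node, and raise it via a systemd override if they differ:)

  grep "locked memory" /proc/$(pidof slurmd)/limits
  # to raise it: systemctl edit slurmd and add, under [Service]:
  #   LimitMEMLOCK=infinity
  # then: systemctl daemon-reload && systemctl restart slurmd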

Also the cgroups do not seem to work properly anymore, as I am able
to see all GPUs even if I do not request them, which is not the
case on the other nodes.

One difference I found between the updated nodes and the original
nodes (both are Ubuntu 22.04) is the kernel version, which is
"5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and
"5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could
not figure out how to install the exact first kernel version on the
updated nodes, but I noticed that when I reinstall 5.15.0 with this
tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the
error message disappears. However, once I do that, the network
driver does not function properly anymore, so this does not seem to
be a good solution.

Has anyone seen this issue before or is there maybe something else I
should take a look at? I am also happy to just find a workaround
such that I can take these nodes back online.

I appreciate any help!

Thanks a lot in advance and best wishes,

Tim


  




Re: [slurm-users] Database cluster

2024-01-23 Thread Xand Meaden
Hi,

We are using Percona XtraDB cluster to achieve HA for our Slurm databases. 
There is a single virtual IP that will be kept on one of the cluster's servers 
using keepalived.
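
(On the Slurm side that just means pointing slurmdbd at the virtual IP; a
minimal sketch, with a made-up address and database name:)

  # slurmdbd.conf (excerpt)
  StorageType=accounting_storage/mysql
  StorageHost=10.0.0.100      # keepalived VIP in front of the XtraDB nodes
  StoragePort=3306
  StorageUser=slurm
  StorageLoc=slurm_acct_db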

Regards,
Xand

From: slurm-users  on behalf of Daniel 
L'Hommedieu 
Sent: 22 January 2024 17:23
To: Slurm User Community List 
Subject: [slurm-users] Database cluster


Community:

What do you do to ensure database reliability in your SLURM environment?  We 
can have multiple controllers and multiple slurmdbds, but my understanding is 
that slurmdbd can be configured with a single MySQL server, so what do you do?  
Do you have that “single MySQL server” be a cluster, such as Percona XtraDB?  
Do you use MySQL replication, then manually switch slurmdbd to a replication 
slave if the master goes down?  Do you do something else?

Thanks.

Daniel


Re: [slurm-users] Database cluster

2024-01-23 Thread Daniel L'Hommedieu
Xand,

Thanks - that’s great to hear.  I was thinking of using Anycast to achieve the 
same thing, but good to know that keepalived is a viable solution as well.

Best,
Daniel

> On Jan 23, 2024, at 09:29, Xand Meaden  wrote:
> 
> Hi,
> 
> We are using Percona XtraDB cluster to achieve HA for our Slurm databases. 
> There is a single virtual IP that will be kept on one of the cluster's 
> servers using keepalived.
> 
> Regards,
> Xand
> From: slurm-users  on behalf of Daniel L'Hommedieu 
> Sent: 22 January 2024 17:23
> To: Slurm User Community List 
> Subject: [slurm-users] Database cluster
> 
> Community:
> 
> What do you do to ensure database reliability in your SLURM environment?  We 
> can have multiple controllers and multiple slurmdbds, but my understanding is 
> that slurmdbd can be configured with a single MySQL server, so what do you 
> do?  Do you have that “single MySQL server” be a cluster, such as Percona 
> XtraDB?  Do you use MySQL replication, then manually switch to slurmdbd to a 
> replication slave if the master goes down?  Do you do something else?
> 
> Thanks.
> 
> Daniel



Re: [slurm-users] Issues with Slurm 23.11.1

2024-01-23 Thread Brian Haymore
Do you have a firewall between the slurmd and the slurmctld daemons?  If yes, 
do you know what kind of idle timeout that firewall has for expiring idle 
sessions?  I ran into something somewhat similar, but for me it was between 
slurmctld and slurmdbd: a recent change had left one direction between those 
two daemons idle unless certain operations occurred, and we had a firewall 
device between them that was expiring those idle sessions.  In our case 
23.11.1 brought a fix for that specific issue.  I never had issues between 
slurmctld and slurmd (though the firewall is not between those two layers).
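
(If a firewall with an idle timeout does sit in that path, one generic
mitigation, sketched here with example values only, is OS-level TCP keepalives
tuned well below the firewall's timeout; note this only helps connections that
enable SO_KEEPALIVE:)

  # /etc/sysctl.d/90-tcp-keepalive.conf  (example values)
  net.ipv4.tcp_keepalive_time = 300
  net.ipv4.tcp_keepalive_intvl = 30
  net.ipv4.tcp_keepalive_probes = 5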

--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112
Phone: 801-558-1150
http://bit.ly/1HO1N2C

From: slurm-users  on behalf of Fokke 
Dijkstra 
Sent: Tuesday, January 23, 2024 4:00 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Issues with Slurm 23.11.1

Dear all,

Since the upgrade from Slurm 22.05 to 23.11.1 we are having problems with the 
communication between the slurmctld and slurmd processes.
We are running a cluster with 183 nodes and almost 19000 cores. Unfortunately 
some nodes are in a different network preventing full internode communication. 
A network topology and setting TopologyParam RouteTree have been used to make 
sure no slurmd communication happens between nodes on different networks.

In the new Slurm version we see the following issues, which did not appear in 
22.05:

1. slurmd processes acquire many network connections in CLOSE-WAIT (or 
CLOSE_WAIT depending on the tool used) causing the processes to hang, when 
trying to restart slurmd.

When checking for CLOSE-WAIT processes we see the following behaviour:
Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
1  0  10.5.2.40:6818 10.5.0.43:58572
users:(("slurmd",pid=1930095,fd=72),("slurmd",pid=1930067,fd=72))
1  0  10.5.2.40:6818 10.5.0.43:58284
users:(("slurmd",pid=1930095,fd=8),("slurmd",pid=1930067,fd=8))
1  0  10.5.2.40:6818 10.5.0.43:58186
users:(("slurmd",pid=1930095,fd=22),("slurmd",pid=1930067,fd=22))
1  0  10.5.2.40:6818 10.5.0.43:58592
users:(("slurmd",pid=1930095,fd=76),("slurmd",pid=1930067,fd=76))
1  0  10.5.2.40:6818 10.5.0.43:58338
users:(("slurmd",pid=1930095,fd=19),("slurmd",pid=1930067,fd=19))
1  0  10.5.2.40:6818 10.5.0.43:58568
users:(("slurmd",pid=1930095,fd=68),("slurmd",pid=1930067,fd=68))
1  0  10.5.2.40:6818 10.5.0.43:58472
users:(("slurmd",pid=1930095,fd=69),("slurmd",pid=1930067,fd=69))
1  0  10.5.2.40:6818 10.5.0.43:58486
users:(("slurmd",pid=1930095,fd=38),("slurmd",pid=1930067,fd=38))
1  0  10.5.2.40:6818 10.5.0.43:58316
users:(("slurmd",pid=1930095,fd=29),("slurmd",pid=1930067,fd=29))

The first IP address is that of the compute node, the second that of the node 
running slurmctld. The nodes can communicate using these IP addresses just fine.

2. slurmd cannot be properly restarted
[2024-01-18T10:45:26.589] slurmd version 23.11.1 started
[2024-01-18T10:45:26.593] error: Error binding slurm stream socket: Address 
already in use
[2024-01-18T10:45:26.593] fatal: Unable to bind listen port (6818): Address 
already in use

This is probably because of the processes being in CLOSE-WAIT, which can only 
be killed using signal -9.

3. We see jobs stuck in completing CG state, probably due to communication 
issues between slurmctld and slurmd. The slurmctld sends repeated kill requests 
but those do not seem to be acknowledged by the client. This happens more often 
in large job arrays, or generally when many jobs start at the same time. 
However, this could be just a biased observation (i.e., it is more noticeable 
on large job arrays because there are more jobs to fail in the first place).

4. Since the new version we also see messages like:
[2024-01-17T09:58:48.589] error: Failed to kill program loading user environment
[2024-01-17T09:58:48.590] error: Failed to load current user environment 
variables
[2024-01-17T09:58:48.590] error: _get_user_env: Unable to get user's local 
environment, running only with passed environment
The effect of this is that the users run with the wrong environment and can’t 
load the modules for the software that is needed by their jobs. This leads to 
many job failures.

The issue appears to be somewhat similar to the one described at: 
https://bugs.schedmd.com/show_bug.cgi?id=18561

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-23 Thread Charles Hedrick
See my comments on https://bugs.launchpad.net/bugs/2050098. There's a pretty 
simple fix in slurm.

As far as I can tell, there's nothing wrong with the slurm code. But it's using 
an option that it doesn't actually need, and that seems to be causing trouble 
in the kernel.



From: slurm-users  on behalf of Tim 
Schneider 
Sent: Tuesday, January 23, 2024 9:20 AM
To: Stefan Fleischmann ; slurm-users@lists.schedmd.com 

Subject: Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error 
(No space left on device). Please check your system limits (MEMLOCK).

Hi,

I have filed a bug report with SchedMD
(https://bugs.schedmd.com/show_bug.cgi?id=18623), but the support told
me they cannot invest time in this issue since I don't have a support
contract. Maybe they will look into it once it affects more people or
someone important enough.

So far, I have resorted to using 5.15.0-89-generic, but I am also a bit
concerned about the security aspect of this choice.

Best,

Tim

On 23.01.24 14:59, Stefan Fleischmann wrote:
> Hi!
>
> I'm seeing the same in our environment. My conclusion is that it is a
> regression in the Ubuntu 5.15 kernel, introduced with 5.15.0-90-generic.
> Last working kernel version is 5.15.0-89-generic. I have filed a bug
> report here: https://bugs.launchpad.net/bugs/2050098
>
> Please add yourself to the affected users in the bug report so it
> hopefully gets more attention.
>
> I've tested with newer kernels (6.5, 6.6 and 6.7) and the problem does
> not exist there. 6.5 is the latest hwe kernel for 22.04 and would be an
> option for now. Reverting back to 5.15.0-89 would work as well, but I
> haven't looked into the security aspects of that.
>
> Cheers,
> Stefan
>
> On Mon, 22 Jan 2024 13:31:15 -0300
> cristobal.navarro.g at gmail.com wrote:
>
>> Hi Tim and community,
>> We are currently having the same issue (cgroups not working it seems,
>> showing all GPUs on jobs) on a GPU-compute node (DGX A100) a couple
>> of days ago after a full update (apt upgrade). Now whenever we launch
>> a job for that partition, we get the error message mentioned by Tim.
>> As a note, we have another custom GPU-compute node with L40s, on a
>> different partition, and that one works fine.
>> Before this error, we always had small differences in kernel version
>> between nodes, so I am not sure if this can be the problem.
>> Nevertheless, here is the info of our nodes as well.
>>
>> *[Problem node]* The DGX A100 node has this kernel
>> cnavarro at nodeGPU01:~$ uname -a
>> Linux nodeGPU01 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30
>> UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>>
>> *[Functioning node]* The Custom GPU node (L40s) has this kernel
>> cnavarro at nodeGPU02:~$ uname -a
>> Linux nodeGPU02 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08
>> UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>>
>> *And the login node *(slurmctld)
>> ?  ~ uname -a
>> Linux patagon-master 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14
>> 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Any ideas what we should check?
>>
>> On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider  wrote:
>>
>>> Hi,
>>>
>>> I am using SLURM 22.05.9 on a small compute cluster. Since I
>>> reinstalled two of our nodes, I get the following error when
>>> launching a job:
>>>
>>> slurmstepd: error: load_ebpf_prog: BPF load error (No space left on
>>> device). Please check your system limits (MEMLOCK).
>>>
>>> Also the cgroups do not seem to work properly anymore, as I am able
>>> to see all GPUs even if I do not request them, which is not the
>>> case on the other nodes.
>>>
>>> One difference I found between the updated nodes and the original
>>> nodes (both are Ubuntu 22.04) is the kernel version, which is
>>> "5.15.0-89-generic #99-Ubuntu SMP" on the functioning nodes and
>>> "5.15.0-91-generic #101-Ubuntu SMP" on the updated nodes. I could
>>> not figure out how to install the exact first kernel version on the
>>> updated nodes, but I noticed that when I reinstall 5.15.0 with this
>>> tool: https://github.com/pimlie/ubuntu-mainline-kernel.sh, the
>>> error message disappears. However, once I do that, the network
>>> driver does not function properly anymore, so this does not seem to
>>> be a good solution.
>>>
>>> Has anyone seen this issue before or is there maybe something else I
>>> should take a look at? I am also happy to just find a workaround
>>> such that I can take these nodes back online.
>>>
>>> I appreciate any help!
>>>
>>> Thanks a lot in advance and best wishes,
>>>
>>> Tim
>>>
>>>
>>>



Re: [slurm-users] GPU devices mapping with job's cgroup in cgroups v2 using eBPF

2024-01-23 Thread Charles Hedrick
To see the specific GPU allocated, I think this will do it:

scontrol show job -d | grep -E "JobId=| GRES"


From: slurm-users  on behalf of Mahendra 
Paipuri 
Sent: Sunday, January 7, 2024 3:33 PM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] GPU devices mapping with job's cgroup in cgroups v2 
using eBPF

Hello all,

Happy new year!

We have recently upgraded the cgroups on our SLURM cluster to v2. In cgroups 
v1, the `/devices.list` interface used to show which devices had been attached 
to a particular cgroup. From my understanding, cgroups v2 uses eBPF to manage 
devices, and so does SLURM to manage the GPUs.

I was looking for a way to programmatically determine the job-cgroup-to-device 
mapping, and I came across this thread 
(https://bugzilla.redhat.com/show_bug.cgi?id=1717396) which has a similar 
discussion in the context of VMs.

So, I have used `bpftool` to inspect the job cgroups. An example output:
```
# /tmp/bpftool cgroup list 
/sys/fs/cgroup/system.slice/slurmstepd.scope/job_1956132
ID   AttachType  AttachFlags Name
```
When I add `effective` flag, I see the attached eBPF program

```
# /tmp/bpftool cgroup list 
/sys/fs/cgroup/system.slice/slurmstepd.scope/job_1956132 effective
ID   AttachType  Name
4197 cgroup_device   Slurm_Cgroup_v2
```
From my understanding, the `effective` flag shows the inherited eBPF programs 
as well. So my question is: at which level of the cgroup hierarchy is the eBPF 
program attached? I tried to inspect various levels but all of them returned none.

Then, looking into the translated bytecode of the eBPF program, I get the following:
```
# /tmp/bpftool prog dump xlated id 4197
  0: (61) r2 = *(u32 *)(r1 +0)
  1: (54) w2 &= 65535
  2: (61) r3 = *(u32 *)(r1 +0)
  3: (74) w3 >>= 16
  4: (61) r4 = *(u32 *)(r1 +4)
  5: (61) r5 = *(u32 *)(r1 +8)
  6: (55) if r2 != 0x2 goto pc+4
  7: (55) if r4 != 0xc3 goto pc+3
  8: (55) if r5 != 0x0 goto pc+2
  9: (b7) r0 = 0
 10: (95) exit
 11: (55) if r2 != 0x2 goto pc+4
 12: (55) if r4 != 0xc3 goto pc+3
 13: (55) if r5 != 0x1 goto pc+2
 14: (b7) r0 = 0
 15: (95) exit
 16: (55) if r2 != 0x2 goto pc+4
 17: (55) if r4 != 0xc3 goto pc+3
 18: (55) if r5 != 0x2 goto pc+2
 19: (b7) r0 = 0
 20: (95) exit
 21: (55) if r2 != 0x2 goto pc+4
 22: (55) if r4 != 0xc3 goto pc+3
 23: (55) if r5 != 0x3 goto pc+2
 24: (b7) r0 = 1
 25: (95) exit
 26: (b7) r0 = 1
 27: (95) exit
```
From the output, it is clear that GPU 3 (among 0,1,2,3) is the one that is 
attached to that job's cgroup.
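
(Decoding the checks: the lower 16 bits of the first context word are the
device type, 0x2 meaning a character device; 0xc3 is major 195, the NVIDIA
character-device major; r5 is the minor. Minors 0-2 of that major are denied
(r0 = 0), while minor 3 and anything that is not an NVIDIA character device
falls through to allow (r0 = 1). That matches the device nodes themselves, e.g.:)

  ls -l /dev/nvidia3
  # crw-rw-rw- 1 root root 195, 3 ...    <- major 195 = 0xc3, minor 3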

However, I was looking for a way to dump eBPF maps that could directly provide 
the major and minor numbers and permissions of the device, as discussed in the 
comment (https://bugzilla.redhat.com/show_bug.cgi?id=1717396#c5). When I 
inspect the eBPF program, I don't see any maps associated with it.

```
# /tmp/bpftool prog list id 4197
4197: cgroup_device  name Slurm_Cgroup_v2  tag 1a261c8a913ff67c  gpl
   loaded_at 2024-01-02T08:19:56+0100  uid 0
   xlated 224B  jited 142B  memlock 4096B
```
So, my second question is: how can I get information similar to a `map dump` 
that gives the device's major and minor numbers directly, instead of parsing 
the bytecode from `prog dump`?

I am still discovering the eBPF ecosystem so if I am missing something very 
obvious, please let me know. I would really appreciate that.

Cheers!

Regards
Mahendra



[slurm-users] Slurm version 23.11.2 is now available

2024-01-23 Thread Tim McMullan

We are pleased to announce the availability of Slurm version 23.11.2.

The 23.11.2 release includes a number of stability improvements and various 
bug fixes. Some notable changes include several fixes to the new 
scontrol reconfigure method, including one that could result in jobs 
getting cancelled prematurely, a couple of errors that resulted in the 
backup slurmctld stopping on fail-back, and an issue during upgrades 
with older MySQL versions with a small max_allowed_packet value for 
sites with a large number of associations.


Slurm can be downloaded from https://www.schedmd.com/downloads.php .

-Tim



* Changes in Slurm 23.11.2
==
 -- slurmrestd - Reject single http query with multiple path requests.
 -- Fix launching Singularity v4.x containers with srun --container by setting
.process.terminal to true in generated config.json when step has
pseudoterminal (--pty) requested.
 -- Fix loading in dynamic/cloud node jobs after net_cred expired.
 -- Fix cgroup null path error on slurmd/slurmstepd tear down.
 -- data_parser/v0.0.40 - Prevent failure if accounting is disabled, instead
issue a warning if needed data from the database can not be retrieved.
 -- openapi/slurmctld - Prevent failure if accounting is disabled.
 -- Prevent slurmscriptd processing delays from blocking other threads in
slurmctld while trying to launch various scripts. This is additional work
for a fix in 23.02.6.
 -- Fix memory leak when receiving alias addrs from controller.
 -- scontrol - Accept `scontrol token lifespan=infinite` to create tokens that
effectively do not expire.
 -- Avoid errors when Slurmdb accounting disabled when '--json' or '--yaml' is
invoked with CLI commands and slurmrestd. Add warnings when query would
have populated data from Slurmdb instead of errors.
 -- Fix slurmctld memory leak when running job with --tres-per-task=gres:shard:#
 -- Fix backfill trying to start jobs outside of backfill window.
 -- Fix oversubscription on partitions with PreemptMode=OFF.
 -- Preserve node reason on power up if the node is downed or drained.
 -- data_parser/v0.0.40 - Avoid aborting when invoking a not implemented
parser.
 -- data_parser/v0.0.40 - Fix how nice values are parsed for job submissions.
 -- data_parser/v0.0.40 - Fix regression where parsing error did not result in
invalid request being rejected.
 -- Fix segfault in front-end node registration.
 -- Prevent jobs using none typed gpus from being killed by the controller after
a reconfig or restart.
 -- Fix deadlock situation in the dbd when adding associations.
 -- Update default values of text/blob columns when updating from old mysql
versions in more situations.  This improves a previous fix to handle an
uncommon case when upgrading mysql/mariadb.
 -- Fix rpmbuild in openSUSE/SLES due to incorrect mariadb dependency.
 -- Fix compilation on RHEL 7.
 -- When upgrading the slurmdbd to 23.11, avoid generating a query to update
the association table that is larger than max_allowed_packet which would
result in an upgrade failure.
 -- Fix rare deadlock when a dynamic node registers at the same time that a
once per minute background task occurs.
 -- Fix build issue on 32-bit systems.
 -- data_parser/v0.0.40 - Fix enumerated strings in OpenAPI specification not
have type field specified.
 -- Improve scontrol show job -d information of used shared gres (shard/mps)
topology.
 -- Allow Slurm to compile without MUNGE if --without-munge is used as an
argument to configure.
 -- accounting_storage/mysql - Fix usage query to use new lineage column
instead of lft/rgt.
 -- slurmrestd - Improve handling of missing parsers when content plugins
expect parsers not loaded.
 -- slurmrestd - Correct parsing of StepIds when querying jobs.
 -- slurmrestd - Improve error from parsing failures of lists.
 -- slurmrestd - Improve parsing of singular values for lists.
 -- accounting_storage/mysql - Fix PrivateData=User when listing associations.
 -- Disable sorting of dynamic nodes to avoid issues when restarting with
heterogenous jobs that cause jobs to abort on restart.
 -- Don't allow deletion of non-dynamic nodes.
 -- accounting_storage/mysql - Fix issue adding partition based associations.
 -- Respect non-"slurm" settings for I_MPI_HYDRA_BOOTSTRAP and HYDRA_BOOTSTRAP
and avoid injecting the --external-launcher option which will cause
mpirun/mpiexec to fail with an unexpected argument error.
 -- Fix bug where scontrol hold would change node count for jobs with
implicitly defined node counts.
 -- data_parser/v0.0.40 - Fix regression of support for "hold" in
job description.
 -- Avoid sending KILL RPCs to unresolvable POWERING_UP and POWERED_DOWN nodes.
 -- data_parser/v0.0.38 - Fix several potential NULL dereferences that could
cause slurmrestd to crash.
 -- Add --gres-flags=one-task-per-sharing. Do not allow different tasks in to be
allocated shared gres from t

[slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Jesse Aiton
Hello Slurm Folks,

I have a weird issue where on the same server, which acts as both a controller 
and a node, slurmctld can’t find cred_munge.so

slurmctld: debug3: Trying to load plugin 
/app/slurm-24.0.8/lib/slurm/cred_munge.so
slurmctld: debug4: /app/slurm-24.0.8/lib/slurm/cred_munge.so: Does not exist or 
not a regular file.
slurmctld: error: Couldn't find the specified plugin name for cred/munge 
looking at all files
slurmctld: error: cannot open plugin directory /app/slurm-24.0.8/lib/slurm
slurmctld: error: cannot find cred plugin for cred/munge
slurmctld: error: cannot create cred context for cred/munge
slurmctld: fatal: failed to initialize cred plugin

But slurmd can:

slurmd: debug3: Trying to load plugin /app/slurm-24.0.8/lib/slurm/cred_munge.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Munge credential signature plugin type:cred/munge version:0x180800
slurmd: cred/munge: init: Munge credential signature plugin loaded
slurmd: debug3: Success.

This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8 

Thank you,

Jesse


# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=prod-cluster
SlurmctldHost=controller
#
#MailProg=/bin/mail
#MpiDefault=
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
#SwitchType=
TaskPlugin=task/affinity,task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageType=
#JobAcctGatherFrequency=30
#JobAcctGatherType=
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=controller CPUs=1 State=UNKNOWN
NodeName=node CPUs=1 State=UNKNOWN
PartitionName=prod-part Nodes=ALL Default=YES MaxTime=INFINITE State=UP




Re: [slurm-users] [EXT] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Sean Crosby
slurmctld runs as the user slurm, whereas slurmd runs as root.

Make sure the permissions on /app/slurm-24.0.8/lib/slurm allow the user slurm 
to read the files

e.g. you could do (as root)

sudo -u slurm ls /app/slurm-24.0.8/lib/slurm

and see if the slurm user can read the directory (as well as the libraries 
within it)
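
(namei is also handy here, as a sketch; it shows the ownership and permissions
of every path component in one go, which is usually where a "cannot open
plugin directory" error comes from:)

  namei -l /app/slurm-24.0.8/lib/slurm/cred_munge.so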

Sean

From: slurm-users  on behalf of Jesse 
Aiton 
Sent: Wednesday, 24 January 2024 10:14
To: slurm-users@lists.schedmd.com 
Subject: [EXT] [slurm-users] error: Couldn't find the specified plugin name for 
cred/munge looking at all files


Hello Slurm Folks,

I have a weird issue where on the same server, which acts as both a controller 
and a node, slurmctld can’t find cred_munge.so

slurmctld: debug3: Trying to load plugin 
/app/slurm-24.0.8/lib/slurm/cred_munge.so
slurmctld: debug4: /app/slurm-24.0.8/lib/slurm/cred_munge.so: Does not exist or 
not a regular file.
slurmctld: error: Couldn't find the specified plugin name for cred/munge 
looking at all files
slurmctld: error: cannot open plugin directory /app/slurm-24.0.8/lib/slurm
slurmctld: error: cannot find cred plugin for cred/munge
slurmctld: error: cannot create cred context for cred/munge
slurmctld: fatal: failed to initialize cred plugin

But slurmd can:

slurmd: debug3: Trying to load plugin /app/slurm-24.0.8/lib/slurm/cred_munge.so
slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Munge credential signature plugin type:cred/munge version:0x180800
slurmd: cred/munge: init: Munge credential signature plugin loaded
slurmd: debug3: Success.

This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8

Thank you,

Jesse


# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=prod-cluster
SlurmctldHost=controller
#
#MailProg=/bin/mail
#MpiDefault=
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
#SwitchType=
TaskPlugin=task/affinity,task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageType=
#JobAcctGatherFrequency=30
#JobAcctGatherType=
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
#SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
NodeName=controller CPUs=1 State=UNKNOWN
NodeName=node CPUs=1 State=UNKNOWN
PartitionName=prod-part Nodes=ALL Default=YES MaxTime=INFINITE State=UP




Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Ryan Novosielski
On Jan 23, 2024, at 18:14, Jesse Aiton  wrote:

This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8

Thank you,

Jesse

I’m not sure what version you’re actually running, but I don’t believe there is 
a 24.0.8. The latest version I’m aware of is 23.11.2.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'


Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Jesse Aiton
Yeah, 24.0.8 is the bleeding edge version.  I wanted to try the latest in case 
it was a bug in 20.x.x.  I’m happy to go back to any older Slurm version but I 
don’t think that will matter much if the issue occurs on both Slurm 20 and 
Slurm 24.

git clone https://github.com/SchedMD/slurm.git
Thanks,

Jesse

> On Jan 23, 2024, at 4:07 PM, Ryan Novosielski  wrote:
> 
>> On Jan 23, 2024, at 18:14, Jesse Aiton  wrote:
>> 
>> This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8 
>> 
>> Thank you,
>> 
>> Jesse
> 
> I’m not sure what version you’re actually running, but I don’t believe there 
> is a 24.0.8. The latest version I’m aware of is 23.11.2.
> 
> --
> #BlackLivesMatter
> 
> || \\UTGERS, |---*O*---
> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
>  `'



Re: [slurm-users] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Ryan Novosielski
Ah, I see — no, it’s 24.08. That’s why I didn’t find any reference to it.

Carry on! :-D

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'

On Jan 23, 2024, at 19:13, Jesse Aiton  wrote:

Yeah, 24.0.8 is the bleeding edge version.  I wanted to try the latest in case 
it was a bug in 20.x.x.  I’m happy to go back to any older Slurm version but I 
don’t think that will matter much if the issue occurs on both Slurm 20 and 
Slurm 24.


git clone https://github.com/SchedMD/slurm.git

Thanks,

Jesse

On Jan 23, 2024, at 4:07 PM, Ryan Novosielski  wrote:

On Jan 23, 2024, at 18:14, Jesse Aiton  wrote:

This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8

Thank you,

Jesse

I’m not sure what version you’re actually running, but I don’t believe there is 
a 24.0.8. The latest version I’m aware of is 23.11.2.

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
 `'




Re: [slurm-users] [EXT] error: Couldn't find the specified plugin name for cred/munge looking at all files

2024-01-23 Thread Jesse Aiton
Hi Sean,

Thank you!  It was a permissions issue and it’s not complaining anymore about 
cred/munge.

I appreciate your help.

Thanks,

Jesse

> On Jan 23, 2024, at 3:34 PM, Sean Crosby  wrote:
> 
> slurmctld runs as the user slurm, whereas slurmd runs as root.
> 
> Make sure the permissions on /app/slurm-24.0.8/lib/slurm allow the user slurm 
> to read the files
> 
> e.g. you could do (as root)
> 
> sudo -u slurm ls /app/slurm-24.0.8/lib/slurm
> 
> and see if the slurm user can read the directory (as well as the libraries 
> within it)
> 
> Sean
> From: slurm-users  on behalf of Jesse Aiton 
> Sent: Wednesday, 24 January 2024 10:14
> To: slurm-users@lists.schedmd.com 
> Subject: [EXT] [slurm-users] error: Couldn't find the specified plugin name 
> for cred/munge looking at all files
>  
> 
> Hello Slurm Folks,
> 
> I have a weird issue where on the same server, which acts as both a 
> controller and a node, slurmctld can’t find cred_munge.so
> 
> slurmctld: debug3: Trying to load plugin 
> /app/slurm-24.0.8/lib/slurm/cred_munge.so
> slurmctld: debug4: /app/slurm-24.0.8/lib/slurm/cred_munge.so: Does not exist 
> or not a regular file.
> slurmctld: error: Couldn't find the specified plugin name for cred/munge 
> looking at all files
> slurmctld: error: cannot open plugin directory /app/slurm-24.0.8/lib/slurm
> slurmctld: error: cannot find cred plugin for cred/munge
> slurmctld: error: cannot create cred context for cred/munge
> slurmctld: fatal: failed to initialize cred plugin
> 
> But slurmd can:
> 
> slurmd: debug3: Trying to load plugin 
> /app/slurm-24.0.8/lib/slurm/cred_munge.so
> slurmd: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
> name:Munge credential signature plugin type:cred/munge version:0x180800
> slurmd: cred/munge: init: Munge credential signature plugin loaded
> slurmd: debug3: Success.
> 
> This is on Ubuntu 20.04 and happens both with Slurm 20.11.09 and 24.0.8 
> 
> Thank you,
> 
> Jesse
> 
> 
> # slurm.conf file generated by configurator easy.html.
> # Put this file on all nodes of your cluster.
> # See the slurm.conf man page for more information.
> #
> ClusterName=prod-cluster
> SlurmctldHost=controller
> #
> #MailProg=/bin/mail
> #MpiDefault=
> #MpiParams=ports=#-#
> ProctrackType=proctrack/cgroup
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> #SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> #SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> #SlurmdUser=root
> StateSaveLocation=/var/spool/slurmctld
> #SwitchType=
> TaskPlugin=task/affinity,task/cgroup
> #
> #
> # TIMERS
> #KillWait=30
> #MinJobAge=300
> #SlurmctldTimeout=120
> #SlurmdTimeout=300
> #
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> #
> #
> # LOGGING AND ACCOUNTING
> #AccountingStorageType=
> #JobAcctGatherFrequency=30
> #JobAcctGatherType=
> #SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurmctld.log
> #SlurmdDebug=info
> SlurmdLogFile=/var/log/slurmd.log
> #
> #
> # COMPUTE NODES
> NodeName=controller CPUs=1 State=UNKNOWN
> NodeName=node CPUs=1 State=UNKNOWN
> PartitionName=prod-part Nodes=ALL Default=YES MaxTime=INFINITE State=UP