[slurm-users] sacctmgr: error: Sending PersistInit msg: Connection refused

2025-05-17 Thread Ratnasamy, Fritz via slurm-users
Hi, we are working on a test cluster with Slurm 24.11.3, and I am getting this error message from the login and compute nodes (note that the error does not appear when the command is run from the controller node): sacctmgr list associations tree format=cluster,account,user,maxnodes sacctmgr: error: _open_persist_co

[slurm-users] New version of slurm

2025-05-16 Thread Ratnasamy, Fritz via slurm-users
Hi, I am trying to install the new version of slurm. Do you know if there is a way to find out what support is compiled into the executables? For example, apache has httpd -L which shows all the loaded modules. See below result: [image: image.png]

[slurm-users] Suspending jobs and resuming

2024-11-21 Thread Ratnasamy, Fritz via slurm-users
Hi, I am using an old slurm version 20.11.8 and we had to reboot our cluster today for maintenance. I suspended all the jobs on it with the command scontrol suspend list_job_ids and all the jobs paused and were suspended. However, when I tried to resume them after the reboot, scontrol resume did

[slurm-users] Re: Not being able to ssh to node with running job

2024-06-06 Thread Ratnasamy, Fritz via slurm-users
On Thu, Jun 6, 2024 at 2:11 PM Ratnasamy, Fritz via slurm-users <slurm-users@lists.schedmd.com> wrote: > As admin on the cluster, we do not observe any issue on our newly added > gpu nodes. > However, for regular users, they

[slurm-users] Not being able to ssh to node with running job

2024-06-06 Thread Ratnasamy, Fritz via slurm-users
As admins on the cluster, we do not observe any issue on our newly added GPU nodes. However, regular users do not see their jobs running on these GPU nodes when running squeue -u (the jobs do, however, show as running in sacct), and they are not able to ssh to these newly added

[slurm-users] Removing safely a node

2024-05-16 Thread Ratnasamy, Fritz via slurm-users
Hi, What is the "official" process to remove nodes safely? I have drained the nodes so jobs are completed and put them in down state after they are completely drained. I edited the slurm.conf file to remove the nodes. After some time, I can see that the nodes were removed from the partition with