pam_slurm_adopt is a PAM module, so it does not talk to slurmd.
It looks like it is having trouble matching the uid info for your tester
user. Is that a local account? It needs to be available with the same
uid/gid on both the submitting node and the node it is trying to run on.
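A rough way to check this, assuming `getent` is available on both nodes
and that an admin account can still ssh to the compute node. The helper
names `ids_of` and `same_ids` are made up for illustration and are not
part of Slurm:

```shell
#!/bin/bash
# Hypothetical helper: extract "uid:gid" from one passwd-format line
# (name:passwd:uid:gid:gecos:home:shell), e.g. output of `getent passwd tester`.
ids_of() {
    printf '%s\n' "$1" | awk -F: '{ print $3 ":" $4 }'
}

# Hypothetical helper: succeed only if both lines are non-empty and
# carry the same uid and gid.
same_ids() {
    [ -n "$1" ] && [ -n "$2" ] && [ "$(ids_of "$1")" = "$(ids_of "$2")" ]
}

# Usage sketch, run on the submit node ("cas639" and "tester" are taken
# from the quoted report; root ssh access is an assumption):
#   local_line=$(getent passwd tester)
#   remote_line=$(ssh root@cas639 getent passwd tester)
#   same_ids "$local_line" "$remote_line" && echo "uid/gid match" || echo "uid/gid MISMATCH"
```

An empty result from `getent` on either side would also count as a
mismatch here, which is the case of an account that exists on only one
node.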
Brian Andrus
On 12/16/2025 3:00 AM, hermes via slurm-users wrote:
Hello everyone:
We recently upgraded Slurm to 25.11 and found that pam_slurm_adopt is
broken. As a result, users cannot ssh to the compute node where
their jobs are running.
We have made sure that all the Slurm-related packages were
upgraded together.
The simplest test procedure is:
```
> cat test1.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --nodelist=cas639
sleep 6000
> sbatch test1.sh
Submitted batch job 51047091
> squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
51047091 debug test tester R 0:28 1 cas639
> ssh cas639
*(wait for a long time...)*
Connection closed by 172.16.3.129 port 22 *(finally failed to ssh)*
```
On the target compute node, we can see the following debug messages
from pam_slurm_adopt.so:
```
cas639 pam_slurm_adopt[1007301]: debug2: _establish_config_source: using config_file=/etc/slurm/slurm.conf (default)
cas639 pam_slurm_adopt[1007301]: debug: slurm_conf_init: using config_file=/etc/slurm/slurm.conf
cas639 pam_slurm_adopt[1007301]: debug: Reading slurm.conf file: /etc/slurm/slurm.conf
cas639 pam_slurm_adopt[1007301]: PreemptMode=GANG is a cluster-wide option and cannot be set at partition level, option ignored.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge authentication plugin type:auth/munge version:0x190b00
cas639 pam_slurm_adopt[1007301]: debug: auth/munge: init: loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/certgen_script.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Certificate generation script plugin type:certgen/script version:0x190b00
cas639 pam_slurm_adopt[1007301]: debug: certgen/script: init: loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:KangarooTwelve hash plugin type:hash/k12 version:0x190b00
cas639 pam_slurm_adopt[1007301]: debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/tls_none.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Null tls plugin type:tls/none version:0x190b00
cas639 pam_slurm_adopt[1007301]: debug: tls/none: init: tls/none loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_slurmdbd.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Accounting storage SLURMDBD plugin type:accounting_storage/slurmdbd version:0x190b00
cas639 pam_slurm_adopt[1007301]: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/cred_munge.so
cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x190b00
cas639 pam_slurm_adopt[1007301]: cred/munge: init: Munge credential signature plugin loaded
cas639 pam_slurm_adopt[1007301]: debug3: Success.
cas639 pam_slurm_adopt[1007301]: debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.extern
cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.batch
cas639 pam_slurm_adopt[1007301]: Connection by user tester: user has only one job 51047091
cas639 pam_slurm_adopt[1007301]: debug: _adopt_process: trying to get StepId=51047091.extern to adopt 1007301
cas639 pam_slurm_adopt[1007301]: debug: Leaving stepd_add_extern_pid
cas639 pam_slurm_adopt[1007301]: debug: Leaving stepd_get_x11_display
cas639 pam_slurm_adopt[1007301]: debug: entering stepd_get_namespace_fd
```
It looks like something blocks inside stepd_get_namespace_fd, since
that is the last message logged. We found nothing in the slurmd log
even with *SlurmdDebug = debug5*, so I guess the PAM module never
reached the point of talking to slurmd (if it is supposed to).
Could this be a compatibility problem between Slurm 25.11 and an EL8
system or cgroup/v1?
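For reference, a minimal sketch of how to confirm which cgroup mode a
compute node is actually running. It relies on `stat -fc %T` reporting
the filesystem type mounted at /sys/fs/cgroup; the helper name
`cgroup_mode_from_fstype` is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: map the filesystem type mounted at /sys/fs/cgroup
# to a cgroup mode. "cgroup2fs" means the unified v2 hierarchy; "tmpfs"
# is what v1 (or hybrid) setups normally report.
cgroup_mode_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# Usage sketch on the compute node:
#   cgroup_mode_from_fstype "$(stat -fc %T /sys/fs/cgroup)"
#   grep -i '^CgroupPlugin' /etc/slurm/cgroup.conf
```

Comparing this with any CgroupPlugin setting in cgroup.conf at least
shows whether the node and the Slurm configuration agree.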
Or can anyone suggest how to narrow down where exactly it fails?
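One possible approach, sketched here with standard /proc files rather
than any Slurm tooling, is to look at what the hung process is blocked
on while the ssh attempt is still waiting. The PID is the one shown in
brackets in the pam_slurm_adopt log lines; both helper names are
hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: pull the PID out of a syslog line such as
#   "cas639 pam_slurm_adopt[1007301]: debug: entering stepd_get_namespace_fd"
pam_pid_from_log() {
    printf '%s\n' "$1" | sed -n 's/.*pam_slurm_adopt\[\([0-9][0-9]*\)\].*/\1/p'
}

# Hypothetical helper: print what a (possibly hung) process is waiting on.
# Reading /proc/<pid>/stack normally requires root.
show_blocked_state() {
    local pid=$1
    [ -d "/proc/$pid" ] || { echo "pid $pid not running"; return 1; }
    echo "state: $(awk '/^State:/ { print $2, $3 }' "/proc/$pid/status")"
    echo "wchan: $(cat "/proc/$pid/wchan" 2>/dev/null)"
    cat "/proc/$pid/stack" 2>/dev/null   # kernel stack, root only
    ls -l "/proc/$pid/fd" 2>/dev/null    # open files and sockets
}

# Usage sketch on the compute node, as root, during the hang:
#   show_blocked_state 1007301
#   strace -f -p 1007301     # shows the syscall it is stuck in, if strace is installed
```

If the process turns out to be waiting on a socket read, the other end
is probably the slurmstepd of the extern step, which can be inspected
the same way.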
Best regards,
Hermes