Our test nodes are all connected to LDAP through SSSD, and I have confirmed 
that the test user exists on both the submit and compute nodes (getent passwd 
XXX shows exactly the same result on both nodes).
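
For anyone who wants to repeat the check, here is a minimal sketch of how the uid/gid comparison can be scripted. The helper name same_ids is made up for illustration, and the commented usage lines assume an account that is allowed to ssh to the compute node (for example an admin account that pam_slurm_adopt does not restrict); "tester" and "cas639" are the user and node from the quoted report.

```shell
#!/bin/bash
# Compare the uid and gid fields of two passwd-format entries.
# Field layout follows passwd(5): name:password:uid:gid:gecos:home:shell
# same_ids is a hypothetical helper, not part of Slurm or SSSD.
same_ids() {
    local a b
    a=$(printf '%s\n' "$1" | cut -d: -f3,4)
    b=$(printf '%s\n' "$2" | cut -d: -f3,4)
    # Succeed only if both entries exist and their uid:gid pairs are equal.
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Assumed usage from the submit node (not run here):
# local_entry=$(getent passwd tester)
# remote_entry=$(ssh cas639 getent passwd tester)
# if same_ids "$local_entry" "$remote_entry"; then
#     echo "uid/gid match"
# else
#     echo "uid/gid differ or user missing"
# fi
```

Only the uid and gid fields are compared, so differences in the shell or home directory fields do not cause a mismatch.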

Brian Andrus wrote:
> pam_slurm_adopt is a pam module, so does not talk to slurmd.
> It looks like it is having trouble matching the uid info for your tester 
> user. Is that a local account? It needs to be available with the same 
> uid/gid on both the submitting node and the node it is trying to run on.
> Brian Andrus
> On 12/16/2025 3:00 AM, hermes via slurm-users wrote:
> > Hello everyone:
> > We recently upgraded Slurm to 25.11 and found that pam_slurm_adopt is 
> > broken. As a result, users cannot ssh to the compute node where 
> > their jobs are running.
> > We have of course made sure that all the Slurm-related packages were 
> > upgraded together.
> > The simplest test procedure is:
> > 
> > > cat test1.sh
> > #!/bin/bash
> > #SBATCH --job-name=test
> > #SBATCH --partition=debug
> > #SBATCH --nodes=1
> > #SBATCH --nodelist=cas639
> > sleep 6000
> > 
> > > sbatch test1.sh
> > Submitted batch job 51047091
> > 
> > > squeue
> >              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >           51047091 debug     test   tester  R       0:28      1 cas639
> > 
> > > ssh cas639
> > *(wait for a long time...)*
> > Connection closed by 172.16.3.129 port 22 *(finally failed to ssh)*
> > 
> > 
> > On the target compute node, we can see the following debug messages 
> > from pam_slurm_adopt.so:
> > 
> > cas639 pam_slurm_adopt[1007301]: debug2: _establish_config_source: 
> > using config_file=/etc/slurm/slurm.conf (default)
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  slurm_conf_init: using 
> > config_file=/etc/slurm/slurm.conf
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  Reading slurm.conf file: 
> > /etc/slurm/slurm.conf
> > 
> > cas639 pam_slurm_adopt[1007301]: PreemptMode=GANG is a cluster-wide 
> > option and cannot be set at partition level, option ignored.
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin 
> > /usr/lib64/slurm/auth_munge.so
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: 
> > plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge 
> > authentication plugin type:auth/munge version:0x190b00
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  auth/munge: init: loaded
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin 
> > /usr/lib64/slurm/certgen_script.so
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: 
> > plugin_load_from_file->_verify_syms: found Slurm plugin 
> > name:Certificate generation script plugin type:certgen/script 
> > version:0x190b00
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  certgen/script: init: loaded
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin 
> > /usr/lib64/slurm/hash_k12.so
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: 
> > plugin_load_from_file->_verify_syms: found Slurm plugin 
> > name:KangarooTwelve hash plugin type:hash/k12 version:0x190b00
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  hash/k12: init: init: 
> > KangarooTwelve hash plugin loaded
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin 
> > /usr/lib64/slurm/tls_none.so
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: 
> > plugin_load_from_file->_verify_syms: found Slurm plugin name:Null tls 
> > plugin type:tls/none version:0x190b00
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  tls/none: init: tls/none loaded
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin 
> > /usr/lib64/slurm/accounting_storage_slurmdbd.so
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: 
> > plugin_load_from_file->_verify_syms: found Slurm plugin 
> > name:Accounting storage SLURMDBD plugin 
> > type:accounting_storage/slurmdbd version:0x190b00
> > 
> > cas639 pam_slurm_adopt[1007301]: accounting_storage/slurmdbd: init: 
> > Accounting storage SLURMDBD plugin loaded
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin 
> > /usr/lib64/slurm/cred_munge.so
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: 
> > plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge 
> > credential signature plugin type:cred/munge version:0x190b00
> > 
> > cas639 pam_slurm_adopt[1007301]: cred/munge: init: Munge credential 
> > signature plugin loaded
> > 
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  Reading cgroup.conf file 
> > /etc/slurm/cgroup.conf
> > 
> > cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.extern
> > 
> > cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.batch
> > 
> > cas639 pam_slurm_adopt[1007301]: Connection by user tester: user has 
> > only one job 51047091
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  _adopt_process: trying to get 
> > StepId=51047091.extern to adopt 1007301
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  Leaving stepd_add_extern_pid
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  Leaving stepd_get_x11_display
> > 
> > cas639 pam_slurm_adopt[1007301]: debug:  entering stepd_get_namespace_fd
> > 
> > 
> > It looks like something blocks in stepd_get_namespace_fd. We 
> > found nothing in the slurmd log even with *SlurmdDebug = debug5*, so I 
> > guess the PAM module never reached the point of talking to slurmd (if 
> > it is supposed to).
> > Could this be a compatibility problem between Slurm 25.11 and EL8 
> > or cgroup/v1?
> > Can anyone suggest how to further locate 
> > the point of failure?
> > Best regards,
> > Hermes
> >
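
On the question of locating the fault point: one generic Linux approach, independent of Slurm, is to look at the stuck login process under /proc on the compute node while the ssh attempt is still hanging. The sketch below is only an assumption about what might be useful here and is not taken from the Slurm documentation. inspect_pid is a hypothetical helper name, and the PID to pass would be the one pam_slurm_adopt logs in brackets for the hanging attempt.

```shell
#!/bin/bash
# Print basic state for a possibly blocked process using /proc (Linux only).
# inspect_pid is a hypothetical helper, not a Slurm tool.
inspect_pid() {
    local pid=$1
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # Process name and scheduler state (for example S = sleeping, D = uninterruptible).
    grep -E '^(Name|State):' "/proc/$pid/status"
    # Kernel function the process is waiting in, if the kernel exposes it.
    printf 'wchan:\t%s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null)"
    # Open descriptors; a socket listed here may be the connection being waited on.
    ls -l "/proc/$pid/fd" 2>/dev/null
    return 0
}

# Assumed usage as root on the compute node (PID is a placeholder):
# inspect_pid 1007301
```

Reading another user's /proc/PID/fd generally requires root, so the descriptor listing may be empty when run as a regular user.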

-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
