Our test nodes are all connected to LDAP through SSSD, and I have confirmed that the test user exists on both the submit node and the compute node (getent passwd XXX shows exactly the same result on both nodes).
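
For anyone who wants to repeat that check, here is a minimal sketch of the comparison. It only looks at the UID and GID fields of two passwd-format lines, such as the ones getent passwd prints. The helper name same_uid_gid is hypothetical (it is not part of Slurm or SSSD), and the remote call in the usage comment assumes an account that can still log in to the compute node.

```shell
#!/bin/sh
# Sketch only: same_uid_gid is a hypothetical helper, not a Slurm or SSSD tool.
# It compares the UID and GID fields of two passwd-format entries.
# passwd format: name:password:UID:GID:GECOS:home:shell
same_uid_gid() {
    # Fields 3 and 4 are UID and GID; -s drops input that has no ':' at all.
    a=$(printf '%s\n' "$1" | cut -s -d: -f3,4)
    b=$(printf '%s\n' "$2" | cut -s -d: -f3,4)
    # An empty result means the lookup returned nothing on that node.
    [ -n "$a" ] && [ -n "$b" ] || return 2
    # Succeeds (status 0) only when both UID and GID match.
    [ "$a" = "$b" ]
}

# Possible usage (assumption: run from an account that can reach cas639):
#   same_uid_gid "$(getent passwd tester)" "$(ssh cas639 getent passwd tester)"
```

The function returns 0 when both fields match, 1 when they differ, and 2 when either entry is empty, so a missing user is not mistaken for a mismatch.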

Brian Andrus wrote:
> pam_slurm_adopt is a PAM module, so it does not talk to slurmd.
> It looks like it is having trouble matching the uid info for your tester
> user. Is that a local account? It needs to be available with the same
> uid/gid on both the submitting node and the node it is trying to run on.
>
> Brian Andrus
>
> On 12/16/2025 3:00 AM, hermes via slurm-users wrote:
> > Hello everyone:
> >
> > We recently upgraded Slurm to 25.11 and found that pam_slurm_adopt is
> > broken. As a result, users cannot ssh to the compute node where their
> > jobs are running.
> >
> > Of course, we have made sure that all the Slurm-related packages were
> > upgraded together.
> >
> > The simplest test process looks like this:
> >
> > $ cat test1.sh
> > #!/bin/bash
> > #SBATCH --job-name=test
> > #SBATCH --partition=debug
> > #SBATCH --nodes=1
> > #SBATCH --nodelist=cas639
> > sleep 6000
> >
> > $ sbatch test1.sh
> > Submitted batch job 51047091
> >
> > $ squeue
> >    JOBID PARTITION  NAME    USER ST  TIME NODES NODELIST(REASON)
> > 51047091     debug  test  tester  R  0:28     1 cas639
> >
> > $ ssh cas639
> > *(wait for a long time...)*
> > Connection closed by 172.16.3.129 port 22 *(finally failed to ssh)*
> >
> > On the target compute node, we can see the following debug messages
> > from pam_slurm_adopt.so:
> >
> > cas639 pam_slurm_adopt[1007301]: debug2: _establish_config_source: using config_file=/etc/slurm/slurm.conf (default)
> > cas639 pam_slurm_adopt[1007301]: debug: slurm_conf_init: using config_file=/etc/slurm/slurm.conf
> > cas639 pam_slurm_adopt[1007301]: debug: Reading slurm.conf file: /etc/slurm/slurm.conf
> > cas639 pam_slurm_adopt[1007301]: PreemptMode=GANG is a cluster-wide option and cannot be set at partition level, option ignored.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge authentication plugin type:auth/munge version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug: auth/munge: init: loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/certgen_script.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Certificate generation script plugin type:certgen/script version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug: certgen/script: init: loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:KangarooTwelve hash plugin type:hash/k12 version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/tls_none.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Null tls plugin type:tls/none version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug: tls/none: init: tls/none loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_slurmdbd.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Accounting storage SLURMDBD plugin type:accounting_storage/slurmdbd version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/cred_munge.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: cred/munge: init: Munge credential signature plugin loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
> > cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.extern
> > cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.batch
> > cas639 pam_slurm_adopt[1007301]: Connection by user tester: user has only one job 51047091
> > cas639 pam_slurm_adopt[1007301]: debug: _adopt_process: trying to get StepId=51047091.extern to adopt 1007301
> > cas639 pam_slurm_adopt[1007301]: debug: Leaving stepd_add_extern_pid
> > cas639 pam_slurm_adopt[1007301]: debug: Leaving stepd_get_x11_display
> > cas639 pam_slurm_adopt[1007301]: debug: entering stepd_get_namespace_fd
> >
> > It looks like something blocks during stepd_get_namespace_fd? We found
> > nothing in the slurmd log even with *SlurmdDebug = debug5*, so I guess
> > the PAM module had not reached the step where it talks to slurmd (if
> > it should).
> >
> > Could it be a compatibility problem between Slurm 25.11 and an EL8
> > system or cgroup/v1?
> >
> > Or can anyone suggest how to further locate the fault point?
> >
> > Best regards,
> >
> > Hermes
