Hello! I'm also experiencing this problem, on Rocky 8 machines. I did test switching a node over to cgroup v2, and it still failed in the same way. Note that slurmd from v25.05.5 works fine. I haven't tried isolating it further, but likely will. (Unless 25.11.1 lands with a fix soon!)
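In case it helps anyone comparing nodes while this gets narrowed down, something like the following is enough to confirm which cgroup mode a node actually ended up in after a switch (just a generic sketch using standard sysfs/systemd checks, nothing Slurm-specific; adjust to your own setup):

```
# Filesystem type of /sys/fs/cgroup tells you which hierarchy is mounted:
#   cgroup2fs -> unified hierarchy (cgroup v2)
#   tmpfs     -> legacy/hybrid hierarchy (cgroup v1)
stat -fc %T /sys/fs/cgroup/

# Kernel command line shows whether the unified hierarchy was forced on or off
# (prints nothing if the parameter is not set):
grep -o 'systemd.unified_cgroup_hierarchy=[01]' /proc/cmdline

# Cross-check what Slurm itself is configured to use:
scontrol show config | grep -iE 'cgroup|proctrack|taskplugin'
```

That at least rules out a node silently staying on v1 after a reboot.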
On Tue, 2025-12-16 at 10:58 -0800, Brian Andrus via slurm-users wrote:
> ---- External Email: Use caution with attachments, links, or sharing data ----
>
> pam_slurm_adopt is a pam module, so it does not talk to slurmd.
> It looks like it is having trouble matching the uid info for your tester user. Is that a local account? It needs to be available with the same uid/gid on both the submitting node and the node it is trying to run on.
>
> Brian Andrus
>
> On 12/16/2025 3:00 AM, hermes via slurm-users wrote:
> > Hello everyone:
> >
> > We recently upgraded Slurm to 25.11 and found that pam_slurm_adopt is broken. This means users cannot ssh to the compute node where their jobs are running.
> > Of course we have made sure all the Slurm-related packages were upgraded together.
> > The simplest test process is:
> > ```
> > > cat test1.sh
> > #!/bin/bash
> > #SBATCH --job-name=test
> > #SBATCH --partition=debug
> > #SBATCH --nodes=1
> > #SBATCH --nodelist=cas639
> > sleep 6000
> >
> > > sbatch test1.sh
> > Submitted batch job 51047091
> >
> > > squeue
> >     JOBID PARTITION  NAME    USER ST  TIME NODES NODELIST(REASON)
> >  51047091     debug  test  tester  R  0:28     1 cas639
> >
> > > ssh cas639
> > (wait for a long time...)
> > Connection closed by 172.16.3.129 port 22    (finally failed to ssh)
> > ```
> > On the target compute node, we see the following debug messages from pam_slurm_adopt.so:
> > ```
> > cas639 pam_slurm_adopt[1007301]: debug2: _establish_config_source: using config_file=/etc/slurm/slurm.conf (default)
> > cas639 pam_slurm_adopt[1007301]: debug: slurm_conf_init: using config_file=/etc/slurm/slurm.conf
> > cas639 pam_slurm_adopt[1007301]: debug: Reading slurm.conf file: /etc/slurm/slurm.conf
> > cas639 pam_slurm_adopt[1007301]: PreemptMode=GANG is a cluster-wide option and cannot be set at partition level, option ignored.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge authentication plugin type:auth/munge version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug: auth/munge: init: loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/certgen_script.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Certificate generation script plugin type:certgen/script version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug: certgen/script: init: loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:KangarooTwelve hash plugin type:hash/k12 version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/tls_none.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Null tls plugin type:tls/none version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug: tls/none: init: tls/none loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_slurmdbd.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Accounting storage SLURMDBD plugin type:accounting_storage/slurmdbd version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/cred_munge.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: cred/munge: init: Munge credential signature plugin loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
> > cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.extern
> > cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.batch
> > cas639 pam_slurm_adopt[1007301]: Connection by user tester: user has only one job 51047091
> > cas639 pam_slurm_adopt[1007301]: debug: _adopt_process: trying to get StepId=51047091.extern to adopt 1007301
> > cas639 pam_slurm_adopt[1007301]: debug: Leaving stepd_add_extern_pid
> > cas639 pam_slurm_adopt[1007301]: debug: Leaving stepd_get_x11_display
> > cas639 pam_slurm_adopt[1007301]: debug: entering stepd_get_namespace_fd
> > ```
> > It looks like something blocks during stepd_get_namespace_fd? And we found nothing in the slurmd log even with SlurmdDebug=debug5, so I guess the pam module never reached the step of talking to slurmd (if it should).
> > Could it be a compatibility problem between Slurm 25.11 and the EL8 system or cgroup/v1?
> > Or can anyone suggest how to further locate the fault point?
> >
> > Best regards,
> > Hermes
> >
> > --
> > slurm-users mailing list -- [email protected]
> > To unsubscribe send an email to [email protected]
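P.S. For anyone who wants to rule out the uid/gid mismatch Brian mentioned before digging into the pam module itself, comparing the account on the submit host and on the compute node is quick. A rough sketch (the user and node names are just the ones from the example above):

```
# Run both commands on the submitting host and on cas639, then compare the output:
getent passwd tester   # the passwd entry as NSS resolves it (uid, gid, home, shell)
id tester              # resolved uid, gid and supplementary groups
```

If those match on both sides, the uid lookup is probably not the issue and the hang in stepd_get_namespace_fd is the more interesting lead.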
