Hello!

I'm also experiencing this problem on Rocky 8 machines.  I did test
switching a node over to cgroup v2, and it still failed in the same way.
Note that the slurmd from v25.05.5 works fine.  I've not tried isolating
it further, but likely will.  (Unless 25.11.1 arrives with a fix soon!)
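
In case it helps anyone comparing notes, this is roughly how I check what
a node is actually running cgroup-wise (stock paths; adjust if your layout
differs):

```
# cgroup2fs = unified cgroup v2 hierarchy; tmpfs = legacy cgroup v1 mount
stat -fc %T /sys/fs/cgroup

# what Slurm itself is configured to use (cgroup/v1, cgroup/v2, or autodetect)
grep -i CgroupPlugin /etc/slurm/cgroup.conf
```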



On Tue, 2025-12-16 at 10:58 -0800, Brian Andrus via slurm-users wrote:
> pam_slurm_adopt is a PAM module, so it does not talk to slurmd.
> It looks like it is having trouble matching the uid info for your
> tester user. Is that a local account? It needs to be available with
> the same uid/gid on both the submitting node and the node the job is
> running on.
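> For example, running these on both the submit host and cas639 should
> show the same uid/gid for the account (just a quick consistency check;
> nothing Slurm-specific is assumed here):
> ```
> id tester
> getent passwd tester
> ```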
> Brian Andrus
> On 12/16/2025 3:00 AM, hermes via slurm-users wrote:
> > Hello everyone:
> >  
> > We recently upgraded Slurm to 25.11 and found that pam_slurm_adopt got
> > broken. Because of this, users cannot ssh to the compute node where
> > their jobs are running.
> > Of course, we have made sure all the Slurm-related packages were
> > upgraded together.
> > The simplest test process looks like this:
> > ```
> > > cat test1.sh
> > #!/bin/bash
> > #SBATCH --job-name=test
> > #SBATCH --partition=debug
> > #SBATCH --nodes=1
> > #SBATCH --nodelist=cas639
> > sleep 6000
> >  
> > > sbatch test1.sh
> > Submitted batch job 51047091
> >  
> > > squeue
> >              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >           51047091     debug     test   tester  R       0:28      1 cas639
> >  
> > > ssh cas639
> > (wait for a long time...)
> > Connection closed by 172.16.3.129 port 22  (the ssh finally fails)
> > ```
> > On the target compute node, we can see the following debug messages
> > from pam_slurm_adopt.so:
> > ```
> > cas639 pam_slurm_adopt[1007301]: debug2: _establish_config_source: using config_file=/etc/slurm/slurm.conf (default)
> > cas639 pam_slurm_adopt[1007301]: debug:  slurm_conf_init: using config_file=/etc/slurm/slurm.conf
> > cas639 pam_slurm_adopt[1007301]: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
> > cas639 pam_slurm_adopt[1007301]: PreemptMode=GANG is a cluster-wide option and cannot be set at partition level, option ignored.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge authentication plugin type:auth/munge version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug:  auth/munge: init: loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/certgen_script.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Certificate generation script plugin type:certgen/script version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug:  certgen/script: init: loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:KangarooTwelve hash plugin type:hash/k12 version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/tls_none.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Null tls plugin type:tls/none version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: debug:  tls/none: init: tls/none loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_slurmdbd.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Accounting storage SLURMDBD plugin type:accounting_storage/slurmdbd version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug3: Trying to load plugin /usr/lib64/slurm/cred_munge.so
> > cas639 pam_slurm_adopt[1007301]: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge credential signature plugin type:cred/munge version:0x190b00
> > cas639 pam_slurm_adopt[1007301]: cred/munge: init: Munge credential signature plugin loaded
> > cas639 pam_slurm_adopt[1007301]: debug3: Success.
> > cas639 pam_slurm_adopt[1007301]: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
> > cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.extern
> > cas639 pam_slurm_adopt[1007301]: debug4: found StepId=51047091.batch
> > cas639 pam_slurm_adopt[1007301]: Connection by user tester: user has only one job 51047091
> > cas639 pam_slurm_adopt[1007301]: debug:  _adopt_process: trying to get StepId=51047091.extern to adopt 1007301
> > cas639 pam_slurm_adopt[1007301]: debug:  Leaving stepd_add_extern_pid
> > cas639 pam_slurm_adopt[1007301]: debug:  Leaving stepd_get_x11_display
> > cas639 pam_slurm_adopt[1007301]: debug:  entering stepd_get_namespace_fd
> > ```
> > It looks like something blocks during stepd_get_namespace_fd? And we
> > found nothing in the slurmd log even with SlurmdDebug=debug5, so I
> > guess the pam module never reached the point where it talks to slurmd
> > (if it should).
> > Could it be a compatibility problem between Slurm 25.11 and the EL8
> > system, or with cgroup/v1?
> > Or can anyone give some suggestions on how to further locate
> > the fault point?
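> > The only further idea we have so far is to attach strace to the sshd
> > process handling the hanging login on cas639 and see which call it
> > blocks in, roughly like this (the PID is just whichever sshd child
> > pgrep shows for the incoming connection):
> > ```
> > # on cas639, while the ssh attempt is still hanging
> > pgrep -af sshd
> > strace -f -tt -p <pid-of-the-sshd-child-handling-the-login>
> > ```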
> >  
> > Best regards,
> > Hermes
> > 
> > 
