We’re repeatedly seeing long-running node allocations eventually stop 
accepting SSH connections.


Our cluster uses the pam_slurm_adopt module to allow users to SSH into nodes 
on which they hold an allocation. However, even if the allocated node is 
idle, after around 24 hours (we haven’t been able to pinpoint a more precise 
time frame yet) SSH through this module simply hangs until it times out.


Our admin users can still access the same node without problems, via a 
pam_listfile exception. Other users with allocations can also connect, until 
this limit is hit.


I recently noticed that during these times the extern step for the affected 
allocation (created by PrologFlags=Contain) is stuck at 100% CPU usage, 
maxing out a single core.
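
For reference, here is a minimal sketch of how that CPU usage could be sampled 
from /proc on Linux, assuming the PID of the extern step's process has already 
been identified (for example from top). The function names are illustrative 
only and are not part of Slurm.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so only split what follows the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, seconds, ticks_per_second):
    """CPU usage as a percentage of one core over the sampling interval."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_second * seconds)


def sample_pid(pid, seconds=5.0):
    """Sample a live process twice and return its approximate CPU usage."""
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second on this host
    with open(f"/proc/{pid}/stat") as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(seconds)
    with open(f"/proc/{pid}/stat") as f:
        after = parse_cpu_ticks(f.read())
    return cpu_percent(before, after, seconds, hz)
```

Calling sample_pid(pid) on the affected node with the PID in question returns 
a value close to 100 when that process is saturating one core.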


Please let us know which logs and/or command outputs we should provide to 
help with debugging.


Regards,
Lucio Delelis

Cloud Engineer | 
lucio.dele...@sixninesit.com