We’re repeatedly seeing long-running node allocations eventually stop accepting SSH connections.
Our cluster uses the pam_slurm_adopt module so that users can SSH into nodes they have already allocated. However, after around 24 hours (we haven’t been able to pinpoint a more precise time frame yet), SSH via this module simply hangs until it times out, even if the allocated node is idle.

Our admin users can still access the same node without any problem, via a pam_listfile exception. Other users with allocations can still log in as well, until they hit this limit.

Something I noticed recently is that, while this is happening, the extern step for the affected allocation (created by PrologFlags=Contain) is stuck at 100% CPU usage, maxing out a single core.

Please let us know which logs and/or command outputs we should provide to help debug this further.

Regards,
Lucio Delelis
Cloud Engineer | lucio.dele...@sixninesit.com
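
For reference, the setup described above looks roughly like the fragment below. This is a simplified sketch, not a verbatim copy of our files: the allow-list path is a placeholder, the rest of the PAM stack is omitted, and the exact ordering and control flags are assumptions.

```
# slurm.conf (excerpt)
# Creates the "extern" step on each allocated node, which
# pam_slurm_adopt uses to adopt incoming SSH sessions into the job.
PrologFlags=Contain

# /etc/pam.d/sshd (excerpt, account section only)
# Admin exception: users listed in the file skip the Slurm check.
# The path below is a placeholder.
account    sufficient    pam_listfile.so item=user sense=allow onerr=fail file=/path/to/allowed_users_file
# Everyone else needs a running job on this node.
account    required      pam_slurm_adopt.so
```

With this ordering, a user in the allow list never reaches pam_slurm_adopt, which would be consistent with admins being unaffected while regular users hang.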