I see there is this exact issue: https://githubmemory.com/repo/dun/munge/issues/94. We are running Slurm 20.11.3 on Bright Cluster 8.1 on CentOS 7.9.
I found hundreds of these errors in the slurmctld log:

    error: slurm_accept_msg_conn: Too many open files in system

Then in munged.log:

    Suspended new connections while processing backlog

Also in slurmctld.log:

    Mar 7 15:40:21 node003 nslcd[7941]: [18ed80] <group/member="root"> failed to bind to LDAP server ldaps://ldapserver/: Can't contact LDAP server: Connection timed out
    Mar 7 15:40:21 node003 nslcd[7941]: [18ed80] <group/member="root"> no available LDAP server found: Can't contact LDAP server: Connection timed out
    Mar 7 15:40:30 node001 nslcd[8838]: [53fb78] <group/member="root"> connected to LDAP server ldaps://ldapserver/
    Mar 7 15:40:30 node003 nslcd[7941]: [b82726] <group/member="root"> no available LDAP server found: Server is unavailable: Broken pipe
    Mar 7 15:40:30 node003 nslcd[7941]: [b82726] <group/member="root"> no available LDAP server found: Server is unavailable: Broken pipe

It turned out / was at 100%. Yes, we should have put /var on a separate partition. As for the file descriptor setting, we have:

    cat /proc/sys/fs/file-max
    131072

Is there a way to avoid this in the future?
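For reference, this is roughly what we intend to check going forward. It is only a sketch of the idea, assuming a systemd-based setup; the drop-in path, the LimitNOFILE value, and restarting slurmctld are my own examples, not anything taken from the linked munge issue:

    # System-wide descriptor limit and current allocation (allocated, free, max)
    sysctl fs.file-max
    cat /proc/sys/fs/file-nr

    # Per-process limit and number of descriptors slurmctld currently holds
    cat /proc/$(pgrep -x slurmctld)/limits | grep 'open files'
    ls /proc/$(pgrep -x slurmctld)/fd | wc -l

    # Example systemd drop-in to raise the per-service limit
    # (e.g. /etc/systemd/system/slurmctld.service.d/limits.conf):
    #   [Service]
    #   LimitNOFILE=65536
    systemctl daemon-reload
    systemctl restart slurmctld

    # Keep an eye on the filesystem that holds /var (root in our case)
    df -h /var

The same per-process checks and LimitNOFILE drop-in would presumably apply to munge.service as well, since munged was the daemon that backed up in our case.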