I see there is this exact issue: https://githubmemory.com/repo/dun/munge/issues/94. We are running Slurm 20.11.3 on Bright Cluster 8.1 on CentOS 7.9.
I found hundreds of these errors in the slurmctld log:

    error: slurm_accept_msg_conn: Too many open files in system

Then in munged.log:

    Suspended new connections while processing backlog

Also in slurmctld.log:

    Mar 7 15:40:21 node003 nslcd[7941]: [18ed80] <group/member="root"> failed to bind to LDAP server ldaps://ldapserver/: Can't contact LDAP server: Connection timed out
    Mar 7 15:40:21 node003 nslcd[7941]: [18ed80] <group/member="root"> no available LDAP server found: Can't contact LDAP server: Connection timed out
    Mar 7 15:40:30 node001 nslcd[8838]: [53fb78] <group/member="root"> connected to LDAP server ldaps://ldapserver/
    Mar 7 15:40:30 node003 nslcd[7941]: [b82726] <group/member="root"> no available LDAP server found: Server is unavailable: Broken pipe
    Mar 7 15:40:30 node003 nslcd[7941]: [b82726] <group/member="root"> no available LDAP server found: Server is unavailable: Broken pipe

It turned out / was at 100%. Yes, we should have put /var on a separate partition. As for the file descriptor setting, we have:

    cat /proc/sys/fs/file-max
    131072

Is there a way to avoid this in the future?
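For reference, this is roughly what we intend to check going forward. It is only a sketch of the idea, assuming a systemd-based setup; the drop-in path, the LimitNOFILE value, and restarting slurmctld are my own examples, not anything taken from the linked munge issue:

    # System-wide descriptor limit and current allocation (allocated, free, max)
    sysctl fs.file-max
    cat /proc/sys/fs/file-nr

    # Per-process limit and number of descriptors slurmctld currently holds
    cat /proc/$(pgrep -x slurmctld)/limits | grep 'open files'
    ls /proc/$(pgrep -x slurmctld)/fd | wc -l

    # Example systemd drop-in to raise the per-service limit
    # (e.g. /etc/systemd/system/slurmctld.service.d/limits.conf):
    #   [Service]
    #   LimitNOFILE=65536
    systemctl daemon-reload
    systemctl restart slurmctld

    # Keep an eye on the filesystem that holds /var (root in our case)
    df -h /var

The same per-process checks and LimitNOFILE drop-in would presumably apply to munge.service as well, since munged was the daemon that backed up in our case.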