On 10/11/2018 13.17, Douglas Jacobsen wrote:
> We've had issues getting sssd to work reliably on compute nodes (at least at scale). The reason is not fully understood, but basically if a connection times out, sssd will blacklist the server for 60s, which then causes those kinds of issues.

In our experience sssd doesn't work reliably in large environments if user/group enumeration is enabled (the "enumerate" config option).
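
For reference, enumeration is set per domain in sssd.conf. A minimal sketch, assuming a domain section named example.com (a placeholder, not taken from this thread):

```
# /etc/sssd/sssd.conf (fragment; the domain name is hypothetical)
[domain/example.com]
# Do not download the full user/group list from the directory;
# entries are then looked up on demand.
enumerate = false
```

As far as I know false is also the default, so the thing to check is mainly that nobody has switched it on.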

Slurm used to require enumeration, but in

https://github.com/SchedMD/slurm/commit/48a4cdf8d9433b5655a26581768200e7a696ce87

I reworked the logic so that it should only be required in a few unusual special cases. That patch was several years ago, so hopefully whatever bugs it caused have been ironed out by now (*knocking on wood*).

> Setting LaunchParameters=send_gids will sidestep this issue by doing the lookups exclusively on the controller node, where more frequent connections can prevent time-decay disconnections and reduce the likelihood of cache misses.
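
For anyone wanting to try it, the setting goes in slurm.conf. A minimal sketch, assuming no other LaunchParameters values are already set (if there are, they go in the same comma-separated list):

```
# slurm.conf (fragment)
# Resolve the user's group IDs on the controller and send them along
# with the job launch, instead of looking them up on each compute node.
LaunchParameters=send_gids
```

It is worth checking the slurm.conf man page for the installed version, since the available LaunchParameters values and their defaults differ between releases.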

This is probably a good idea, particularly if one has large parallel jobs; otherwise the nodes could DoS the AD/LDAP servers at launch if the cache is cold.


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
