On 10/11/2018 13.17, Douglas Jacobsen wrote:
> We've had issues getting sssd to work reliably on compute nodes (at
> least at scale), the reason is not fully understood, but basically if
> the connection times out with sssd it'll black list the server for 60s,
> which then causes those kinds of issues.

In our experience sssd doesn't work reliably in large environments if
user/group enumeration is enabled (the "enumerate" config option).
Slurm used to require enumeration, but in
https://github.com/SchedMD/slurm/commit/48a4cdf8d9433b5655a26581768200e7a696ce87
I reworked the logic so that it should only be required in a few unusual
cases. That patch went in several years ago, so hopefully whatever bugs
it caused have been ironed out by now (*knocking on wood*).
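
For reference, a minimal sketch of what turning enumeration off looks
like in sssd.conf. The domain name "example.com" is a placeholder, not
taken from any real setup, and a real domain section would carry other
settings as well:

```
# /etc/sssd/sssd.conf (fragment)
[domain/example.com]
# "example.com" is a hypothetical domain name.
# Do not enumerate all users/groups from the directory.
enumerate = false
```

Enumeration is off by default in sssd, so this mostly matters where an
older configuration switched it on explicitly.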

> Setting LaunchParameters=send_gids will sidestep this issue by doing the
> lookups exclusively on the controller node, where more frequent
> connections can prevent time decay disconnections and reduce the
> likelihood of cache misses.

This is probably a good idea, particularly if one has large parallel
jobs; otherwise the nodes could DoS the AD/LDAP servers at launch if the
cache is cold.
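
The setting mentioned above is a single line in slurm.conf. A minimal
sketch, assuming no other LaunchParameters values are already set
(further values would go on the same line, comma-separated):

```
# slurm.conf (fragment)
# Resolve the user's group IDs on the controller and send them along
# with the job launch, instead of looking them up on each compute node.
LaunchParameters=send_gids
```
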
--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi