We recently encountered an odd issue where some users were getting sporadic 
"permission denied" errors on certain directories when writing their 
stderr/stdout. We realized that this was caused by a change to their nested 
group permissions in AD several days earlier.

At first we thought the problem was on the compute nodes themselves, so we 
cleared the sssd cache, restarted slurmd, and even rebooted the node. This did 
not resolve the issue. The affected users were able to ssh directly onto the 
nodes and access the directories; the issue only manifested when their jobs 
went through Slurm.
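
For anyone wanting to reproduce this kind of check, here is a minimal sketch 
(assuming Python 3 is available on the compute nodes) that compares the groups 
a running process actually holds with what the node's name service reports for 
the same user. The function names compare_groups and current_process_report 
are hypothetical, chosen only for this illustration.

```python
import os
import pwd


def compare_groups(process_gids, lookup_gids):
    """Return the gids that appear in only one of the two collections."""
    process_set, lookup_set = set(process_gids), set(lookup_gids)
    return {
        # Reported by the name service but not held by the process.
        "missing_from_process": sorted(lookup_set - process_set),
        # Held by the process but no longer reported by the name service.
        "only_in_process": sorted(process_set - lookup_set),
    }


def current_process_report():
    """Compare this process's gids with a fresh name-service lookup."""
    entry = pwd.getpwuid(os.getuid())
    # Groups the kernel enforces for this process (primary gid included).
    process_gids = set(os.getgroups()) | {os.getgid()}
    # Groups the local name service (for example sssd) reports right now.
    lookup_gids = os.getgrouplist(entry.pw_name, entry.pw_gid)
    return compare_groups(process_gids, lookup_gids)


if __name__ == "__main__":
    print(current_process_report())
```

Running the script once in an ssh session and once through srun on the same 
node should show whether the two paths disagree. A non-empty 
"missing_from_process" list only under srun would be consistent with what we 
saw, though it does not by itself prove where the stale list comes from.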

We later read the following in the slurm.conf documentation:
disable_send_gids
By default the slurmctld will lookup and send the user_name and extended gids 
for a job, rather than individual on each node as part of each task launch. 
Which avoids issues around name service scalability when launching jobs 
involving many nodes. Using this option will reverse this functionality.
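
To see whether a cluster has this option set, one approach is to look at the 
LaunchParameters line in the output of "scontrol show config". The sketch below 
assumes that output uses "Key = value" lines with comma-separated values and 
"(null)" when unset; the helper names launch_parameters and send_gids_disabled 
are hypothetical.

```python
import subprocess


def launch_parameters(config_text):
    """Extract the LaunchParameters values from `scontrol show config` text."""
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "LaunchParameters":
            value = value.strip()
            # Assumption: an unset parameter is printed as "(null)" or empty.
            if value in ("", "(null)"):
                return []
            return [item.strip() for item in value.split(",")]
    return []


def send_gids_disabled(config_text):
    """True if disable_send_gids appears among the LaunchParameters."""
    return "disable_send_gids" in launch_parameters(config_text)


if __name__ == "__main__":
    output = subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True, text=True, check=True,
    ).stdout
    print("disable_send_gids set:", send_gids_disabled(output))
```

If this reports False, the default behaviour described above applies and the 
gids used by a job come from the lookup done by slurmctld.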

We checked sssd and getent on the slurmctld host for the affected users, and 
their groups were resolving correctly. The fix was to clear the sssd cache and 
restart slurmctld.

I’m wondering whether slurmctld does some kind of caching of the extended gids, 
and whether there is a better way of handling this?

Regards,


Luis Huang | Systems Administrator II, Research Computing
New York Genome Center
101 Avenue of the Americas
New York, NY 10013
O: (646) 977-7291
lhu...@nygenome.org

