Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Christopher Benjamin Coffey Wed, 12 Jun 2019 07:39:47 -0700

Hi, you may want to look into increasing the sssd cache timeout on the nodes, 
and improving the network connectivity to your LDAP directory. I recall when 
playing with sssd in the past that it wasn't actually caching. You can verify 
with tcpdump while running "ls -l" on a directory. Once a uid/gid is resolved, 
sssd shouldn't be hitting the LDAP server again until the cache expires.
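
As a rough complement to the tcpdump check, something like the sketch below 
times repeated uid lookups through the system resolver. If caching works, the 
lookups after the first should be consistently fast. The helper name 
time_lookups is made up for illustration, and timings alone don't prove where 
an answer came from, so treat this as a hint rather than a replacement for 
watching the wire.

```python
import os
import pwd
import time


def time_lookups(lookup, key, repeats=5):
    """Call lookup(key) `repeats` times; return each call's duration in seconds.

    `time_lookups` is a hypothetical helper for this sketch, not part of sssd
    or Slurm.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            lookup(key)
        except KeyError:
            # Unknown uid/gid: the resolver was still consulted, so the
            # elapsed time is recorded all the same.
            pass
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Look up the current user's uid a few times and print each latency.
    uid = os.getuid()
    for i, seconds in enumerate(time_lookups(pwd.getpwuid, uid)):
        print(f"lookup {i}: {seconds * 1000:.2f} ms")
```

The same helper can be pointed at grp.getgrgid (after "import grp") to time 
group lookups instead.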


Do the nodes NAT through the head node?

Best,
Chris 

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 

On 6/12/19, 1:56 AM, "slurm-users on behalf of Bjørn-Helge Mevik" 
<slurm-users-boun...@lists.schedmd.com on behalf of b.h.me...@usit.uio.no> 
wrote:

    Another possible cause (we currently see it on one of our clusters):
    delays in ldap lookups.
    
    We have sssd on the machines, and occasionally, when sssd contacts the
    ldap server, it takes 5 or 10 seconds (or even 15) before it gets an
    answer.  If that happens because slurmctld is trying to look up some
    user or group, etc., client commands depending on it will hang.  The
    default message timeout is 10 seconds, so if the delay is more than
    that, you get the timeout error.
    
    We don't know why the delays are happening, but while we are debugging
    it, we've increased the MessageTimeout, which seems to have reduced the
    problem a bit.  We're also experimenting with GroupUpdateForce and
    GroupUpdateTime to reduce the number of times slurmctld needs to ask
    about groups, but I'm unsure how much that helps.
    
    -- 
    Bjørn-Helge Mevik, dr. scient,
    Department for Research Computing, University of Oslo
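
To see what the controller is currently using for the three settings named in 
the quoted message, a small sketch like the following could pick them out of 
"scontrol show config" output. It assumes that output consists of 
"Name = value" lines. The sample text and the helper name parse_settings are 
illustrative only, and the values in the sample are not recommendations.

```python
import re

# The slurm.conf parameters mentioned in the quoted message.
WANTED = ("MessageTimeout", "GroupUpdateForce", "GroupUpdateTime")


def parse_settings(config_text, names=WANTED):
    """Return {name: value} for the requested settings found in config_text.

    Assumes one "Name = value" pair per line; `parse_settings` is a
    hypothetical helper for this sketch.
    """
    settings = {}
    for line in config_text.splitlines():
        match = re.match(r"^\s*(\w+)\s*=\s*(.*?)\s*$", line)
        if match and match.group(1) in names:
            settings[match.group(1)] = match.group(2)
    return settings


# Illustrative sample only; real values come from your own cluster.
SAMPLE = """\
MessageTimeout          = 30 sec
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
"""

if __name__ == "__main__":
    for name, value in parse_settings(SAMPLE).items():
        print(f"{name}: {value}")
```

On a real cluster the text would come from running "scontrol show config" 
(for example via subprocess.run with capture_output=True) rather than from 
the sample string.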
