Actually, you were right. Setting LLMNR=no in /etc/systemd/resolved.conf on g-vm03, which turns off link-local multicast name resolution, sped up getent hosts ougaXX significantly and solved the issue. Thanks!
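
For reference, the change amounts to something like the following sketch (the restart command is just one way to apply it; adapt to your own setup):

    # /etc/systemd/resolved.conf
    [Resolve]
    LLMNR=no

    # then reload the resolver
    systemctl restart systemd-resolved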

On 5/13/25 15:50, John Hearns via slurm-users wrote:
I think that looks OK. Forget my response.


On Tue, 13 May 2025 at 14:09, Tilman Hoffbauer via slurm-users <slurm-users@lists.schedmd.com> wrote:

    Thank you for your response. nslookup on e.g. ouga20 is instant,
    but getent hosts ouga20 takes about 1.6 seconds from g-vm03. It is
    about the same speed for ouga20 looking up g-vm03.

    Is this too slow?
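
    For anyone reproducing the comparison, the two lookup paths can be
    timed roughly like this (host names are ours; nslookup queries DNS
    directly, while getent hosts goes through NSS as configured in
    /etc/nsswitch.conf, which is where the LLMNR fallback can add delay):

        time nslookup ouga20
        time getent hosts ouga20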

    On 5/13/25 15:01, John Hearns wrote:
    Stupid response from me. A loooong time ago I had issues with
    slow response times on PBS. The cause was name resolution.

    On your setup, is name resolution OK? Can you look up host names
    without delays?

    On Tue, 13 May 2025 at 13:50, Tilman Hoffbauer via slurm-users
    <slurm-users@lists.schedmd.com> wrote:

        Hello,

        we are running a SLURM-managed cluster with one control node
        (g-vm03) and 26 worker nodes (ouga[03-28]) on Rocky 8. We
        recently updated from 20.11.9 through 23.02.8 to 24.11.0 and
        then to 24.11.5. Since then, we have been experiencing
        performance issues: squeue and scontrol ping are slow to
        respond and sometimes return "timeout on send/recv" messages,
        even with only very few parallel requests. We did not see
        these issues with SLURM 20.11.9; we did not check the
        intermediate version 23.02.8 in detail. In the slurmctld log,
        we also find messages like

        slurmctld: error: slurm_send_node_msg: [socket:[1272743]]
        slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed:
        Unexpected missing socket error

        We therefore implemented all recommendations from the
        high-throughput documentation and did see improvements
        (most notably from increasing the maximum number of open
        files and raising MessageTimeout and TCPTimeout).
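
        For concreteness, the knobs mentioned above look roughly like
        this; the values are illustrative, not our exact settings:

            # slurm.conf
            MessageTimeout=30
            TCPTimeout=15

            # systemd drop-in for slurmctld to raise the open-file limit,
            # e.g. /etc/systemd/system/slurmctld.service.d/limits.conf
            [Service]
            LimitNOFILE=65536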

        For debugging, I attached the slurm.conf, the sdiag output
        (the server thread count is almost always 1 and sometimes
        increases to 2), the slurmctld log, and the slurmdbd log from
        a period of high load.

        We would be very thankful for any input on how to restore the
        old performance.

        Kind Regards,
        Tilman Hoffbauer




-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
