I can't speak to the exact cause, but I did find that updating my CUDA
toolkit fixed issues I saw with that a while back.
I installed:
libnvidia-compute-570-server
nvidia-cuda-toolkit
from
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/
That ends up grabbing the latest CUDA bits, which support the newer drivers.
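Roughly, the setup looks like the sketch below. The cuda-keyring step is
the usual way to enable that repo; the exact keyring filename/version is
an assumption on my part, so check the repo index first.

    # Enable the NVIDIA CUDA repo for Ubuntu 24.04
    # (keyring version may differ; check the repo index).
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    # Pull the newer compute/driver and toolkit packages mentioned above.
    sudo apt-get install -y libnvidia-compute-570-server nvidia-cuda-toolkit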
Brian Andrus
On 5/19/2025 12:50 PM, Taras Shapovalov via slurm-users wrote:
Hello,
Does anyone have an idea why "slurmd -C" crashes when it unloads
gpu_nrt.so with the latest NVIDIA drivers (570 and 575)? We checked:
there is no crash in CUDA at the moment, gpu_nvml.so works fine, all
NVML calls finish successfully, and dlclose on gpu_nvml.so works fine.
The crash does not depend on whether real GPUs are present or not.
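For reference, one quick way to compare what the two plugins pull in,
assuming a default /usr/local prefix from "make install", is to check
their dynamic dependencies:

    # Adjust the path to your --prefix; plugins land in <prefix>/lib/slurm/.
    ldd /usr/local/lib/slurm/gpu_nvml.so
    ldd /usr/local/lib/slurm/gpu_nrt.so
    # Look for "not found" entries or driver libraries that changed with 570/575.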
Steps to reproduce:
1. Install Ubuntu 24.04
2. wget https://download.schedmd.com/slurm/slurm-24.11.4.tar.bz2
3. tar fx ./slurm-24.11.4.tar.bz2
4. cd slurm-24.11.4
5. apt-get install cuda-12-8 hwloc libmunge-dev -y
6. ./configure
7. make && make install
8. Run "slurmd -C" (or sometimes "slurmd -vvv -C") to get the crash.
Stack trace:
#0  0x0000155555544b2a strlen (ld-linux-x86-64.so.2 + 0x28b2a)
#1  0x000015555551fc08 __GI__dl_exception_create (ld-linux-x86-64.so.2 + 0x3c08)
#2  0x000015555551d298 __GI__dl_signal_error (ld-linux-x86-64.so.2 + 0x1298)
#3  0x000015555551e81d _dl_close (ld-linux-x86-64.so.2 + 0x281d)
#4  0x000015555551d51c __GI__dl_catch_exception (ld-linux-x86-64.so.2 + 0x151c)
#5  0x000015555551d669 _dl_catch_error (ld-linux-x86-64.so.2 + 0x1669)
#6  0x0000155554e97c73 _dlerror_run (libc.so.6 + 0x97c73)
#7  0x0000155554e979a6 __dlclose (libc.so.6 + 0x979a6)
#8  0x0000155555388a25 gpu_plugin_fini (libslurmfull.so + 0x188a25)
#9  0x000015555538f2ef gres_get_autodetected_gpus (libslurmfull.so + 0x18f2ef)
#10 0x0000555555564828 _print_config (slurmd + 0x10828)
#11 0x0000155554e2a1ca __libc_start_call_main (libc.so.6 + 0x2a1ca)
#12 0x0000155554e2a28b __libc_start_main_impl (libc.so.6 + 0x2a28b)
#13 0x000055555555fc75 _start (slurmd + 0xbc75)
I don't really think the problem is in gpu_nrt itself; it seems the
problem is memory corruption somewhere else, but I am not sure. The
issue is reproduced consistently.
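One way to probe the memory-corruption theory would be to run the same
command under valgrind (just a sketch; memcheck should flag any earlier
invalid reads/writes before the dlclose crash):

    valgrind --tool=memcheck --track-origins=yes slurmd -C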
Best regards,
Taras
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com