[slurm-users] Rootless Docker Errors with Slurm
I am trying to integrate Rootless Docker with Slurm. I have set up Rootless Docker as per the docs (https://slurm.schedmd.com/containers.html). I have scrun.lua, oci.conf (for crun), and slurm.conf in place. "~/.config/docker/daemon.json" and "~/.config/systemd/user/docker.service.d/override.conf" are in place too. But I can't seem to get it to work:

$ docker run $DOCKER_SECURITY alpine /bin/printenv SLURM_JOB_ID
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
4abcf2066143: Pull complete
Digest: sha256:c5b1261d6d3e43071626931fc004f70149baeba2c8ec672bd4f27761f8e1ad6b
Status: Downloaded newer image for alpine:latest
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/user/1000/docker-exec/containerd/daemon/io.containerd.runtime.v2.task/moby/97e8dd767977ac03ab7af54c015c0fd5dfd26e737771b977acb7e41f799023aa/log.json: no such file or directory): /usr/bin/scrun did not terminate successfully: exit status 1: unknown.

One thing to note: if I don't use Slurm as the runtime for Docker (i.e. if I remove "~/.config/docker/daemon.json"), then Docker runs fine.
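[Editor's note] For reference, a minimal sketch of the daemon.json wiring being discussed, assuming the standard Docker custom-runtime keys and the /usr/bin/scrun path visible in the error above; this is not the poster's actual file, and the linked containers.html page documents several additional options required for rootless operation:

$ cat ~/.config/docker/daemon.json
{
  "default-runtime": "slurm",
  "runtimes": {
    "slurm": { "path": "/usr/bin/scrun" }
  }
}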
[slurm-users] Slurm With Podman - No child processes error
I have integrated Podman with Slurm as per the docs (https://slurm.schedmd.com/containers.html#podman-scrun). "podman run hello-world" runs fine, but other test runs fail:

$ podman run alpine hostname
executable file `/usr/bin/hostname` not found in $PATH: No such file or directory
srun: error: slurm1: task 0: Exited with exit code 1

$ podman run alpine printenv SLURM_JOB_ID
executable file `/usr/bin/printenv` not found in $PATH: No such file or directory
srun: error: slurm1: task 0: Exited with exit code 1
scrun: error: run_command_waitpid_timeout: waitpid(67537): No child processes

$ podman run alpine uptime
11:31:28 up 5:32, 0 users, load average: 0.00, 0.00, 0.00
scrun: error: run_command_waitpid_timeout: waitpid(68160): No child processes

I built a small image from python:alpine3.19 which just prints "hello world" and the numbers from 1 to 10. Here is a run:

$ podman run -it --rm hello-python
$ podman run -it --rm hello-python
Hello, world!
Numbers from 1 to 10: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

No error with my image. I also tested Podman on another machine without Slurm: with its default runtime it prints the hostname fine with "podman run alpine hostname". So it seems to be something to do with the integration with Slurm. What can I do to diagnose the problem?
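[Editor's note] A hedged first diagnostic step, under assumptions not stated in the post (log path and debug level): raise slurmd logging with SlurmdDebug=debug3 in slurm.conf, rerun the failing command with Podman's own debug output, and watch the node's slurmd log alongside it.

podman --log-level=debug run alpine hostname
tail -f /var/log/slurmd.log    # path is an assumption; see SlurmdLogFile in slurm.conf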
[slurm-users] RunTimeQuery never configured in oci.conf
I am using Slurm integrated with Podman. It runs the container fine and the controller daemon log always says "WEXITSTATUS 0". The container also runs successfully (it runs the Python test program with no errors). But I noticed two things:

- slurmd.log says: "error: _get_container_state: RunTimeQuery failed rc:256 output:RunTimeQuery never configured in oci.conf"
- podman --log-level=debug reports: "[conmon:d]: failed to write to /proc/self/oom_score_adj: Permission denied"

How can I diagnose further to find out what's wrong?
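[Editor's note] The first message points at a missing RunTimeQuery entry in oci.conf. As a hedged illustration only (the pattern tokens follow the rootless crun example in the Slurm containers documentation and the oci.conf man page, and may differ between Slurm versions), the entry looks roughly like this:

# oci.conf excerpt for rootless crun -- a sketch, not the poster's file
RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
# RunTimeRun, RunTimeKill and RunTimeDelete are usually configured alongside it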
[slurm-users] Which "oci.conf" to use?
I have installed Slurm and Podman, and I have set Podman's default runtime to "slurm" as per the documentation. The documentation says I need to choose one oci.conf example: https://slurm.schedmd.com/containers.html#example

Which one should I use? runc? crun? nvidia?
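[Editor's note] A hedged rule of thumb: pick the oci.conf example that matches an OCI runtime actually installed on the compute nodes. A quick check, assuming standard PATH locations:

command -v crun runc nvidia-container-runtime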
[slurm-users] Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?
I am using the latest Slurm. It runs scripts fine, but if I give it a container, it kills the job as soon as I submit it. Is Slurm cleaning up the $XDG_RUNTIME_DIR before it should? This is the log:

[2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0 TaskId=-1
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command argv[0]=/bin/sh
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command argv[1]=-c
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command argv[2]=crun --rootless=true --root=/run/user/1000/ state slurm2.acog.90.0.-1
[2024-05-15T08:00:35.167] [90.0] debug: _get_container_state: RunTimeQuery rc:256 output:error opening file `/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
[2024-05-15T08:00:35.167] [90.0] error: _get_container_state: RunTimeQuery failed rc:256 output:error opening file `/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
[2024-05-15T08:00:35.167] [90.0] debug: container already dead
[2024-05-15T08:00:35.167] [90.0] debug3: _generate_spooldir: task:0 pattern:%m/oci-job%j-%s/task-%t/ path:/var/spool/slurmd/oci-job90-0/task-0/
[2024-05-15T08:00:35.167] [90.0] debug2: _generate_patterns: StepId=90.0 TaskId=0
[2024-05-15T08:00:35.168] [90.0] debug3: _generate_spooldir: task:-1 pattern:%m/oci-job%j-%s/ path:/var/spool/slurmd/oci-job90-0/
[2024-05-15T08:00:35.168] [90.0] stepd_cleanup: done with step (rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
[2024-05-15T08:00:35.275] debug3: in the service_connection
[2024-05-15T08:00:35.278] debug2: Start processing RPC: REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug: _rpc_terminate_job: uid = 64030 JobId=90
[2024-05-15T08:00:35.278] debug: credential for job 90 revoked
[slurm-users] Re: Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?
Hi Ward,

Thanks for replying. I tried these but the error is exactly the same (everything under "/shared" has permissions 777 and is owned by "nobody:nogroup"):

/etc/slurm/slurm.conf:
JobContainerType=job_container/tmpfs
Prolog=/shared/SlurmScripts/prejob
PrologFlags=contain

/etc/slurm/job_container.conf:
# AutoBasePath=true
BasePath=/shared/BasePath

/shared/SlurmScripts/prejob:
#!/usr/bin/env bash
MY_XDG_RUNTIME_DIR=/shared/SlurmXDG
mkdir -p $MY_XDG_RUNTIME_DIR
echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"

On Wed, May 15, 2024 at 2:28 PM Ward Poelmans via slurm-users <slurm-users@lists.schedmd.com> wrote:

> Hi,
>
> This is systemd, not Slurm. We've also seen it being created and removed. As far as I understood, it is something about the session that systemd cleans up. We've worked around it by adding this to the prolog:
>
> MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
> mkdir -p $MY_XDG_RUNTIME_DIR
> echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"
>
> (in combination with a private tmpfs per job).
>
> Ward
>
> On 15/05/2024 10:14, Arnuld via slurm-users wrote:
> > I am using the latest Slurm. It runs scripts fine, but if I give it a container, it kills the job as soon as I submit it. Is Slurm cleaning up the $XDG_RUNTIME_DIR before it should? This is the log:
> >
> > [slurmd log quoted in full in the original message earlier in this thread]
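[Editor's note] A hedged variant of the workaround quoted above (my own sketch, not something posted in the thread): making the directory per-user and per-job avoids collisions between concurrent jobs. Note that the "echo export" convention is the one documented for TaskProlog (whose stdout is parsed for export lines), so which prolog hook the script runs as matters.

#!/usr/bin/env bash
# Sketch only: per-user, per-job runtime dir on a node-local tmpfs.
MY_XDG_RUNTIME_DIR=/dev/shm/${USER}/${SLURM_JOB_ID}
mkdir -p "$MY_XDG_RUNTIME_DIR"
echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"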
[slurm-users] Building Slurm debian package vs building from source
We have several nodes, most of which run different Linux distributions (distros for short). The controller has a different distro as well. The only thing the controller and all the nodes have in common is that all of them are x86_64.

I can install Slurm using the package manager on all the machines, but this will not work because the controller would end up with a different version of Slurm than the nodes (21.08 vs 23.11).

If I build from source then I see two solutions:
- build a deb package
- build a custom package (./configure, make, make install)

Building a Debian package on the controller and then distributing the binaries to the nodes won't work either, because those binaries will look for the shared libraries they were built against, and those don't exist on the nodes.

So the only solution I have is to build a static binary using a custom package. Am I correct, or is there another solution here?
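[Editor's note] For reference, a hedged sketch of the "build a deb package" option as documented by SchedMD (the same steps appear in a build log quoted later in this digest); it would have to be repeated once per distro/architecture combination:

wget https://download.schedmd.com/slurm/slurm-23.11.7.tar.bz2
tar -xaf slurm-23.11.7.tar.bz2
cd slurm-23.11.7
mk-build-deps -t "apt-get -y" -i debian/control
debuild -b -uc -us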
[slurm-users] Re: Building Slurm debian package vs building from source
> Not that I recommend it much, but you can build them for each
> environment and install the ones needed in each.

Oh cool, I will download the latest version 23.11.7 and build Debian packages on every machine then.

> A simple example is when you have nodes with and without GPUs.
> You can build slurmd packages without for those nodes and with for the
> ones that have them.

I do have non-GPU machines. I guess I need to learn to modify the Debian control files for this.

> Generally, so long as versions are compatible, they can work together.
> You will need to be aware of differences for jobs and configs, but it is
> possible.

Do you mean the versions of the dependencies are compatible? It is true for most (like munge) but might not be true for others (like yaml or http-parser). I need to check on that.

On Thu, May 23, 2024 at 1:07 AM Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote:

> Not that I recommend it much, but you can build them for each environment and install the ones needed in each.
>
> A simple example is when you have nodes with and without GPUs. You can build slurmd packages without for those nodes and with for the ones that have them.
>
> Generally, so long as versions are compatible, they can work together. You will need to be aware of differences for jobs and configs, but it is possible.
>
> Brian Andrus
>
> On 5/22/2024 12:45 AM, Arnuld via slurm-users wrote:
> > We have several nodes, most of which run different Linux distributions, and the controller has a different distro as well.
> > [rest of the original message, quoted in full earlier in this thread]
[slurm-users] Re: Building Slurm debian package vs building from source
> In fact I am more worried about how the users would benefit
> from such a mixture of execution environments
> ...SNIP

So what is an ideal setup? Keep the same .deb distro on all machines and use apt to install Slurm on every machine?

On Thu, May 23, 2024 at 10:20 AM Shunran Zhang <szh...@ngs.gen-info.osaka-u.ac.jp> wrote:

> Hi Arnuld,
>
> What I would probably do is to build one for each distro and install them either directly into /usr/local or using a deb package.
>
> The DEBIAN/control file is used by apt to manage a couple of things, such as indexing so apt search shows what this package is for, which package it could replace, which packages are recommended to be installed with it, and which packages need to be installed before this can work.
>
> For those machines with a certain brand of GPU, you would need a Slurm that is configured and compiled with that option ON, and that device driver listed in DEBIAN/control, to allow apt to check that the driver on the machine meets the requirements of your deb package. You can forget about the second part if you are not using deb packages and just compile and run Slurm on the client machine.
>
> The last thing he mentioned is about the Slurm versions. A Slurm client of a lower version (say 23.02) should be able to talk to a slurmctld of a higher version (say 23.11) just fine, though the reverse does not apply. As for dependency management, it is of such complexity that maintaining a distribution of Linux is quite some work - I know it as I am a maintainer of a Linux distro that uses dpkg packages, but without a Debian root and with a different CLI tool etc.
>
> In fact I am more worried about how the users would benefit from such a mixture of execution environments - a misstep in configuration, or a user submitting a job without specifying enough about what they are asking for, would make the user's job work or not work purely by chance of which node it got executed on and which environment the job's executables were built against. It probably needs a couple of "similar" nodes so users can benefit from the job queue sending their jobs wherever resources are available.
>
> Good luck with your setup
>
> Sincerely,
>
> S. Zhang
>
> On 2024/05/23 13:04, Arnuld via slurm-users wrote:
> > [the earlier messages in this thread, quoted in full above]
[slurm-users] Slurm Build Error
Getting this error when I run "make install":

echo >>"lib_ref.lo"
/bin/bash ../../libtool --tag=CC --mode=link gcc -DNUMA_VERSION1_COMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -o lib_ref.la lib_ref.lo -lpthread -lm -lresolv
libtool: link: ar cr .libs/lib_ref.a
libtool: link: ranlib .libs/lib_ref.a
libtool: link: ( cd ".libs" && rm -f "lib_ref.la" && ln -s "../lib_ref.la" "lib_ref.la" )
/bin/bash ../../libtool --tag=CC --mode=link gcc -DNUMA_VERSION1_COMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -export-dynamic -o sacctmgr account_functions.o archive_functions.o association_functions.o config_functions.o cluster_functions.o common.o event_functions.o federation_functions.o file_functions.o instance_functions.o runaway_job_functions.o job_functions.o reservation_functions.o resource_functions.o sacctmgr.o qos_functions.o txn_functions.o user_functions.o wckey_functions.o problem_functions.o tres_function.o -Wl,-rpath=/root/slurm-slurm-23-11-7-1/z/lib/slurm -L../../src/api/.libs -lslurmfull -export-dynamic -lreadline -lhistory lib_ref.la -lpthread -lm -lresolv
libtool: link: gcc -DNUMA_VERSION1_COMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing -o sacctmgr account_functions.o archive_functions.o association_functions.o config_functions.o cluster_functions.o common.o event_functions.o federation_functions.o file_functions.o instance_functions.o runaway_job_functions.o job_functions.o reservation_functions.o resource_functions.o sacctmgr.o qos_functions.o txn_functions.o user_functions.o wckey_functions.o problem_functions.o tres_function.o -Wl,-rpath=/root/slurm-slurm-23-11-7-1/z/lib/slurm -Wl,--export-dynamic -L../../src/api/.libs /root/slurm-slurm-23-11-7-1/src/api/.libs/libslurmfull.a -lreadline -lhistory ./.libs/lib_ref.a -lpthread -lm -lresolv -pthread
/usr/bin/ld: sacctmgr.o: warning: relocation against `_binary_usage_txt_end' in read-only section `.text'
/usr/bin/ld: sacctmgr.o: in function `_usage':
/root/slurm-slurm-23-11-7-1/src/sacctmgr/sacctmgr.c:926:(.text+0x1f): undefined reference to `_binary_usage_txt_start'
/usr/bin/ld: /root/slurm-slurm-23-11-7-1/src/sacctmgr/sacctmgr.c:926:(.text+0x26): undefined reference to `_binary_usage_txt_end'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:672: sacctmgr] Error 1
make[3]: Leaving directory '/root/slurm-slurm-23-11-7-1/src/sacctmgr'
make[2]: *** [Makefile:520: all-recursive] Error 1
make[2]: Leaving directory '/root/slurm-slurm-23-11-7-1/src'
make[1]: *** [Makefile:621: all-recursive] Error 1
make[1]: Leaving directory '/root/slurm-slurm-23-11-7-1'
make: *** [Makefile:520: all] Error 2

I used these configure options:

./configure --enable-debug --enable-pam --disable-sview --disable-shared --with-munge --with-json --with-yaml --with-http-parser --with-pmix --with-lz4 --with-hwloc --with-jwt --with-libcurl --with-freeipmi --with-rdkafka --with-bpf --prefix=/root/slurm-slurm-23-11-7-1/z
[slurm-users] error: unpack_header: protocol_version 9472 not supported
I have built Slurm 23.11.7 on two machines, both running Ubuntu 22.04. While Slurm runs fine on one machine, it does not on the second. The first machine is both a controller and a node, while the second machine is just a node. On both machines I built the Slurm Debian package as per the instructions in the Slurm docs. The slurmd logs show this:

error: unpack_header: protocol_version 9472 not supported
error: unpacking header
error: destroy_forward: no init
error: slurm_receive_msg_and_forward: [[host-4.attlocal.net]:38960] failed: Message receive failure
error: service_connection: slurm_receive_msg: Message receive failure
debug: _service_connection: incomplete message
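[Editor's note] "protocol_version ... not supported" is what a Slurm daemon prints when it receives traffic from a daemon speaking a different Slurm release, so a hedged first check is to compare what each daemon reports as its version:

sinfo --version      # on the controller
slurmd --version     # on each node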
[slurm-users] srun hostname - Socket timed out on send/recv operation
I have two machines. When I run "srun hostname" on one machine (it's both a controller and a node) then I get the hostname fine, but I get a socket timed out error in these two situations:

1) "srun hostname" on the 2nd machine (it's a node)
2) "srun -N 2 hostname" on the controller

"scontrol show node" shows both mach2 and mach4. "sinfo" shows both nodes too. Also, the job gets stuck forever in CG state after the error. Here is the output:

$ srun -N 2 hostname
mach2
srun: error: slurm_receive_msgs: [[mach4]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=.0 failed on node hpc4: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted

Output from "squeue" 3 seconds apart:

Tue Jun 11 05:09:56 2024
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
poxo hostname arnuld R 0:19 2 mach4,mach2

Tue Jun 11 05:09:59 2024
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
poxo hostname arnuld CG 0:20 1 mach4
[slurm-users] Re: srun hostname - Socket timed out on send/recv operation
I enabled "debug3" logging and saw this in the node log:

error: mpi_conf_send_stepd: unable to resolve MPI plugin offset from plugin_id=106. This error usually results from a job being submitted against an MPI plugin which was not compiled into slurmd but was for job submission command.
error: _send_slurmstepd_init: mpi_conf_send_stepd(9, 106) failed: No error

I removed the "MpiDefault" option from slurm.conf and now "srun -N2 -l hostname" returns the hostnames of all machines.

On Tue, Jun 11, 2024 at 11:05 AM Arnuld wrote:

> I have two machines. When I run "srun hostname" on one machine (it's both a controller and a node) then I get the hostname fine, but I get a socket timed out error in these two situations:
>
> [rest of the original message, quoted in full earlier in this thread]
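[Editor's note] For context, a hedged illustration of the slurm.conf knob involved (not the poster's actual file): a non-default MpiDefault only works if the matching MPI plugin was built into slurmd on every node, otherwise leaving the default and opting in per job is safer.

MpiDefault=none      # the default; jobs can still opt in with e.g. `srun --mpi=pmix`
# MpiDefault=pmix    # requires slurmd built --with-pmix on all nodes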
[slurm-users] Re: Debian RPM build for arm64?
I don't know much about Slurm, but if you want to start troubleshooting then you need to isolate the step where the error appears. From the output you have posted, it looks like you are using some automated script to download, extract and build Slurm. Look here:

"/bin/sh -c cd /tmp && wget https://download.schedmd.com/slurm/slurm-23.11.7.tar.bz2 && tar -xaf slurm-23.11.7.tar.bz2 && cd slurm-23.11.7 && mk-build-deps -t \"apt-get -o Debug::pkgProblemResolver=yes -y\" -i debian/control && debuild -b -uc -us && ..."

Here six steps have been chained together with &&. I would do these steps by hand, manually, one by one, and see where the error occurs. You might get some extra information about the error that way.

On Fri, Jun 14, 2024 at 4:11 AM Christopher Harrop - NOAA Affiliate via slurm-users wrote:

> Hello,
>
> Are the instructions for building Debian RPMs found at https://slurm.schedmd.com/quickstart_admin.html#debuild expected to work on ARM machines?
>
> I am having trouble with the "debuild -b -uc -us" step.
>
> #10 29.01 configure: exit 1
> #10 29.01 dh_auto_configure: error: cd obj-aarch64-linux-gnu && ../configure --build=aarch64-linux-gnu --prefix=/usr --includedir=\${prefix}/include --mandir=\${prefix}/share/man --infodir=\${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=\${prefix}/lib/aarch64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --sysconfdir=/etc/slurm --disable-debug --with-slurmrestd --with-pmix --enable-pam --with-systemdsystemunitdir=/lib/systemd/system/ SUCMD=/bin/su SLEEP_CMD=/bin/sleep returned exit code 1
> #10 29.01 make[1]: *** [debian/rules:21: override_dh_auto_configure] Error 25
> #10 29.01 make[1]: Leaving directory '/tmp/slurm-23.11.7'
> #10 29.02 make: *** [debian/rules:6: build] Error 2
> #10 29.02 dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
> #10 29.02 debuild: fatal error at line 1182:
> #10 29.02 dpkg-buildpackage -us -uc -ui -b failed
> #10 ERROR: process "/bin/sh -c cd /tmp && wget https://download.schedmd.com/slurm/slurm-23.11.7.tar.bz2 && tar -xaf slurm-23.11.7.tar.bz2 && cd slurm-23.11.7 && mk-build-deps -t \"apt-get -o Debug::pkgProblemResolver=yes -y\" -i debian/control && debuild -b -uc -us && cd .. && ARCH=$(dpkg --print-architecture) && dpkg --install slurm-smd_23.11.7-1_${ARCH}.deb && dpkg --install slurm-smd-client_23.11.7-1_${ARCH}.deb && dpkg --install slurm-smd-dev_23.11.7-1_${ARCH}.deb && dpkg --install slurm-smd-doc_23.11.7-1_all.deb && dpkg --install slurm-smd-libnss-slurm_23.11.7-1_${ARCH}.deb && dpkg --install slurm-smd-libpam-slurm-adopt_23.11.7-1_${ARCH}.deb && dpkg --install slurm-smd-libpmi0_23.11.7-1_${ARCH}.deb && dpkg --install slurm-smd-libpmi2-0_23.11.7-1_${ARCH}.deb && dpkg --install slurm-smd-libslurm-perl_23.11.7-1_${ARCH}.deb && dpkg --install slurm-smd-sackd_23.11.7-1_${ARCH}.deb && dpkg --install slurm-smd-sview_23.11.7-1_${ARCH}.deb" did not complete successfully: exit code: 29
>
> Chris
>
> ---
> Christopher W. Harrop
> voice: (720) 649-0316
> NOAA Global Systems Laboratory, R/GSL6
> fax: (303) 497-7259
> 325 Broadway
> Boulder, CO 80303
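[Editor's note] A hedged addition to the advice above: when the failing step is dh_auto_configure (as in the quoted log), the config.log that configure writes usually names the missing dependency; the path below is an assumption based on the build directory shown in the dh output.

cd /tmp/slurm-23.11.7
debuild -b -uc -us
less obj-aarch64-linux-gnu/config.log   # assumed location of configure's log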
[slurm-users] Can Not Use A Single GPU for Multiple Jobs
I have a machine with a quad-core CPU and an Nvidia GPU with 3500+ cores. I want to run around 10 jobs in parallel on the GPU (mostly CUDA-based jobs).

PROBLEM: Each job asks for only 100 shards (and usually runs for a minute or so), so I should be able to run 3500/100 = 35 jobs in parallel, but Slurm runs only 4 jobs in parallel, keeping the rest in the queue.

I have this in slurm.conf and gres.conf:

# GPU
GresTypes=gpu,shard
# COMPUTE NODES
PartitionName=pzero Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP
NodeName=hostgpu NodeAddr=x.x.x.x Gres=gpu:gtx_1080_ti:1,shard:3500 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=64255 State=UNKNOWN
--
Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Count=1
Name=shard Count=3500 File=/dev/nvidia0
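[Editor's note] A hedged way to observe the scheduling behaviour directly (my own sketch, not from the thread): submit a batch of small shard-only jobs and watch how many leave the queue together; the partition name and shard request reuse the values from the problem description above.

for i in $(seq 1 10); do
    sbatch --partition=pgpu --gres=shard:100 --wrap='sleep 60'
done
squeue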
[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs
> Every job will need at least 1 core just to run
> and if there are only 4 cores on the machine,
> one would expect a max of 4 jobs to run.

I have 3500+ GPU cores available. Do you mean each GPU job requires at least one CPU? Can't we run a job with just the GPU, without any CPUs? This sbatch script requires 100 GPU shards; can't we run 35 of these in parallel?

#! /usr/bin/env bash
#SBATCH --output="%j.out"
#SBATCH --error="%j.error"
#SBATCH --partition=pgpu
#SBATCH --gres=shard:100

sleep 10
echo "Current date and time: $(date +"%Y-%m-%d %H:%M:%S")"
echo "Running..."
sleep 10

On Thu, Jun 20, 2024 at 11:23 PM Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote:

> Well, if I am reading this right, it makes sense.
>
> Every job will need at least 1 core just to run and if there are only 4 cores on the machine, one would expect a max of 4 jobs to run.
>
> Brian Andrus
>
> On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote:
> > I have a machine with a quad-core CPU and an Nvidia GPU with 3500+ cores. I want to run around 10 jobs in parallel on the GPU (mostly CUDA-based jobs).
> >
> > [rest of the original message, quoted in full earlier in this thread]
[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs
> No, Slurm has to launch the batch script on compute node cores
> ...SNIP...
> Even with srun directly from a login node there's still processes that
> have to run on the compute node and those need at least a core
> (and some may need more, depending on the application).

Alright, understood.

On Sat, Jun 22, 2024 at 12:47 AM Christopher Samuel via slurm-users <slurm-users@lists.schedmd.com> wrote:

> On 6/21/24 3:50 am, Arnuld via slurm-users wrote:
>
> > I have 3500+ GPU cores available. You mean each GPU job requires at least one CPU? Can't we run a job with just GPU without any CPUs?
>
> No, Slurm has to launch the batch script on compute node cores and it then has the job of launching the user's application that will run something on the node that will access the GPU(s).
>
> Even with srun directly from a login node there's still processes that have to run on the compute node and those need at least a core (and some may need more, depending on the application).
>
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
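[Editor's note] As a hedged aside (my own addition, not something suggested in the thread): if the goal is still to run more shard jobs than there are cores, the usual slurm.conf knob is CPU oversubscription on the partition; whether it is appropriate depends on the select plugin configuration and on how CPU-hungry the jobs really are.

# slurm.conf sketch: let up to 8 jobs share each core on the GPU partition (the value 8 is an assumption)
PartitionName=pgpu Nodes=hostgpu OverSubscribe=FORCE:8 MaxTime=INFINITE State=UP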
[slurm-users] Re: error: unpack_header: protocol_version 9472 not supported
I found the problem. It was not that this node was trying to reach some other machine; it was the other way around. Some other machine (running a controller) had this node in its config, and so that controller was trying to reach this one. It was a different Slurm cluster. I removed the config from there and all is fine now.

On Wed, Jun 5, 2024 at 1:12 PM Arnuld wrote:

> I have built Slurm 23.11.7 on two machines, both running Ubuntu 22.04. While Slurm runs fine on one machine, it does not on the second.
>
> [rest of the original message, quoted in full earlier in this thread]
[slurm-users] Re: Using sharding
> On Fri, Jul 5, 2024 at 12:19 PM Ward Poelmans via slurm-users wrote:
>
> Hi Ricardo,
>
> It should show up like this:
>
> Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)

What's the meaning of (S:0-1) here?