[slurm-users] Rootless Docker Errors with Slurm

2024-05-06 Thread ARNULD via slurm-users
I am trying to integrate Rootless Docker with Slurm. I have set up Rootless
Docker as per the docs (https://slurm.schedmd.com/containers.html). I have
scrun.lua, oci.conf (for crun), and slurm.conf in place.
"~/.config/docker/daemon.json" and
"~/.config/systemd/user/docker.service.d/override.conf" are in place too.
But I can't seem to get it to work:

$ docker run $DOCKER_SECURITY alpine /bin/printenv SLURM_JOB_ID
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
4abcf2066143: Pull complete
Digest:
sha256:c5b1261d6d3e43071626931fc004f70149baeba2c8ec672bd4f27761f8e1ad6b
Status: Downloaded newer image for alpine:latest
docker: Error response from daemon: failed to create task for container:
failed to create shim task: OCI runtime create failed: unable to retrieve
OCI runtime error (open
/run/user/1000/docker-exec/containerd/daemon/io.containerd.runtime.v2.task/moby/97e8dd767977ac03ab7af54c015c0fd5dfd26e737771b977acb7e41f799023aa/log.json:
no such file or directory): /usr/bin/scrun did not terminate successfully:
exit status 1: unknown.

One thing to note: if I don't use Slurm as the runtime for Docker (i.e. if I
remove "~/.config/docker/daemon.json"), then Docker runs fine.
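For reference, the daemon.json follows the rootless Docker example in
containers.html, roughly like the sketch below (the keys/values are from that
documented example, not necessarily my file verbatim):

{
  "experimental": true,
  "default-runtime": "slurm",
  "runtimes": { "slurm": { "path": "/usr/bin/scrun" } },
  "data-root": "/run/user/1000/docker/",
  "exec-root": "/run/user/1000/docker-exec/"
}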



[slurm-users] Slurm With Podman - No child processes error

2024-05-08 Thread ARNULD via slurm-users
I have integrated Podman with Slurm as per the docs
(https://slurm.schedmd.com/containers.html#podman-scrun), and when I do a
test run:

"podman run hello-world" (this runs fine)


$ podman run alpine hostname
executable file `/usr/bin/hostname` not found in $PATH: No such file or
directory
srun: error: slurm1: task 0: Exited with exit code 1
-
$ podman run alpine printenv SLURM_JOB_ID
executable file `/usr/bin/printenv` not found in $PATH: No such file or
directory
srun: error: slurm1: task 0: Exited with exit code 1
scrun: error: run_command_waitpid_timeout: waitpid(67537): No child
processes
---
podman run alpine uptime
 11:31:28 up  5:32,  0 users,  load average: 0.00, 0.00, 0.00
scrun: error: run_command_waitpid_timeout: waitpid(68160): No child
processes
--

I built a small image from python:alpine3.19 which just prints "hello
world" and numbers from 1 to 10. Here is a run:

$ podman run -it --rm hello-python
$ podman run -it --rm hello-python
Hello, world!
Numbers from 1 to 10: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


No error with my image. I also tested Podman on another machine without
Slurm: with its default runtime, Podman prints the hostname fine with
"podman run alpine hostname". So it is something to do with the integration
with Slurm.

What can I do to diagnose the problem?
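For reference, the runtime wiring follows the documented podman-scrun
example: ~/.config/containers/containers.conf points Podman at scrun,
roughly like this sketch (the documented example sets more options, and my
exact file may differ):

# sketch based on containers.html#podman-scrun; paths are the documented defaults
[engine]
runtime = "slurm"

[engine.runtimes]
slurm = ["/usr/local/bin/scrun", "/usr/bin/scrun"]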



[slurm-users] RunTimeQuery never configured in oci.conf

2024-05-10 Thread Arnuld via slurm-users
I am using Slurm integrated with Podman. It runs the container fine and the
controller daemon log always says "WEXITSTATUS 0". The container also runs
successfully (it runs the Python test program with no errors).

But there are two things that I noticed:

 - slurmd.log says: " error: _get_container_state: RunTimeQuery failed
rc:256 output:RunTimeQuery never configured in oci.conf"

 - podman --log-level=debug reports: "[conmon:d]: failed to write to
/proc/self/oom_score_adj: Permission denied"

How can I diagnose further to know what's wrong?
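For context, the crun example oci.conf in containers.html defines the runtime
commands roughly as below (a sketch of the documented pattern; the %-tokens
are oci.conf substitutions):

# from the documented crun example (sketch); %U = user id, %n.%u.%j.%s.%t = container id, %b = bundle path
RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t"
RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t SIGTERM"
RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="crun --rootless=true --root=/run/user/%U/ run --bundle %b %n.%u.%j.%s.%t"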



[slurm-users] Which "oci.conf" to use?

2024-05-13 Thread Arnuld via slurm-users
I have installed Slurm and Podman. I have replaced Podman's default runtime
with "slurm" as per the documentation. The documentation says I need to
choose one oci.conf:

https://slurm.schedmd.com/containers.html#example

Which one should I use? runc? crun? nvidia?



[slurm-users] Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?

2024-05-15 Thread Arnuld via slurm-users
I am using the latest Slurm. It runs fine for scripts, but if I give it a
container then it kills it as soon as I submit the job. Is Slurm cleaning
up the $XDG_RUNTIME_DIR before it should? This is the log:

[2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0
TaskId=-1
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[0]=/bin/sh
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[1]=-c
[2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
argv[2]=crun --rootless=true --root=/run/user/1000/ state
slurm2.acog.90.0.-1
[2024-05-15T08:00:35.167] [90.0] debug:  _get_container_state: RunTimeQuery
rc:256 output:error opening file
`/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory

[2024-05-15T08:00:35.167] [90.0] error: _get_container_state: RunTimeQuery
failed rc:256 output:error opening file
`/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory

[2024-05-15T08:00:35.167] [90.0] debug:  container already dead
[2024-05-15T08:00:35.167] [90.0] debug3: _generate_spooldir: task:0
pattern:%m/oci-job%j-%s/task-%t/ path:/var/spool/slurmd/oci-job90-0/task-0/
[2024-05-15T08:00:35.167] [90.0] debug2: _generate_patterns: StepId=90.0
TaskId=0
[2024-05-15T08:00:35.168] [90.0] debug3: _generate_spooldir: task:-1
pattern:%m/oci-job%j-%s/ path:/var/spool/slurmd/oci-job90-0/
[2024-05-15T08:00:35.168] [90.0] stepd_cleanup: done with step
(rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
[2024-05-15T08:00:35.275] debug3: in the service_connection
[2024-05-15T08:00:35.278] debug2: Start processing RPC:
REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2024-05-15T08:00:35.278] debug:  _rpc_terminate_job: uid = 64030 JobId=90
[2024-05-15T08:00:35.278] debug:  credential for job 90 revoked



[slurm-users] Re: Slurm Cleaning Up $XDG_RUNTIME_DIR Before It Should?

2024-05-15 Thread Arnuld via slurm-users
Hi Ward,

Thanks for replying. I tried these but the error is exactly the same
(everything under "/shared" has permissions 777 and is owned by
"nobody:nogroup"):

/etc/slurm/slurm.conf
JobContainerType=job_container/tmpfs
Prolog=/shared/SlurmScripts/prejob
PrologFlags=contain

/etc/slurm/job_container.conf
#
AutoBasePath=true
BasePath=/shared/BasePath

/shared/SlurmScripts/prejob
#!/usr/bin/env bash
MY_XDG_RUNTIME_DIR=/shared/SlurmXDG
mkdir -p $MY_XDG_RUNTIME_DIR
echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"



On Wed, May 15, 2024 at 2:28 PM Ward Poelmans via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hi,
>
> This is systemd, not Slurm. We've also seen it being created and removed.
> As far as I understood, it is something about the session that systemd
> cleans up. We've worked around it by adding this to the prolog:
>
> MY_XDG_RUNTIME_DIR=/dev/shm/${USER}
> mkdir -p $MY_XDG_RUNTIME_DIR
> echo "export XDG_RUNTIME_DIR=$MY_XDG_RUNTIME_DIR"
>
> (in combination with private tmpfs per job).
>
> Ward
>
> On 15/05/2024 10:14, Arnuld via slurm-users wrote:
> > I am using the latest slurm. It  runs fine for scripts. But if I give it
> a container then it kills it as soon as I submit the job. Is slurm cleaning
> up the $XDG_RUNTIME_DIR before it should?  This is the log:
> >
> > [2024-05-15T08:00:35.143] [90.0] debug2: _generate_patterns: StepId=90.0
> TaskId=-1
> > [2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
> argv[0]=/bin/sh
> > [2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
> argv[1]=-c
> > [2024-05-15T08:00:35.143] [90.0] debug3: _get_container_state: command
> argv[2]=crun --rootless=true --root=/run/user/1000/ state
> slurm2.acog.90.0.-1
> > [2024-05-15T08:00:35.167] [90.0] debug:  _get_container_state:
> RunTimeQuery rc:256 output:error opening file
> `/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
> >
> > [2024-05-15T08:00:35.167] [90.0] error: _get_container_state:
> RunTimeQuery failed rc:256 output:error opening file
> `/run/user/1000/slurm2.acog.90.0.-1/status`: No such file or directory
> >
> > [2024-05-15T08:00:35.167] [90.0] debug:  container already dead
> > [2024-05-15T08:00:35.167] [90.0] debug3: _generate_spooldir: task:0
> pattern:%m/oci-job%j-%s/task-%t/ path:/var/spool/slurmd/oci-job90-0/task-0/
> > [2024-05-15T08:00:35.167] [90.0] debug2: _generate_patterns: StepId=90.0
> TaskId=0
> > [2024-05-15T08:00:35.168] [90.0] debug3: _generate_spooldir: task:-1
> pattern:%m/oci-job%j-%s/ path:/var/spool/slurmd/oci-job90-0/
> > [2024-05-15T08:00:35.168] [90.0] stepd_cleanup: done with step
> (rc[0x100]:Unknown error 256, cleanup_rc[0x0]:No error)
> > [2024-05-15T08:00:35.275] debug3: in the service_connection
> > [2024-05-15T08:00:35.278] debug2: Start processing RPC:
> REQUEST_TERMINATE_JOB
> > [2024-05-15T08:00:35.278] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> > [2024-05-15T08:00:35.278] debug:  _rpc_terminate_job: uid = 64030
> JobId=90
> > [2024-05-15T08:00:35.278] debug:  credential for job 90 revoked
> >
> >
> >
>
>



[slurm-users] Building Slurm debian package vs building from source

2024-05-22 Thread Arnuld via slurm-users
We have several nodes, most of which have different Linux distributions
(distro for short). The controller has a different distro as well. The only
thing the controller and all the nodes have in common is that they are all
x86_64.

I can install Slurm using the package manager on all the machines, but this
will not work because the controller will have a different version of Slurm
than the nodes (21.08 vs 23.11).

If I build from source then I see two solutions:
 - build a deb package
 - build a custom package (./configure, make, make install)
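Roughly, under the documented instructions those two boil down to something
like this (a sketch; versions and paths here are just examples):

# deb route (quickstart_admin.html#debuild)
tar -xaf slurm-23.11.7.tar.bz2 && cd slurm-23.11.7
mk-build-deps -t "apt-get -y" -i debian/control
debuild -b -uc -us        # produces the slurm-smd*.deb packages one level up

# plain source route
./configure --prefix=/opt/slurm --sysconfdir=/etc/slurm
make -j$(nproc) && make install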

Building a Debian package on the controller and then distributing the
binaries to the nodes won't work either, because those binaries will look
for the shared libraries they were built against, and those don't exist on
the nodes.

So the only solution I have is to build a static binary using a custom
package. Am I correct or is there another solution here?



[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-22 Thread Arnuld via slurm-users
> Not that I recommend it much, but you can build them for each
> environment and install the ones needed in each.

Oh cool, I will download the latest version 23.11.7 and build Debian
packages on every machine then.


> A simple example is when you have nodes with and without GPUs.
> You can build slurmd packages without for those nodes and with for the
> ones that have them.

I do have non-GPU machines. I guess I need to learn to modify the Debian
control files for this.


> Generally, so long as versions are compatible, they can work together.
> You will need to be aware of differences for jobs and configs, but it is
> possible.

You mean the versions of the dependencies are compatible? That is true for
most (like munge) but might not be true for others (like yaml or
http-parser). I need to check on that.


On Thu, May 23, 2024 at 1:07 AM Brian Andrus via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Not that I recommend it much, but you can build them for each
> environment and install the ones needed in each.
>
> A simple example is when you have nodes with and without GPUs.
> You can build slurmd packages without for those nodes and with for the
> ones that have them.
>
> Generally, so long as versions are compatible, they can work together.
> You will need to be aware of differences for jobs and configs, but it is
> possible.
>
> Brian Andrus
>
> On 5/22/2024 12:45 AM, Arnuld via slurm-users wrote:
> > We have several nodes, most of which have different Linux
> > distributions (distro for short). Controller has a different distro as
> > well. The only common thing between controller and all the does is
> > that all of them ar x86_64.
> >
> > I can install Slurm using package manager on all the machines but this
> > will not work because controller will have a different version of
> > Slurm compared to the nodes (21.08 vs 23.11)
> >
> > If I build from source then I see two solutions:
> >  - build a deb package
> >  - build a custom package (./configure, make, make install)
> >
> > Building a debian package on the controller and then distributing the
> > binaries on nodes won't work either because that binary will start
> > looking for the shared libraries that it was built for and those don't
> > exist on the nodes.
> >
> > So the only solution I have is to build a static binary using a custom
> > package. Am I correct or is there another solution here?
> >
>



[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-22 Thread Arnuld via slurm-users
> In fact I am more worried about how the users would benefit
> from such a mixture of execution environments
> ...SNIP

So what is an ideal setup? Keep the same deb-based distro on all machines
and use apt to install Slurm on every machine?

On Thu, May 23, 2024 at 10:20 AM Shunran Zhang <
szh...@ngs.gen-info.osaka-u.ac.jp> wrote:

> Hi Arnuld,
>
> What I would probably do is to build one for each distro and install them
> either directly into /usr/local or using deb package.
>
> The DEBIAN/control is used by apt to manage a couple of things, such as
> indexing so apt search shows what this package is for, which package it
> could replace, which packages are recommended to be installed with it, and
> which packages need to be installed before this can work.
>
> For those machines with a certain brand of GPU, you would need a Slurm
> that is configured and compiled with that option ON, and the device driver
> listed in DEBIAN/control so that apt can check that the driver on the
> machine meets the requirements of your deb package. You can forget about
> the second part if you are not using deb packages and just compile and run
> Slurm on the client machine.
>
> The last thing he mentioned is about the Slurm versions. A Slurm client of
> a lower version (say 23.02) should be able to talk to a slurmctld of a
> higher version (say 23.11) just fine, though the reverse does not apply. As
> for dependency management, it is complex enough that maintaining a Linux
> distribution is quite some work - I know, as I am a maintainer of a Linux
> distro that uses dpkg packages but has no Debian root and uses a different
> cli tool, etc.
>
> In fact I am more worried about how the users would benefit from such a
> mixture of execution environments - a misstep in configuration, or a user
> submitting a job without specifying enough info about what they ask for,
> would probably make the user's job work or not work purely by chance,
> depending on which node it got executed on and which environment the job's
> executables were built against. It probably needs a couple of "similar"
> nodes to allow users to benefit from the job queue by sending their jobs
> wherever resources are available.
>
> Good luck with your setup
>
> Sincerely,
>
> S. Zhang
> On 2024/05/23 13:04, Arnuld via slurm-users wrote:
>
> > Not that I recommend it much, but you can build them for each
> > environment and install the ones needed in each.
>
> Oh cool, I will download the latest version 23.11.7 and build debian
> packages on every machine then
>
>
> > A simple example is when you have nodes with and without GPUs.
> > You can build slurmd packages without for those nodes and with for the
> > ones that have them.
>
> I do have non-gpu machines.  I guess I need to learn to modify the debian
> Control files for this
>
>
> > Generally, so long as versions are compatible, they can work together.
> > You will need to be aware of differences for jobs and configs, but it is
> > possible.
>
> you mean the versions of the dependencies are compatible?  It  is true for
> most (like munge) but might not be true for others like (yaml or
> http-parser). I need to check on that.
>
>
> On Thu, May 23, 2024 at 1:07 AM Brian Andrus via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> Not that I recommend it much, but you can build them for each
>> environment and install the ones needed in each.
>>
>> A simple example is when you have nodes with and without GPUs.
>> You can build slurmd packages without for those nodes and with for the
>> ones that have them.
>>
>> Generally, so long as versions are compatible, they can work together.
>> You will need to be aware of differences for jobs and configs, but it is
>> possible.
>>
>> Brian Andrus
>>
>> On 5/22/2024 12:45 AM, Arnuld via slurm-users wrote:
>> > We have several nodes, most of which have different Linux
>> > distributions (distro for short). Controller has a different distro as
>> > well. The only common thing between controller and all the does is
>> > that all of them ar x86_64.
>> >
>> > I can install Slurm using package manager on all the machines but this
>> > will not work because controller will have a different version of
>> > Slurm compared to the nodes (21.08 vs 23.11)
>> >
>> > If I build from source then I see two solutions:
>> >  - build a deb package
>> >  - build a custom package (./configure, make, make install)
>> >
>> > Building a debian package on the controller and then distributing the
>> > binaries on nodes wo

[slurm-users] Slurm Build Error

2024-05-23 Thread Arnuld via slurm-users
Getting this error when I run "make install":

echo >>"lib_ref.lo"
/bin/bash ../../libtool  --tag=CC   --mode=link gcc
 -DNUMA_VERSION1_COMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread
-ggdb3 -Wall -g -O1 -fno-strict-aliasing   -o lib_ref.la   lib_ref.lo
-lpthread -lm -lresolv
libtool: link: ar cr .libs/lib_ref.a
libtool: link: ranlib .libs/lib_ref.a
libtool: link: ( cd ".libs" && rm -f "lib_ref.la" && ln -s "../lib_ref.la" "
lib_ref.la" )
/bin/bash ../../libtool  --tag=CC   --mode=link gcc
 -DNUMA_VERSION1_COMPATIBILITY -g -O2 -fno-omit-frame-pointer -pthread
-ggdb3 -Wall -g -O1 -fno-strict-aliasing -export-dynamic   -o sacctmgr
account_functions.o archive_functions.o association_functions.o
config_functions.o cluster_functions.o common.o event_functions.o
federation_functions.o file_functions.o instance_functions.o
runaway_job_functions.o job_functions.o reservation_functions.o
resource_functions.o sacctmgr.o qos_functions.o txn_functions.o
user_functions.o wckey_functions.o problem_functions.o tres_function.o
-Wl,-rpath=/root/slurm-slurm-23-11-7-1/z/lib/slurm -L../../src/api/.libs
-lslurmfull -export-dynamic -lreadline -lhistory  lib_ref.la -lpthread -lm
-lresolv
libtool: link: gcc -DNUMA_VERSION1_COMPATIBILITY -g -O2
-fno-omit-frame-pointer -pthread -ggdb3 -Wall -g -O1 -fno-strict-aliasing
-o sacctmgr account_functions.o archive_functions.o association_functions.o
config_functions.o cluster_functions.o common.o event_functions.o
federation_functions.o file_functions.o instance_functions.o
runaway_job_functions.o job_functions.o reservation_functions.o
resource_functions.o sacctmgr.o qos_functions.o txn_functions.o
user_functions.o wckey_functions.o problem_functions.o tres_function.o
-Wl,-rpath=/root/slurm-slurm-23-11-7-1/z/lib/slurm -Wl,--export-dynamic
 -L../../src/api/.libs
/root/slurm-slurm-23-11-7-1/src/api/.libs/libslurmfull.a -lreadline
-lhistory ./.libs/lib_ref.a -lpthread -lm -lresolv -pthread
/usr/bin/ld: sacctmgr.o: warning: relocation against
`_binary_usage_txt_end' in read-only section `.text'
/usr/bin/ld: sacctmgr.o: in function `_usage':
/root/slurm-slurm-23-11-7-1/src/sacctmgr/sacctmgr.c:926:(.text+0x1f):
undefined reference to `_binary_usage_txt_start'
/usr/bin/ld:
/root/slurm-slurm-23-11-7-1/src/sacctmgr/sacctmgr.c:926:(.text+0x26):
undefined reference to `_binary_usage_txt_end'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:672: sacctmgr] Error 1
make[3]: Leaving directory '/root/slurm-slurm-23-11-7-1/src/sacctmgr'
make[2]: *** [Makefile:520: all-recursive] Error 1
make[2]: Leaving directory '/root/slurm-slurm-23-11-7-1/src'
make[1]: *** [Makefile:621: all-recursive] Error 1
make[1]: Leaving directory '/root/slurm-slurm-23-11-7-1'
make: *** [Makefile:520: all] Error 2
 
-

I used these config options:

./configure --enable-debug --enable-pam --disable-sview --disable-shared
--with-munge --with-json --with-yaml --with-http-parser --with-pmix
--with-lz4 --with-hwloc --with-jwt --with-libcurl --with-freeipmi
--with-rdkafka --with-bpf --prefix=/root/slurm-slurm-23-11-7-1/z



[slurm-users] error: unpack_header: protocol_version 9472 not supported

2024-06-05 Thread Arnuld via slurm-users
I have built Slurm 23.11.7 on two machines. Both are running Ubuntu 22.04.
While Slurm runs fine on one machine, on the 2nd machine it does not. First
machine is both a controller and a node while the 2nd machine is just a
node. On both machines, I built the Slurm debian package as per the Slurm
docs instructions. Slurmd logs show this:

 error: unpack_header: protocol_version 9472 not supported
 error: unpacking header
 error: destroy_forward: no init
 error: slurm_receive_msg_and_forward: [[host-4.attlocal.net]:38960]
failed: Message receive failure
 error: service_connection: slurm_receive_msg: Message receive failure
 debug:  _service_connection: incomplete message



[slurm-users] srun hostname - Socket timed out on send/recv operation

2024-06-10 Thread Arnuld via slurm-users
I have two machines. When I run "srun hostname" on one machine (it's both a
controller and a node) then I get the hostname fine, but I get a socket
timed out error in these two situations:

1) "srun hostname" on 2nd machine (it's a node)
2) "srun -N 2 hostname" on controller

"scontrol show node" shows both mach2 and mach4. "sinfo" shows both nodes
too.  Also the job gets stuck forever in CG state after the error. Here is
the output:

$ srun -N 2 hostname
mach2
srun: error: slurm_receive_msgs: [[mach4]:6818] failed: Socket timed out on
send/recv operation
srun: error: Task launch for StepId=.0 failed on node hpc4: Socket
timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv
operation
srun: Job step aborted


Output from "squeue", 3 seconds apart:

Tue Jun 11 05:09:56 2024
 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)
   poxo hostname   arnuld  R   0:19  2
mach4,mach2

Tue Jun 11 05:09:59 2024
 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)
   poxo hostname   arnuld CG   0:20  1 mach4



[slurm-users] Re: srun hostname - Socket timed out on send/recv operation

2024-06-11 Thread Arnuld via slurm-users
I enabled "debug3" logging and saw this in the node log:

error: mpi_conf_send_stepd: unable to resolve MPI plugin offset from
plugin_id=106. This error usually results from a job being submitted
against an MPI plugin which was not compiled into slurmd but was for job
submission command.
error: _send_slurmstepd_init: mpi_conf_send_stepd(9, 106) failed: No error

I removed the "MpiDefault" option from slurm.conf and now "srun -N2 -l
hostname" returns the hostnames of all machines.



On Tue, Jun 11, 2024 at 11:05 AM Arnuld  wrote:

> I have two machines. When I run "srun hostname" on one machine (it's both
> a controller and a node) then I get the hostname fine but I get socket
> timed out error in these two situations:
>
> 1) "srun hostname" on 2nd machine (it's a node)
> 2) "srun -N 2 hostname" on controller
>
> "scontrol show node" shows both mach2 and mach4. "sinfo" shows both nodes
> too.  Also the job gets stuck forever in CG state after the error. Here is
> the output:
>
> $ srun -N 2 hostname
> mach2
> srun: error: slurm_receive_msgs: [[mach4]:6818] failed: Socket timed out
> on send/recv operation
> srun: error: Task launch for StepId=.0 failed on node hpc4: Socket
> timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv
> operation
> srun: Job step aborted
>
>
> Output from "squeue", 3 seconds apart:
>
> Tue Jun 11 05:09:56 2024
>  JOBID PARTITION NAME USER ST   TIME  NODES
> NODELIST(REASON)
>    poxo hostname   arnuld  R   0:19  2
> mach4,mach2
>
> Tue Jun 11 05:09:59 2024
>  JOBID PARTITION NAME USER ST   TIME  NODES
> NODELIST(REASON)
>    poxo hostname   arnuld CG   0:20  1 mach4
>
>



[slurm-users] Re: Debian RPM build for arm64?

2024-06-13 Thread Arnuld via slurm-users
I don't know much about Slurm, but if you want to start troubleshooting then
you need to isolate the step where the error appears. From the output you
have posted, it looks like you are using some automated script to download,
extract, and build Slurm. Look here:

 "/bin/sh -c cd /tmp  && wget
https://download.schedmd.com/slurm/slurm-23.11.7.tar.bz2  && tar -xaf
slurm-23.11.7.tar.bz2  && cd slurm-23.11.7  && mk-build-deps -t \"apt-get
-o Debug::pkgProblemResolver=yes -y\" -i debian/control  && debuild -b -uc
-us  && .

Here six steps have been combined with &&. I would do these steps by hand,
one at a time, and see where the error occurs. You might get some extra
information about the error this way.
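Something like this, i.e. the same commands from the output above, just run
one at a time so you can see exactly which step fails and read its full
output:

cd /tmp
wget https://download.schedmd.com/slurm/slurm-23.11.7.tar.bz2
tar -xaf slurm-23.11.7.tar.bz2
cd slurm-23.11.7
mk-build-deps -t "apt-get -o Debug::pkgProblemResolver=yes -y" -i debian/control
debuild -b -uc -us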



On Fri, Jun 14, 2024 at 4:11 AM Christopher Harrop - NOAA Affiliate via
slurm-users  wrote:

> Hello,
>
> Are the instructions for building Debian RPMs found at
> https://slurm.schedmd.com/quickstart_admin.html#debuild expected to work
> on ARM machines?
>
> I am having trouble with the "debuild -b -uc -us” step.
>
> #10 29.01 configure: exit 1
> #10 29.01 dh_auto_configure: error: cd obj-aarch64-linux-gnu &&
> ../configure --build=aarch64-linux-gnu --prefix=/usr
> --includedir=\${prefix}/include --mandir=\${prefix}/share/man
> --infodir=\${prefix}/share/info --sysconfdir=/etc --localstatedir=/var
> --disable-silent-rules --libdir=\${prefix}/lib/aarch64-linux-gnu
> --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking
> --sysconfdir=/etc/slurm --disable-debug --with-slurmrestd --with-pmix
> --enable-pam --with-systemdsystemunitdir=/lib/systemd/system/ SUCMD=/bin/su
> SLEEP_CMD=/bin/sleep returned exit code 1
> #10 29.01 make[1]: *** [debian/rules:21: override_dh_auto_configure] Error
> 25
> #10 29.01 make[1]: Leaving directory '/tmp/slurm-23.11.7'
> #10 29.02 make: *** [debian/rules:6: build] Error 2
> #10 29.02 dpkg-buildpackage: error: debian/rules build subprocess returned
> exit status 2
> #10 29.02 debuild: fatal error at line 1182:
> #10 29.02 dpkg-buildpackage -us -uc -ui -b failed
> #10 ERROR: process "/bin/sh -c cd /tmp  && wget
> https://download.schedmd.com/slurm/slurm-23.11.7.tar.bz2  && tar -xaf
> slurm-23.11.7.tar.bz2  && cd slurm-23.11.7  && mk-build-deps -t \"apt-get
> -o Debug::pkgProblemResolver=yes -y\" -i debian/control  && debuild -b -uc
> -us  && cd ..  && ARCH=$(dpkg --print-architecture)  && dpkg --install
> slurm-smd_23.11.7-1_${ARCH}.deb  && dpkg --install
> slurm-smd-client_23.11.7-1_${ARCH}.deb  && dpkg --install
> slurm-smd-dev_23.11.7-1_${ARCH}.deb  && dpkg --install
> slurm-smd-doc_23.11.7-1_all.deb  && dpkg --install
> slurm-smd-libnss-slurm_23.11.7-1_${ARCH}.deb  && dpkg --install
> slurm-smd-libpam-slurm-adopt_23.11.7-1_${ARCH}.deb  && dpkg --install
> slurm-smd-libpmi0_23.11.7-1_${ARCH}.deb  && dpkg --install
> slurm-smd-libpmi2-0_23.11.7-1_${ARCH}.deb  && dpkg --install
> slurm-smd-libslurm-perl_23.11.7-1_${ARCH}.deb  && dpkg --install
> slurm-smd-sackd_23.11.7-1_${ARCH}.deb  && dpkg --install
> slurm-smd-sview_23.11.7-1_${ARCH}.deb" did not complete successfully: exit
> code: 29
>
> Chris
>
> ---
> Christopher W. Harrop
>   voice: (720) 649-0316
> NOAA Global Systems Laboratory, R/GSL6  fax: (303)
> 497-7259
> 325 Broadway
> Boulder, CO 80303
>
>



[slurm-users] Can Not Use A Single GPU for Multiple Jobs

2024-06-20 Thread Arnuld via slurm-users
I have a machine with a quad-core CPU and an Nvidia GPU with 3500+ cores.
I want to run around 10 jobs in parallel on the GPU (they are mostly
CUDA-based jobs).

PROBLEM: Each job asks for only 100 shards (and usually runs for a minute or
so), so I should be able to run 3500/100 = 35 jobs in parallel, but Slurm
runs only 4 jobs in parallel, keeping the rest in the queue.

I have this in slurm.conf and gres.conf:

# GPU
GresTypes=gpu,shard
# COMPUTE NODES
PartitionName=pzero Nodes=ALL Default=YES MaxTime=INFINITE State=UP
PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP
NodeName=hostgpu NodeAddr=x.x.x.x Gres=gpu:gtx_1080_ti:1,shard:3500 CPUs=4
Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1
RealMemory=64255 State=UNKNOWN
--
Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Count=1
Name=shard Count=3500  File=/dev/nvidia0
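For reference, the node's view of these GRES can be checked with standard
commands like the following (exact output varies by Slurm version):

scontrol show node hostgpu | grep -i gres    # shows the configured Gres line for the node
sinfo -N -o "%N %G"                          # per-node GRES summary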



[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-21 Thread Arnuld via slurm-users
> Every job will need at least 1 core just to run
> and if there are only 4 cores on the machine,
> one would expect a max of 4 jobs to run.

I have 3500+ GPU cores available. You mean each GPU job requires at least
one CPU? Can't we run a job with just the GPU and no CPUs? This sbatch
script requires 100 GPU cores; can't we run 35 in parallel?

#! /usr/bin/env bash

#SBATCH --output="%j.out"
#SBATCH --error="%j.error"
#SBATCH --partition=pgpu
#SBATCH --gres=shard:100

sleep 10
echo "Current date and time: $(date +"%Y-%m-%d %H:%M:%S")"
echo "Running..."
sleep 10






On Thu, Jun 20, 2024 at 11:23 PM Brian Andrus via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Well, if I am reading this right, it makes sense.
>
> Every job will need at least 1 core just to run and if there are only 4
> cores on the machine, one would expect a max of 4 jobs to run.
>
> Brian Andrus
>
> On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote:
> > I have a machine with a quad-core CPU and an Nvidia GPU with 3500+
> > cores.  I want to run around 10 jobs in parallel on the GPU (mostly
> > are CUDA based jobs).
> >
> > PROBLEM: Each job asks for only 100 shards (runs usually for a minute
> > or so), then I should be able to run 3500/100 = 35 jobs in
> > parallel but slurm runs only 4 jobs in parallel keeping the rest in
> > the queue.
> >
> > I have this in slurm.conf and gres.conf:
> >
> > # GPU
> > GresTypes=gpu,shard
> > # COMPUTE NODES
> > PartitionName=pzero Nodes=ALL Default=YES MaxTime=INFINITE State=UP`
> > PartitionName=pgpu Nodes=hostgpu MaxTime=INFINITE State=UP
> > NodeName=hostgpu NodeAddr=x.x.x.x Gres=gpu:gtx_1080_ti:1,shard:3500
> > CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1
> > RealMemory=64255 State=UNKNOWN
> > --
> > Name=gpu Type=gtx_1080_ti File=/dev/nvidia0 Count=1
> > Name=shard Count=3500  File=/dev/nvidia0
> >
> >
> >
>



[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-23 Thread Arnuld via slurm-users
> No, Slurm has to launch the batch script on compute node cores
> ... SNIP...
> Even with srun directly from a login node there's still processes that
> have to run on the compute node and those need at least a core
>  (and some may need more, depending on the application).

Alright, understood.


On Sat, Jun 22, 2024 at 12:47 AM Christopher Samuel via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> On 6/21/24 3:50 am, Arnuld via slurm-users wrote:
>
> > I have 3500+ GPU cores available. You mean each GPU job requires at
> > least one CPU? Can't we run a job with just GPU without any CPUs?
>
> No, Slurm has to launch the batch script on compute node cores and it
> then has the job of launching the users application that will run
> something on the node that will access the GPU(s).
>
> Even with srun directly from a login node there's still processes that
> have to run on the compute node and those need at least a core (and some
> may need more, depending on the application).
>
> --
> Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>
>



[slurm-users] Re: error: unpack_header: protocol_version 9472 not supported

2024-06-23 Thread Arnuld via slurm-users
I found the problem. It was not that this node was trying to reach some
machine; it was the other way around: another machine (running a controller
for a different Slurm cluster) had this node in its config, so that
controller was trying to reach this node. I removed the node from that
config and all is fine now.

On Wed, Jun 5, 2024 at 1:12 PM Arnuld  wrote:

> I have built Slurm 23.11.7 on two machines. Both are running Ubuntu 22.04.
> While Slurm runs fine on one machine, on the 2nd machine it does not. First
> machine is both a controller and a node while the 2nd machine is just a
> node. On both machines, I built the Slurm debian package as per the Slurm
> docs instructions. Slurmd logs show this:
>
>  error: unpack_header: protocol_version 9472 not supported
>  error: unpacking header
>  error: destroy_forward: no init
>  error: slurm_receive_msg_and_forward: [[host-4.attlocal.net]:38960]
> failed: Message receive failure
>  error: service_connection: slurm_receive_msg: Message receive failure
>  debug:  _service_connection: incomplete message
>
>
>



[slurm-users] Re: Using sharding

2024-07-05 Thread Arnuld via slurm-users
> On Fri, Jul 5, 2024 at 12:19 PM Ward Poelmans
>  via slurm-users  wrote:

> Hi Ricardo,
>
> It should show up like this:
>
> Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
>

What's the meaning of (S:0-1) here?
