[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo

2024-02-23 Thread Christopher Samuel via slurm-users

Hi Robert,

On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:

We switched over from using systemctl for tmp.mount and changed to zram, e.g.:

modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp

[...]
> [2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied


Where do you set the permissions on /tmp? What do you set them to?
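For what it's worth, a freshly created filesystem mounts with a root-owned 0755 top-level directory, which would explain slurmstepd's "Permission denied" when creating the XAUTHORITY file. A minimal sketch of the usual extra step (assuming the standard sticky-bit /tmp semantics):

mount -o discard /dev/zram0 /tmp
chmod 1777 /tmp    # world-writable with the sticky bit, as /tmp requires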

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Is SWAP memory mandatory for SLURM

2024-03-04 Thread Christopher Samuel via slurm-users

On 3/3/24 23:04, John Joseph via slurm-users wrote:


Is SWAP a mandatory requirement?


All our compute nodes are diskless, so no swap on them.

--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Christopher Samuel via slurm-users

On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote:


In our case, that node has been removed from the cluster and cannot be
added back right now (it is being used for some other work). What can we
do in such a case?


Mark the node as "DOWN" in Slurm; this is what we do when we get jobs 
caught in this state (and there's nothing else on the node, for our 
shared nodes).
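For example (node name illustrative):

scontrol update NodeName=node001 State=DOWN Reason="stuck completing"

and once it's been sorted out you can bring it back with State=RESUME.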


Best of luck!
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64

2024-05-04 Thread Christopher Samuel via slurm-users

On 5/4/24 4:24 am, Nuno Teixeira via slurm-users wrote:


Any clues?

 > ld: error: unknown emulation: elf_aarch64


All I can think is that your ld doesn't like elf_aarch64; from the log 
you're posting it looks like that's being injected by the FreeBSD ports 
system. Looking at the man page for ld on Linux it says:


   -m emulation
       Emulate the emulation linker.  You can list the available
       emulations with the --verbose or -V options.


So I'd guess you'd need to look at what that version of ld supports and 
then update the ports system to match.
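For instance, with GNU ld this lists the supported emulations (I'm not sure offhand whether FreeBSD's lld behaves the same):

ld -V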


Good luck!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64

2024-05-06 Thread Christopher Samuel via slurm-users

On 5/6/24 6:38 am, Nuno Teixeira via slurm-users wrote:


Any clues about "elf_aarch64" and "aarch64elf" mismatch?


As I mentioned, I think this is coming from the FreeBSD patching that's 
being done to the upstream Slurm sources; specifically, it looks like 
elf_aarch64 is being injected here:


/usr/bin/sed -i.bak -e 's|"/proc|"/compat/linux/proc|g' -e 's|(/proc)|(/compat/linux/proc)|g' /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/slurmd/slurmstepd/req.c

/usr/bin/find /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/api \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/plugins/openapi \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/sacctmgr \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/sackd \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/scontrol \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/scrontab \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/scrun \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/slurmctld \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/slurmd/slurmd \
    /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/squeue \
    -name Makefile.in | /usr/bin/xargs /usr/bin/sed -i.bak -e 's|-r -o|-r -m elf_aarch64 -o|'


So I guess that will need to be fixed to match what FreeBSD supports.
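(If, say, the FreeBSD linker calls that emulation aarch64elf rather than elf_aarch64, then presumably that final substitution would become something like:

's|-r -o|-r -m aarch64elf -o|'

but that's a guess from the error, not something I've tested.)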

I don't think this is a Slurm issue from what I see there.

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64

2024-05-06 Thread Christopher Samuel via slurm-users

On 5/6/24 3:19 pm, Nuno Teixeira via slurm-users wrote:


Fixed with:


[...]


Thanks and sorry for the noise as I really missed this detail :)


So glad it helped! Best of luck with this work.

--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Christopher Samuel via slurm-users

Hi Jeff!

On 5/15/24 10:35 am, Jeffrey Layton via slurm-users wrote:

I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu 
packages. I now want to install pyxis but it says I need the Slurm 
sources. In Ubuntu 22.04, is there a package that has the source code? 
How do I download the sources I need from GitHub?


You shouldn't need GitHub; this should give you what you're after 
(especially the "Download slurm-wlm" section at the end):


https://packages.ubuntu.com/source/jammy/slurm-wlm
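Alternatively, if you have a deb-src line enabled in your apt sources, this should fetch and unpack that same source package locally:

apt-get source slurm-wlm

(run as a regular user in a scratch directory; it doesn't need root).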

Hope that helps!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Building Slurm debian package vs building from source

2024-05-23 Thread Christopher Samuel via slurm-users

On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote:


A simple example is when you have nodes with and without GPUs.
You can build slurmd packages without for those nodes and with for the 
ones that have them.


FWIW we have both GPU and non-GPU nodes but we use the same RPMs we 
build on both (they all boot the same SLES15 OS image though).


--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Unsupported RPC version by slurmctld 19.05.3 from client slurmd 22.05.11

2024-06-17 Thread Christopher Samuel via slurm-users

On 6/17/24 7:24 am, Bjørn-Helge Mevik via slurm-users wrote:


Also, server must be newer than client.


This is the major issue for the OP - the version rule is:

slurmdbd >= slurmctld >= slurmd and clients

and no more than the permitted skew in versions (the newer daemons can 
talk to components up to two major releases older, but not the other 
way around).

Plus, of course, you have to deal with config file compatibility issues 
between versions.


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs

2024-06-21 Thread Christopher Samuel via slurm-users

On 6/21/24 3:50 am, Arnuld via slurm-users wrote:

I have 3500+ GPU cores available. You mean each GPU job requires at 
least one CPU? Can't we run a job with just GPU without any CPUs?


No: Slurm has to launch the batch script on compute node cores, and that 
script then has the job of launching the user's application, which is 
what actually runs on the node and accesses the GPU(s).


Even with srun run directly from a login node there are still processes 
that have to run on the compute node, and those need at least a core (and 
some may need more, depending on the application).
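In other words, even a pure GPU job needs a minimal CPU allocation, e.g. something like (script and GRES names illustrative):

#!/bin/bash
#SBATCH --gres=gpu:1        # one GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1   # at least one core for the batch script/launcher
srun ./my_gpu_app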


--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Upgrade node while jobs running

2024-08-02 Thread Christopher Samuel via slurm-users

G'day Sid,

On 7/31/24 5:02 pm, Sid Young via slurm-users wrote:

I've been waiting for nodes to become idle before upgrading them, however 
some jobs take a long time. If I try to remove all the packages I assume 
that kills the slurmstepd program and with it the job.


Are you looking to do a Slurm upgrade, an OS upgrade, or both?

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: REST API - get_user_environment

2024-08-15 Thread Christopher Samuel via slurm-users

On 8/15/24 7:04 am, jpuerto--- via slurm-users wrote:


I am referring to the REST API. We have had it installed for a few years and have 
recently upgraded it so that we can use v0.0.40. But this most recent version is missing 
the "get_user_environment" field which existed in previous versions.


I had a look at the code in Slurm 23.11 and it looks like it is in the 
v0.0.38 but not in the v0.0.39 version there. It looks like the code was 
restructured significantly around that time, so I'm not competent to say 
if this is because it moved elsewhere and I'm not seeing it, or if it 
got dropped then.


--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Randomly draining nodes

2024-10-21 Thread Christopher Samuel via slurm-users

On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:


It seems like there's an issue with the termination process on these nodes. Any 
thoughts on what could be causing this?


That usually means processes wedged in the kernel for some reason, in an 
uninterruptible sleep state. You can define an "UnkillableStepProgram" 
to be run on the node when that happens to capture useful state info. 
You can do that by doing things like iterating through the processes in 
the job's cgroup dumping their `/proc/$PID/stack` somewhere useful, 
getting the `ps` info for all those same processes, and/or doing an 
`echo w > /proc/sysrq-trigger` to make the kernel dump all blocked tasks.
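A minimal sketch of such a script (the cgroup path and log location are illustrative, and it assumes SLURM_JOB_ID is in the script's environment and a cgroup v2 layout; adjust for your setup):

#!/bin/bash
# Hypothetical UnkillableStepProgram: capture the state of stuck
# processes, then ask the kernel to log all blocked tasks.
LOG="/var/log/slurm/unkillable-${SLURM_JOB_ID:-unknown}.log"
CG="/sys/fs/cgroup/system.slice/slurmstepd.scope/job_${SLURM_JOB_ID}"
{
    date
    for pid in $(cat "${CG}"/cgroup.procs 2>/dev/null); do
        echo "=== PID ${pid} ==="
        ps -o pid,stat,wchan:30,args -p "${pid}"
        cat "/proc/${pid}/stack" 2>/dev/null
    done
    # Make the kernel dump all blocked (D state) tasks to the kernel log
    echo w > /proc/sysrq-trigger
} >> "${LOG}" 2>&1

Note that cgroup.procs in v2 only lists direct members, so a real script would walk the job's cgroup subtree recursively.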


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Job pre / post submit scripts

2024-10-28 Thread Christopher Samuel via slurm-users

On 10/28/24 10:56 am, Bhaskar Chakraborty via slurm-users wrote:

Is there an option in Slurm to launch a custom script at the time of job 
submission through sbatch or salloc? The script should run with the 
submitting user's permissions in the submission area.


I think you are after the cli_filter functionality which can run plugins 
in that environment. There is a Lua plugin for that which will allow you 
to write your code in something a little less fraught than C.


https://slurm.schedmd.com/cli_filter_plugins.html

There is example Lua code for this here:

https://github.com/SchedMD/slurm/blob/master/etc/cli_filter.lua.example

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Randomly draining nodes

2024-10-24 Thread Christopher Samuel via slurm-users

Hi Ole,

On 10/22/24 11:04 am, Ole Holm Nielsen via slurm-users wrote:

Some time ago it was recommended that UnkillableStepTimeout values above 
127 (or 256?) should not be used, see 
https://support.schedmd.com/show_bug.cgi?id=11103. I don't know if this 
restriction is still valid with recent versions of Slurm?


As I read it, that last comment includes a commit message for the fix to 
that problem, and we happily use a much longer timeout than that without 
apparent issue.


https://support.schedmd.com/show_bug.cgi?id=11103#c30

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: sinfo not listing any partitions

2024-11-27 Thread Christopher Samuel via slurm-users

On 11/27/24 11:38 am, Kent L. Hanson via slurm-users wrote:

I have restarted the slurmctld and slurmd services several times. I 
hashed the slurm.conf files. They are the same. I ran “sinfo -a” as root 
with the same result.


Are your nodes in the `FUTURE` state perhaps? What does this show?

sinfo -aFho "%N %T"

--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-03 Thread Christopher Samuel via slurm-users

On 2/3/25 2:33 pm, Steven Jones via slurm-users wrote:

Just built 4 x rocky9 nodes and I do not get that error (but I get 
another I know how to fix, I think) so holistically  I am thinking the 
version difference is too large.


Oh I think I missed this - when you say version difference do you mean 
the Slurm version or the distro version?


I was assuming you were building your Slurm versions yourselves for 
both, but that may be way off the mark, sorry!


What are the Slurm versions everywhere?

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread Christopher Samuel via slurm-users

On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:

I observed similar symptoms when we had issues with the shared Lustre 
file system. When the file system couldn't complete an I/O operation, 
the process in Slurm remained in the CG state until the file system 
became responsive again. An additional symptom was that the blocking 
process was stuck in the D state.


We've seen the same behaviour, though for us we use an 
"UnkillableStepProgram" to deal with compute nodes where user processes 
(as opposed to Slurm daemons, which sounds like the issue for the 
original poster here) get stuck and are unkillable.


Our script does things like "echo w > /proc/sysrq-trigger" to get the 
kernel to dump its view of all stuck processes, and then it goes through 
the stuck job's cgroup to find all the processes and dumps 
/proc/$PID/stack for each process and thread it finds there.


In the end it either marks the node down (if it's the only job on the 
node, which will mark the job as complete in Slurm, though it will not 
free up those stuck processes) or drains the node if it's running 
multiple jobs. In both cases we'll come back and check the issue out (and 
our SREs will wake us up if they think there's an unusual number of these).


That final step is important because a node stuck completing can really 
confuse backfill scheduling for us, as slurmctld assumes it will become 
free any second now and tries to use the node when planning jobs, despite 
it being stuck. So marking it down/drain gets it out of slurmctld's view 
as a potential future node.


For nodes where a Slurm daemon on the node is stuck that script will not 
fire, and so our SREs have alarms that trip after a node has been 
completing for longer than a certain amount of time. They go and look at 
what's going on and get the node out of the system before utilisation 
collapses (and wake us up if that number seems to be increasing).


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: mariadb refusing access

2025-03-04 Thread Christopher Samuel via slurm-users

On 3/4/25 5:23 pm, Steven Jones via slurm-users wrote:

However `mysql -u slurm -p` works just fine, so it seems to be a config 
error for slurmdbd


Try:

mysql -h 127.0.0.1 -u slurm -p

IIRC without that it'll try a UNIX domain socket rather than connecting 
via TCP/IP.
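It can also be worth checking that the MariaDB grants match how slurmdbd connects. A minimal sketch (slurm_acct_db is the conventional database name; adjust to your slurmdbd.conf):

mysql -u root -p -e "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';"

Note that 'slurm'@'localhost' doesn't necessarily cover TCP connections from 127.0.0.1 (e.g. if skip_name_resolve is set), in which case a separate 'slurm'@'127.0.0.1' grant is needed.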


--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: errors while trying to setup slurmdbd.

2025-04-09 Thread Christopher Samuel via slurm-users

Hi Steven,

On 4/9/25 5:00 pm, Steven Jones via slurm-users wrote:

Apr 10 10:28:52 vuwunicohpcdbp1.ods.vuw.ac.nz slurmdbd[2413]: slurmdbd: 
fatal: This host not configured to run SlurmDBD ((vuwunicohpcdbp1 or 
vuwunicohp>


^^^ that's the critical error message, and it's reporting that because 
slurmdbd.conf has:



DbdHost=vuwunicoslurmrp1.ods.vuw.ac.nz


That needs to match the hostname where you want to run slurmdbd.
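In this case that would presumably need to be something like:

DbdHost=vuwunicohpcdbp1

(or the fully qualified name, whichever matches what that host believes its hostname is).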

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm

2025-04-15 Thread Christopher Samuel via slurm-users

Hiya,

On 4/15/25 7:03 pm, lyz--- via slurm-users wrote:


Hi, Chris. Thank you for continuing to pay attention to this issue.
I followed your instruction, and this is the output:

[root@head1 ~]# systemctl cat slurmd | fgrep Delegate
Delegate=yes


That looks good to me, thanks for sharing that!

--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm

2025-04-15 Thread Christopher Samuel via slurm-users

On 4/15/25 6:57 pm, lyz--- via slurm-users wrote:


Hi, Sean. It's the latest slurm version.
[root@head1 ~]# sinfo --version
slurm 22.05.3


That's quite old (and no longer supported); the oldest still-supported 
version is 23.11.10, and 24.11.4 came out recently.


What does the cgroup.conf file on one of your compute nodes look like?

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm

2025-04-15 Thread Christopher Samuel via slurm-users

On 4/15/25 12:55 pm, Sean Crosby via slurm-users wrote:

What version of Slurm are you running and what's the contents of your 
gres.conf file?


Also what does this say?

systemctl cat slurmd | fgrep Delegate

--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Issue with Enforcing GPU Usage Limits in Slurm

2025-04-14 Thread Christopher Samuel via slurm-users

On 4/14/25 6:27 am, lyz--- via slurm-users wrote:


This command is intended to limit user 'lyz' to using a maximum of 2 
GPUs. However, when the user submits a job using srun, specifying CUDA 
0, 1, 2, and 3 in the job script, or 
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3", the job still utilizes 
all 4 GPUs during execution. This indicates that the GPU usage limit is 
not being enforced as expected. How can I resolve this situation?


You need to make sure you're using cgroups to control access to devices 
for tasks; a starting point for reading up on this is here:


https://slurm.schedmd.com/cgroups.html
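The key pieces are the cgroup task plugin in slurm.conf and the device constraint in cgroup.conf; a minimal sketch (the rest of your config will differ):

# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainDevices=yes

With that in place a job that requests (say) 2 GPUs only has its two allocated devices visible, no matter what CUDA_VISIBLE_DEVICES is set to.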

Good luck!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com