[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo
Hi Robert,

On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:

> We switched over from using systemctl for tmp.mount and changed to zram, e.g.:
>
>     modprobe zram
>     echo 20GB > /sys/block/zram0/disksize
>     mkfs.xfs /dev/zram0
>     mount -o discard /dev/zram0 /tmp
>
> [...]
>
> [2024-02-23T20:26:15.881] [530.extern] error: setup_x11_forward: failed to create temporary XAUTHORITY file: Permission denied

Where do you set the permissions on /tmp? What do you set them to?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
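For anyone else hitting this: a freshly made XFS filesystem mounted on /tmp will be owned by root with mode 0755, so per-user things like X11 forwarding will fail until the usual sticky, world-writable permissions are restored. A minimal sketch, assuming the standard 1777 /tmp convention:

    # after mounting the zram device on /tmp, restore the usual permissions
    mount -o discard /dev/zram0 /tmp
    chmod 1777 /tmp              # sticky bit + world-writable, the normal /tmp mode
    stat -c '%a %U:%G %n' /tmp   # should report: 1777 root:root /tmp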
[slurm-users] Re: Is SWAP memory mandatory for SLURM
On 3/3/24 23:04, John Joseph via slurm-users wrote:

> Is SWAP a mandatory requirement

All our compute nodes are diskless, so no swap on them.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote:

> In our case, that node has been removed from the cluster and cannot be added back right now (it is being used for some other work). What can we do in such a case?

Mark the node as "DOWN" in Slurm; that's what we do when we get jobs caught in this state (and there's nothing else running on the node, in the case of our shared nodes).

Best of luck!

Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
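For reference, marking a node down is a one-liner with scontrol (the node name and reason here are just placeholders):

    scontrol update NodeName=node001 State=DOWN Reason="stuck completing job"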
[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64
On 5/4/24 4:24 am, Nuno Teixeira via slurm-users wrote:

> Any clues?
>
> ld: error: unknown emulation: elf_aarch64

All I can think is that your ld doesn't like elf_aarch64; from the log you're posting it looks like that's being injected by the FreeBSD ports system.

Looking at the man page for ld on Linux it says:

    -m emulation
        Emulate the emulation linker. You can list the available
        emulations with the --verbose or -V options.

So I'd guess you'd need to look at what that version of ld supports and then update the ports system to match.

Good luck!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
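As a quick check of which emulations your linker actually knows about (exact output differs between GNU ld and LLVM's lld, so treat this as a sketch):

    # list supported emulations; on aarch64 FreeBSD this typically shows
    # "aarch64elf" rather than the Linux-style "elf_aarch64"
    ld -V | grep -i aarch64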
[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64
On 5/6/24 6:38 am, Nuno Teixeira via slurm-users wrote:

> Any clues about "elf_aarch64" and "aarch64elf" mismatch?

As I mentioned, I think this is coming from the FreeBSD patching that's being done to the upstream Slurm sources. Specifically, it looks like elf_aarch64 is being injected here:

    /usr/bin/sed -i.bak -e 's|"/proc|"/compat/linux/proc|g' \
        -e 's|(/proc)|(/compat/linux/proc)|g' \
        /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/slurmd/slurmstepd/req.c
    /usr/bin/find /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/api \
        /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/plugins/openapi \
        /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/sacctmgr \
        /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/sackd \
        /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/scontrol \
        /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/scrontab \
        /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/scrun \
        /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/slurmctld \
        /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/slurmd/slurmd \
        /wrkdirs/usr/ports/sysutils/slurm-wlm/work/slurm-23.11.6/src/squeue \
        -name Makefile.in | /usr/bin/xargs /usr/bin/sed -i.bak -e 's|-r -o|-r -m elf_aarch64 -o|'

So I guess that will need to be fixed to match what FreeBSD supports. I don't think this is a Slurm issue from what I see there.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64
On 5/6/24 3:19 pm, Nuno Teixeira via slurm-users wrote:

> Fixed with: [...]
>
> Thanks and sorry for the noise as I really missed this detail :)

So glad it helped! Best of luck with this work.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: Location of Slurm source packages?
Hi Jeff!

On 5/15/24 10:35 am, Jeffrey Layton via slurm-users wrote:

> I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu packages. I now want to install pyxis but it says I need the Slurm sources. In Ubuntu 22.04, is there a package that has the source code? How do I download the sources I need from github?

You shouldn't need Github; this should give you what you are after (especially the "Download slurm-wlm" section at the end):

https://packages.ubuntu.com/source/jammy/slurm-wlm

Hope that helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
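If you'd rather fetch the source from the command line, the usual Debian/Ubuntu route is apt's source facility (it needs a deb-src entry enabled in your apt sources):

    # enable the deb-src entries in /etc/apt/sources.list first, then:
    sudo apt-get update
    apt-get source slurm-wlm           # unpacks the packaged Slurm source tree
    sudo apt-get build-dep slurm-wlm   # optional: pull in the build dependencies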
[slurm-users] Re: Building Slurm debian package vs building from source
On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote:

> A simple example is when you have nodes with and without GPUs. You can build slurmd packages without for those nodes and with for the ones that have them.

FWIW we have both GPU and non-GPU nodes but we use the same RPMs we build on both (they all boot the same SLES15 OS image though).

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
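For anyone building their own RPMs the same way, the documented route is rpmbuild against the release tarball; a minimal sketch (the version number here is just an example):

    # build binary RPMs straight from the Slurm tarball
    rpmbuild -ta slurm-23.11.6.tar.bz2
    # the resulting packages land under ~/rpmbuild/RPMS/<arch>/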
[slurm-users] Re: Unsupported RPC version by slurmctld 19.05.3 from client slurmd 22.05.11
On 6/17/24 7:24 am, Bjørn-Helge Mevik via slurm-users wrote:

> Also, server must be newer than client.

This is the major issue for the OP - the version rule is:

    slurmdbd >= slurmctld >= slurmd and clients

with no more than the permitted skew in versions. Plus, of course, you have to deal with config file compatibility issues between versions.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
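A quick way to see what you're actually running on each host (paths may differ if you install outside the default prefix):

    # on the database, controller and compute hosts respectively
    slurmdbd -V
    slurmctld -V
    slurmd -V
    # and for the client commands
    sinfo --version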
[slurm-users] Re: Can Not Use A Single GPU for Multiple Jobs
On 6/21/24 3:50 am, Arnuld via slurm-users wrote:

> I have 3500+ GPU cores available. You mean each GPU job requires at least one CPU? Can't we run a job with just GPU without any CPUs?

No, Slurm has to launch the batch script on compute node cores, and that script then has the job of launching the user's application, which runs something on the node that accesses the GPU(s).

Even with srun directly from a login node there are still processes that have to run on the compute node, and those need at least a core (and some may need more, depending on the application).

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
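So a GPU job always carries at least one CPU with it; a minimal example of making that explicit (the partition name and application are assumptions for illustration):

    #!/bin/bash
    #SBATCH --partition=gpu          # hypothetical partition name
    #SBATCH --gres=gpu:1             # one GPU...
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1        # ...plus the one core the batch script and step need

    srun ./my_gpu_app                # placeholder application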
[slurm-users] Re: Upgrade node while jobs running
G'day Sid,

On 7/31/24 5:02 pm, Sid Young via slurm-users wrote:

> I've been waiting for nodes to become idle before upgrading them, however some jobs take a long time. If I try to remove all the packages I assume that kills the slurmstepd program and with it the job.

Are you looking to do a Slurm upgrade, an OS upgrade, or both?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: REST API - get_user_environment
On 8/15/24 7:04 am, jpuerto--- via slurm-users wrote:

> I am referring to the REST API. We have had it installed for a few years and have recently upgraded it so that we can use v0.0.40. But this most recent version is missing the "get_user_environment" field which existed in previous versions.

I had a look at the code in Slurm 23.11 and it looks like it is in the v0.0.38 but not in the v0.0.39 version there. It looks like the code was restructured significantly around that time, so I'm not competent to say if this is because it moved elsewhere and I'm not seeing it, or if it got dropped then.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
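If anyone wants to check for themselves, a simple grep over an unpacked Slurm source tree shows which OpenAPI plugin versions still carry the field (the directory layout here is what I'd expect for 23.11, so treat it as a sketch):

    # from the top of an unpacked slurm-23.11.x source tree
    grep -rl get_user_environment src/plugins/openapi/ src/slurmrestd/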
[slurm-users] Re: Randomly draining nodes
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:

> It seems like there's an issue with the termination process on these nodes. Any thoughts on what could be causing this?

That usually means processes wedged in the kernel for some reason, in an uninterruptible sleep state.

You can define an "UnkillableStepProgram" to be run on the node when that happens, to capture useful state info. That script can do things like iterate through the processes in the job's cgroup dumping their `/proc/$PID/stack` somewhere useful, get the `ps` info for all those same processes, and/or do an `echo w > /proc/sysrq-trigger` to make the kernel dump all blocked tasks.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
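Wiring that up is just two slurm.conf settings; the path and timeout below are assumptions, not recommendations:

    # slurm.conf (on the compute nodes)
    UnkillableStepProgram=/usr/local/sbin/unkillable-step.sh   # hypothetical path
    UnkillableStepTimeout=180                                  # seconds before it fires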
[slurm-users] Re: Job pre / post submit scripts
On 10/28/24 10:56 am, Bhaskar Chakraborty via slurm-users wrote:

> Is there an option in slurm to launch a custom script at the time of job submission through sbatch or salloc? The script should run with submit user permission in submit area.

I think you are after the cli_filter functionality, which can run plugins in that environment. There is a Lua plugin for that which will allow you to write your code in something a little less fraught than C.

https://slurm.schedmd.com/cli_filter_plugins.html

There is example Lua code for this here:

https://github.com/SchedMD/slurm/blob/master/etc/cli_filter.lua.example

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
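Enabling the Lua variant is a one-line slurm.conf change plus a script next to your config; a minimal sketch:

    # slurm.conf
    CliFilterPlugins=cli_filter/lua

    # then place a cli_filter.lua alongside slurm.conf (e.g. /etc/slurm/cli_filter.lua),
    # starting from the cli_filter.lua.example shipped in the Slurm source.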
[slurm-users] Re: Randomly draining nodes
Hi Ole,

On 10/22/24 11:04 am, Ole Holm Nielsen via slurm-users wrote:

> Some time ago it was recommended that UnkillableStepTimeout values above 127 (or 256?) should not be used, see https://support.schedmd.com/show_bug.cgi?id=11103. I don't know if this restriction is still valid with recent versions of Slurm?

As I read it, that last comment includes a commit message for the fix to that problem, and we happily use a much longer timeout than that without apparent issue.

https://support.schedmd.com/show_bug.cgi?id=11103#c30

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: sinfo not listing any partitions
On 11/27/24 11:38 am, Kent L. Hanson via slurm-users wrote:

> I have restarted the slurmctld and slurmd services several times. I hashed the slurm.conf files. They are the same. I ran "sinfo -a" as root with the same result.

Are your nodes in the `FUTURE` state perhaps? What does this show?

    sinfo -aFho "%N %T"

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld
On 2/3/25 2:33 pm, Steven Jones via slurm-users wrote:

> Just built 4 x rocky9 nodes and I do not get that error (but I get another I know how to fix, I think) so holistically I am thinking the version difference is too large.

Oh I think I missed this - when you say version difference do you mean the Slurm version or the distro version? I was assuming you were building your Slurm versions yourselves for both, but that may be way off the mark, sorry!

What are the Slurm versions everywhere?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: jobs getting stuck in CG
On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:

> I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An additional symptom was that the blocking process was stuck in the D state.

We've seen the same behaviour, though in our case we use an "UnkillableStepProgram" to deal with compute nodes where user processes (as opposed to Slurm daemons, which sounds like the issue for the original poster here) get stuck and are unkillable.

Our script does things like "echo w > /proc/sysrq-trigger" to get the kernel to dump its view of all stuck processes, and then it goes through the stuck job's cgroup to find all the processes and dumps /proc/$PID/stack for each process and thread it finds there.

In the end it either marks the node down (if it's the only job on the node, which will mark the job as complete in Slurm, though it will not free up those stuck processes) or drains the node if it's running multiple jobs. In both cases we'll come back and check the issue out (and our SREs will wake us up if they think there's an unusual number of these).

That final step is important because a node stuck completing can really confuse backfill scheduling for us, as slurmctld assumes it will become free any second now and tries to use the node when planning jobs, despite it being stuck. So marking it down/drained gets it out of slurmctld's view as a potential future node.

For nodes where a Slurm daemon on the node is stuck that script will not fire, so our SREs have alarms that trip after a node has been completing for longer than a certain amount of time. They go and look at what's going on and get the node out of the system before utilisation collapses (and wake us up if that number seems to be increasing).

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
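Not our actual script, but a minimal sketch of an UnkillableStepProgram along those lines, assuming cgroup v1-style paths and that the job ID is available as SLURM_JOB_ID (check what your Slurm version actually passes); adjust everything for your own environment:

    #!/bin/bash
    # Hypothetical UnkillableStepProgram sketch - not a drop-in solution.

    LOG=/var/log/slurm/unkillable-${SLURM_JOB_ID:-unknown}.log
    exec >>"$LOG" 2>&1

    # Ask the kernel to dump all blocked (D-state) tasks to the kernel log.
    echo w > /proc/sysrq-trigger

    # Walk the job's cgroup and record state for every task in it.
    for procs in /sys/fs/cgroup/*/slurm*/uid_*/job_"${SLURM_JOB_ID}"/cgroup.procs; do
        [ -r "$procs" ] || continue
        while read -r pid; do
            echo "=== PID $pid ==="
            ps -o pid,stat,wchan:32,cmd -p "$pid"
            cat "/proc/$pid/stack" 2>/dev/null
        done < "$procs"
    done

    # Finally take the node out of scheduling so backfill stops planning on it.
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="unkillable step in job ${SLURM_JOB_ID}"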
[slurm-users] Re: mariadb refusing access
On 3/4/25 5:23 pm, Steven Jones via slurm-users wrote:

> However mysql -u slurm -p works just fine so it seems to be a config error for slurmdbd

Try:

    mysql -h 127.0.0.1 -u slurm -p

IIRC without that it'll use a UNIX domain socket and not try to connect via TCP/IP.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
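In other words, the socket test can succeed while the TCP path slurmdbd uses is still denied; if that turns out to be the case, the usual fix is a grant for the host slurmdbd connects from. A sketch only, with the database name and password as placeholders:

    mysql -u root -p <<'SQL'
    GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost' IDENTIFIED BY 'CHANGEME';
    GRANT ALL ON slurm_acct_db.* TO 'slurm'@'127.0.0.1' IDENTIFIED BY 'CHANGEME';
    FLUSH PRIVILEGES;
    SQL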
[slurm-users] Re: errors while trying to setup slurmdbd.
Hi Steven,

On 4/9/25 5:00 pm, Steven Jones via slurm-users wrote:

> Apr 10 10:28:52 vuwunicohpcdbp1.ods.vuw.ac.nz slurmdbd[2413]: slurmdbd: fatal: This host not configured to run SlurmDBD ((vuwunicohpcdbp1 or vuwunicohp>

^^^ that's the critical error message, and it's reporting that because slurmdbd.conf has:

    DbdHost=vuwunicoslurmrp1.ods.vuw.ac.nz

That needs to match the hostname where you want to run slurmdbd.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
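So, assuming slurmdbd really is meant to run on vuwunicohpcdbp1, the slurmdbd.conf line would need to become something like the following (the short hostname may also work, depending on what `hostname` returns on that box):

    # slurmdbd.conf
    DbdHost=vuwunicohpcdbp1.ods.vuw.ac.nz   # must match the host slurmdbd runs on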
[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm
Hiya,

On 4/15/25 7:03 pm, lyz--- via slurm-users wrote:

> Hi, Chris. Thank you for continuing to pay attention to this issue. I followed your instruction, and this is the output:
>
> [root@head1 ~]# systemctl cat slurmd | fgrep Delegate
> Delegate=yes

That looks good to me, thanks for sharing that!

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm
On 4/15/25 6:57 pm, lyz--- via slurm-users wrote:

> Hi, Sean. It's the latest slurm version.
>
> [root@head1 ~]# sinfo --version
> slurm 22.05.3

That's quite old (and no longer supported); the oldest still supported version is 23.11.10, and 24.11.4 came out recently.

What does the cgroup.conf file on one of your compute nodes look like?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
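For comparison, a fairly typical cgroup.conf that actually enforces device (GPU) confinement looks something like this; the exact set of options is a sketch, not a recommendation for any particular site:

    # cgroup.conf (example only)
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes   # this is the line that stops jobs seeing GPUs they weren't allocated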
[slurm-users] Re: [EXT] Re: Issue with Enforcing GPU Usage Limits in Slurm
On 4/15/25 12:55 pm, Sean Crosby via slurm-users wrote:

> What version of Slurm are you running and what's the contents of your gres.conf file?

Also what does this say?

    systemctl cat slurmd | fgrep Delegate

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] Re: Issue with Enforcing GPU Usage Limits in Slurm
On 4/14/25 6:27 am, lyz--- via slurm-users wrote:

> This command is intended to limit user 'lyz' to using a maximum of 2 GPUs. However, when the user submits a job using srun, specifying CUDA 0, 1, 2, and 3 in the job script, or os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3", the job still utilizes all 4 GPUs during execution. This indicates that the GPU usage limit is not being enforced as expected. How can I resolve this situation?

You need to make sure you're using cgroups to control access to devices for tasks; a starting point for reading up on this is here:

https://slurm.schedmd.com/cgroups.html

Good luck!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
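The key pieces are the cgroup-based process-tracking and task plugins in slurm.conf, plus device constraint in cgroup.conf and a correct gres.conf. A minimal sketch (the gres.conf device paths are assumptions for a 4-GPU NVIDIA node):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity
    GresTypes=gpu

    # cgroup.conf
    ConstrainDevices=yes

    # gres.conf (example for a 4-GPU node)
    Name=gpu File=/dev/nvidia[0-3]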