Hi Robert,
On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:
We switched over from using systemctl for tmp.mount and changed to zram,
e.g.:
modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp
[...]
> [2024-02-23T20:26:15.881] [530.exter
On 3/3/24 23:04, John Joseph via slurm-users wrote:
Is SWAP a mandatory requirement?
All our compute nodes are diskless, so no swap on them.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote:
In our case, that node has been removed from the cluster and cannot be
added back right now (it is being used for some other work). What can we
do in such a case?
Mark the node as "DOWN" in Slurm, this is what we do when we get job
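For example, with a placeholder node name:

scontrol update NodeName=node001 State=DOWN Reason="removed from cluster for other work"

That keeps the scheduler from trying to use it until you bring it back
(e.g. with State=RESUME) once it returns.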
On 5/4/24 4:24 am, Nuno Teixeira via slurm-users wrote:
Any clues?
> ld: error: unknown emulation: elf_aarch64
All I can think is that your ld doesn't like elf_aarch64; from the log
you're posting it looks like that's being injected from the FreeBSD ports
system. Looking at the man page for ld on
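If it's GNU ld from binutils you can see which emulations it supports
with something like:

ld -V | grep -i aarch64

though I'm not sure offhand what the lld-based linker FreeBSD ships
reports for that.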
On 5/6/24 6:38 am, Nuno Teixeira via slurm-users wrote:
Any clues about "elf_aarch64" and "aarch64elf" mismatch?
As I mentioned I think this is coming from the FreeBSD patching that's
being done to the upstream Slurm sources, specifically it looks like
elf_aarch64 is being injected here:
/
On 5/6/24 3:19 pm, Nuno Teixeira via slurm-users wrote:
Fixed with:
[...]
Thanks and sorry for the noise as I really missed this detail :)
So glad it helped! Best of luck with this work.
Hi Jeff!
On 5/15/24 10:35 am, Jeffrey Layton via slurm-users wrote:
I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu
packages. I now want to install pyxis but it says I need the Slurm
sources. In Ubuntu 22.04, is there a package that has the source code?
How to download them?
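Assuming the deb-src entries are enabled in your apt sources, something
like this should fetch them (slurm-wlm being the source package name, if
I remember right):

apt-get source slurm-wlm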
On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote:
A simple example is when you have nodes with and without GPUs.
You can build slurmd packages without for those nodes and with for the
ones that have them.
FWIW we have both GPU and non-GPU nodes but we use the same RPMs we
build on both
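For instance, building from the release tarball with NVML support
switched on looks roughly like this (assuming the nvml build conditional
in the bundled slurm.spec and the NVML development files installed on the
build host):

rpmbuild -ta slurm-*.tar.bz2 --with nvml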
On 6/17/24 7:24 am, Bjørn-Helge Mevik via slurm-users wrote:
Also, server must be newer than client.
This is the major issue for the OP - the version rule is:
slurmdbd >= slurmctld >= slurmd and clients
and no more than the permitted skew in versions.
Plus, of course, you have to deal with
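A quick way to see what each piece is actually running:

slurmdbd -V
slurmctld -V
slurmd -V
sinfo --version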
On 6/21/24 3:50 am, Arnuld via slurm-users wrote:
I have 3500+ GPU cores available. You mean each GPU job requires at
least one CPU? Can't we run a job with just GPU without any CPUs?
No, Slurm has to launch the batch script on compute node cores, and it
then has the job of launching the user's processes.
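So even a minimal GPU job carries at least one CPU with it, for example
(script and binary names are just placeholders):

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
srun ./my_gpu_program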
G'day Sid,
On 7/31/24 5:02 pm, Sid Young via slurm-users wrote:
I've been waiting for node to become idle before upgrading them however
some jobs take a long time. If I try to remove all the packages I assume
that kills the slurmstepd program and with it the job.
Are you looking to do a Slurm upgrade?
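If it's the usual rolling-upgrade situation, draining lets running jobs
finish while keeping new work off the node, e.g. (placeholder node name):

scontrol update NodeName=node001 State=DRAIN Reason="slurm upgrade"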
On 8/15/24 7:04 am, jpuerto--- via slurm-users wrote:
I am referring to the REST API. We have had it installed for a few years and have
recently upgraded it so that we can use v0.0.40. But this most recent version is missing
the "get_user_environment" field which existed in previous versions.
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:
It seems like there's an issue with the termination process on these nodes. Any
thoughts on what could be causing this?
That usually means processes wedged in the kernel for some reason, in an
uninterruptible sleep state. You can define an UnkillableStepProgram in
slurm.conf to collect debugging information when that happens.
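For example, a slurm.conf fragment along these lines (the script path is
just a placeholder):

UnkillableStepTimeout=120
UnkillableStepProgram=/usr/local/sbin/unkillable-debug.sh

and on the node itself you can usually spot the culprits by looking for
processes stuck in the D state:

ps -eo state,pid,user,wchan:32,cmd | awk '$1 == "D"'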
On 10/28/24 10:56 am, Bhaskar Chakraborty via slurm-users wrote:
Is there an option in slurm to launch a custom script at the time of job
submission through sbatch or salloc? The script should run with the
submitting user's permissions, in the submission directory.
I think you are after the cli_filter functionality.
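It's turned on in slurm.conf with:

CliFilterPlugins=lua

and then a cli_filter.lua alongside slurm.conf on the submit hosts; the
example cli_filter.lua in the Slurm source shows the hook functions it
expects. Unlike job_submit (which runs inside slurmctld), the cli_filter
hooks run in the sbatch/salloc/srun process itself, so they execute as
the submitting user in the submission directory.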
Hi Ole,
On 10/22/24 11:04 am, Ole Holm Nielsen via slurm-users wrote:
Some time ago it was recommended that UnkillableStepTimeout values above
127 (or 256?) should not be used, see
https://support.schedmd.com/show_bug.cgi?id=11103. I don't know if this
restriction is still valid with recent Slurm versions.
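You can check what a given system is actually using with:

scontrol show config | grep -i unkillable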
On 11/27/24 11:38 am, Kent L. Hanson via slurm-users wrote:
I have restarted the slurmctld and slurmd services several times. I
hashed the slurm.conf files. They are the same. I ran “sinfo -a” as root
with the same result.
Are your nodes in the `FUTURE` state perhaps? What does this show?
si
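For example, picking one of the missing nodes (the name is a
placeholder):

scontrol show node node001 | grep -i State

I think scontrol will still report a node that sinfo is hiding, though
I'm not certain that holds for the FUTURE state.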
On 2/3/25 2:33 pm, Steven Jones via slurm-users wrote:
Just built 4 x rocky9 nodes and I do not get that error (but I get
another I know how to fix, I think) so holistically I am thinking the
version difference is too large.
Oh I think I missed this - when you say version difference do you m
On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:
I observed similar symptoms when we had issues with the shared Lustre
file system. When the file system couldn't complete an I/O operation,
the process in Slurm remained in the CG state until the file system
became responsive again. An a
On 3/4/25 5:23 pm, Steven Jones via slurm-users wrote:
However mysql -u slurm -p works just fine so it seems to be a config
error for slurmdbd
Try:
mysql -h 127.0.0.1 -u slurm -p
IIRC without that it'll try a UNIX domain socket and not try and connect
via TCP/IP.
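If that works, the matching slurmdbd.conf entries would look something
like this (the password is obviously a placeholder):

StorageType=accounting_storage/mysql
StorageHost=127.0.0.1
StoragePort=3306
StorageUser=slurm
StoragePass=changeme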
Hi Steven,
On 4/9/25 5:00 pm, Steven Jones via slurm-users wrote:
Apr 10 10:28:52 vuwunicohpcdbp1.ods.vuw.ac.nz slurmdbd[2413]: slurmdbd:
fatal: This host not configured to run SlurmDBD ((vuwunicohpcdbp1 or
vuwunicohp>
^^^ that's the critical error message, and it's reporting that because
the local hostname doesn't match the DbdHost (or DbdBackupHost) value
set in slurmdbd.conf.
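Worth comparing what the box calls itself with what the config says,
e.g.:

hostname -s
grep -i DbdHost /etc/slurm/slurmdbd.conf

(adjust the path if your slurmdbd.conf lives elsewhere).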
Hiya,
On 4/15/25 7:03 pm, lyz--- via slurm-users wrote:
Hi, Chris. Thank you for continuing to pay attention to this issue.
I followed your instruction, and this is the output:
[root@head1 ~]# systemctl cat slurmd | fgrep Delegate
Delegate=yes
That looks good to me, thanks for sharing that!
On 4/15/25 6:57 pm, lyz--- via slurm-users wrote:
Hi, Sean. It's the latest slurm version.
[root@head1 ~]# sinfo --version
slurm 22.05.3
That's quite old (and no longer supported); the oldest still-supported
version is 23.11.10, and 24.11.4 came out recently.
What does the cgroup.conf file on that node look like?
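For GPU limits the interesting part is whether devices are constrained,
i.e. whether cgroup.conf has something like:

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes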
On 4/15/25 12:55 pm, Sean Crosby via slurm-users wrote:
What version of Slurm are you running and what's the contents of your
gres.conf file?
Also what does this say?
systemctl cat slurmd | fgrep Delegate
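For reference, a hand-written gres.conf for a four-GPU node looks
something like this (node name and GPU type are placeholders; a single
AutoDetect=nvml line does the same job if slurmd was built with NVML
support):

NodeName=gpunode[01-04] Name=gpu Type=a100 File=/dev/nvidia[0-3]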
On 4/14/25 6:27 am, lyz--- via slurm-users wrote:
This command is intended to limit user 'lyz' to using a maximum of 2 GPUs.
However, when the user submits a job using srun and specifies CUDA devices
0, 1, 2, and 3 in the job script, or sets
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3", the job still utilises all
four GPUs.
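Presumably the limit was set with something along these lines (just an
illustration, not necessarily the exact command used):

sacctmgr modify user lyz set MaxTRESPerJob=gres/gpu=2

Note that the limit only restricts what the job can request; without
ConstrainDevices=yes in cgroup.conf the processes on the node can still
see and use every GPU, regardless of what was allocated.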