[slurm-users] Re: Restricting local disk storage of jobs

2024-02-07 Thread Tim Schneider via slurm-users
…execution times.  The main question is "where does the tmpfs plugin find the quota limit for the job?" On Feb 6, 2024, at 08:39, Tim Schneider via slurm-users wrote: Hi, In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure that each user can use /tmp…

[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users
+0100, Tim Schneider wrote: Hi Magnus, thanks for your reply! If you can, would you mind sharing the InitScript of your attempt at getting it to work? Best, Tim On 06.02.24 15:19, Hagdorn, Magnus Karl Moritz wrote: Hi Tim, we are using the container/tmpfs plugin to map /tmp to a local NVMe

[slurm-users] Re: [ext] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users
…lot of local scratch space. I don't think this happens very often if at all. Regards magnus [1] https://slurm.schedmd.com/job_container.conf.html#OPT_InitScript On Tue, 2024-02-06 at 14:39 +0100, Tim Schneider via slurm-users wrote: Hi, In our SLURM cluster, we are using the job_container/tmpfs plugin…
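The InitScript hook referenced above ([1]) is where a per-job quota on local scratch could be applied. As a sketch (not taken from the thread): assuming BasePath points at an XFS filesystem mounted with prjquota, and assuming SLURM_JOB_ID is present in the script's environment, an InitScript could set an XFS project quota on the job's directory:

```shell
#!/bin/bash
# Hypothetical InitScript for job_container.conf (configured via InitScript=).
# Assumptions: BasePath=/local/tmpfs_jobs lies on XFS mounted with prjquota,
# and SLURM_JOB_ID is exported to this script by the plugin.
set -euo pipefail

BASE=/local/tmpfs_jobs            # must match BasePath in job_container.conf
LIMIT=50g                         # per-job scratch cap (site policy, illustrative)
JOBDIR="$BASE/$SLURM_JOB_ID"

# Use the job ID as the XFS project ID and cap its block usage.
xfs_quota -x -c "project -s -p $JOBDIR $SLURM_JOB_ID" "$BASE"
xfs_quota -x -c "limit -p bhard=$LIMIT $SLURM_JOB_ID" "$BASE"
```

The paths, the 50g limit, and the use of the job ID as project ID are all illustrative; check the job_container.conf documentation for what your Slurm version actually passes to InitScript.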

[slurm-users] Restricting local disk storage of jobs

2024-02-06 Thread Tim Schneider via slurm-users
Hi, In our SLURM cluster, we are using the job_container/tmpfs plugin to ensure that each user can use /tmp and it gets cleaned up after them. Currently, we are mapping /tmp into the nodes RAM, which means that the cgroups make sure that users can only use a certain amount of storage inside /tmp…
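For reference, a minimal configuration for the setup described in this message might look as follows (illustrative values, not quoted from the thread; JobContainerType belongs in slurm.conf, the rest in job_container.conf):

```
# slurm.conf
JobContainerType=job_container/tmpfs

# job_container.conf
AutoBasePath=true
BasePath=/local/scratch    # node-local path; the thread's setup maps this into RAM
```

With BasePath on a RAM-backed filesystem, the memory cgroup bounds usage as described above; moving it onto local disk is exactly what raises the quota question in this thread.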

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-24 Thread Tim Schneider
…worked until this recent change, then other kernel versions should show the same behavior. But as far as I can tell it still works just fine with newer kernels. Cheers, Stefan On Tue, 23 Jan 2024 15:20:56 +0100 Tim Schneider wrote: Hi, I have filed a bug report with SchedMD (https://bugs.schedmd.

Re: [slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-23 Thread Tim Schneider
…we should check? On Thu, Jan 4, 2024 at 3:03 PM Tim Schneider wrote: Hi, I am using SLURM 22.05.9 on a small compute cluster. Since I reinstalled two of our nodes, I get the following error when launching a job: slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK)…

[slurm-users] slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK).

2024-01-04 Thread Tim Schneider
Hi, I am using SLURM 22.05.9 on a small compute cluster. Since I reinstalled two of our nodes, I get the following error when launching a job: slurmstepd: error: load_ebpf_prog: BPF load error (No space left on device). Please check your system limits (MEMLOCK). Also the cgroups do not seem
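The error text points at the locked-memory limit of the slurmd process. A commonly suggested remedy (an assumption here, not the thread's confirmed resolution) is to raise MEMLOCK for slurmd via a systemd drop-in on the affected nodes:

```
# /etc/systemd/system/slurmd.service.d/memlock.conf
[Service]
LimitMEMLOCK=infinity
```

followed by `systemctl daemon-reload && systemctl restart slurmd`.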

Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-25 Thread Tim Schneider
then nextstate is irrelevant. We always use "reboot ASAP" because our cluster is usually so busy that nodes never become idle if left to themselves :-) FYI: We regularly make package updates and firmware updates using the "scontrol reboot asap" method which is explained in

Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-25 Thread Tim Schneider
ion! Best, tim On 25.10.23 02:10, Christopher Samuel wrote: On 10/24/23 12:39, Tim Schneider wrote: Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME ", the node goes in "mix@" state (not drain), but no new jobs get scheduled until the node reboots. Esse

[slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-24 Thread Tim Schneider
Hi, from my understanding, if I run "scontrol reboot <nodelist>", the node should continue to operate as usual and reboot once it is idle. When adding the ASAP flag (scontrol reboot ASAP <nodelist>), the node should go into drain state and not accept any more jobs. Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME <nodelist>"…
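For context, the variants discussed in this thread can be summarized as follows (node names are placeholders):

```shell
# Reboot when the node next becomes idle; keep scheduling jobs onto it:
scontrol reboot nextstate=RESUME node01

# Drain the node and reboot as soon as its running jobs finish:
scontrol reboot ASAP nextstate=RESUME node01

# Inspect what state the node ended up in:
scontrol show node node01 | grep -iE 'state|reason'
```

The reported problem is that the second form stops new jobs from being scheduled even though nextstate=RESUME is set.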

Re: [slurm-users] task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04

2023-06-17 Thread Tim Schneider
Tim Schneider wrote: Hi again, I just realized that someone in https://groups.google.com/g/slurm-users/c/0dJhe5r6_2Q?pli=1 wrote at some point that he built Slurm 22 instead of using the Ubuntu repo version. So I guess I will have to look into that. Best, Tim On 6/16/23 10:36, Tim Schneider wrote: Hi Abel

[slurm-users] Fwd: task/cgroup plugin causes "srun: error: task 0 launch failed: Plugin initialization failed" error on Ubuntu 22.04

2023-06-15 Thread Tim Schneider
Hi, I am maintaining the SLURM cluster of my research group. Recently I updated to Ubuntu 22.04 and Slurm 21.08.5 and ever since, I am unable to launch jobs. When launching a job, I receive the following error: /$ srun --nodes=1 --ntasks-per-node=1 -c 1 --mem-per-cpu 1G --time=01:00:00 --pty
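A likely culprit, consistent with the follow-up above about building Slurm 22 (a hedged guess, not the thread's confirmed fix): Ubuntu 22.04 boots with cgroup v2 by default, while the distro-packaged Slurm 21.08.5 task/cgroup plugin expects cgroup v1. Two workarounds are commonly used: build Slurm >= 22.05, which ships cgroup/v2 support, or force the legacy hierarchy at boot:

```
# /etc/default/grub -- append to the existing GRUB_CMDLINE_LINUX value
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=false"
# then: update-grub && reboot
```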