[slurm-users] Adding gres usage to the accounting database

2024-12-13 Thread Mark Dixon via slurm-users
Hello! Our jobs can ask for dedicated per-node disk space, e.g. "--gres=tmp:1G", where an ephemeral directory is managed by the site prolog/epilog and usage is capped using an xfs project quota. This works well, although we really need to look at job_container/tmpfs. I note that slurm already…
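A sketch of the accounting side of this, assuming a gres already defined as "tmp" (the lines below follow the documented AccountingStorageTRES syntax but are untested here; the node name and size are placeholders):

    # slurm.conf (sketch)
    GresTypes=tmp
    AccountingStorageTRES=gres/tmp
    # each node advertises its local disk as a gres, e.g.:
    NodeName=cn001 Gres=tmp:400G

With that in place, sacct and sreport should be able to report gres/tmp alongside the built-in TRES.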

Re: [slurm-users] temporary SLURM directories

2022-05-25 Thread Mark Dixon
In addition to the other suggestions, there's this: https://slurm.schedmd.com/faq.html#tmpfs_jobcontainer https://slurm.schedmd.com/job_container.conf.html I would be interested in hearing how well it works - it's so buried in the documentation that unfortunately I didn't see it until after I r…
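For anyone else looking at it, a minimal configuration sketch pieced together from those two pages (the BasePath is an example, not a recommendation):

    # slurm.conf
    JobContainerType=job_container/tmpfs
    PrologFlags=contain

    # job_container.conf
    AutoBasePath=true
    BasePath=/local/scratch

Each job then gets a private /tmp and /dev/shm mounted under BasePath and cleaned up at job end.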

Re: [slurm-users] Slurm 21.08.8-2 upgrade

2022-05-09 Thread Mark Dixon
On Thu, 5 May 2022, Legato, John (NIH/NHLBI) [E] wrote: ... We are in the process of upgrading from Slurm 21.08.6 to Slurm 21.08.8-2. We’ve upgraded the controller and a few partitions' worth of nodes. We notice the nodes are losing contact with the controller but slurmd is still up. We thought…

Re: [slurm-users] SLURM: reconfig

2022-05-06 Thread Mark Dixon
On Thu, 5 May 2022, Ole Holm Nielsen wrote: ... You're right, probably the correct order for Configless must be:
* stop slurmctld
* edit slurm.conf etc.
* start slurmctld
* restart the slurmd nodes to pick up the new slurm.conf
See also slides 29-34 in https://slurm.schedmd.com/SLUG21/Field_Notes_5…
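Spelled out as commands, that order looks roughly like this (a sketch assuming systemd units and pdsh with a "compute" host group, both local assumptions):

    # on the controller
    systemctl stop slurmctld
    $EDITOR /etc/slurm/slurm.conf        # make the change
    systemctl start slurmctld

    # then on the compute nodes
    pdsh -g compute 'systemctl restart slurmd'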

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Mark Dixon
On Thu, 5 May 2022, Ole Holm Nielsen wrote: ... That is correct. Just do "scontrol reconfig" on the slurmctld server. If all your slurmd's are truly running Configless[1], they will pick up the new config and reconfigure without restarting. Details are summarized in https://wiki.fysik.dtu.dk/n…

Re: [slurm-users] srun fails with "srun: error: Security violation, slurm message from uid" if delay in job starting

2021-12-14 Thread Mark Dixon
Hi all, Sorry for the noise, this was down to a problem with our configless setup. Really must start running slurmd everywhere and get rid of the compute-only version of slurm.conf... Cheers, Mark On Mon, 13 Dec 2021, Mark Dixon wrote: Hi all, Just wondering if anyone…

[slurm-users] srun fails with "srun: error: Security violation, slurm message from uid" if delay in job starting

2021-12-13 Thread Mark Dixon
Hi all, Just wondering if anyone else had seen this. Running slurm 21.08.2, we're seeing srun work normally if it is able to run immediately. However, if there is a delay in job start, for example after a wait for another job to end, srun fails. e.g.
[test@foo ~]$ srun -p test --pty bash…

Re: [slurm-users] Per-job TMPDIR: how to lookup gres allocation in prolog?

2021-11-17 Thread Mark Dixon
On Wed, 17 Nov 2021, Bjørn-Helge Mevik wrote: ... We are using basically the same setup, and have not found any other way than running "scontrol show job ..." in the prolog (even though it is not recommended). I have yet to see any problems arising from it, but YMMV. If you find a different way…
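A sketch of that prolog lookup, with the caveats above (the gres name "tmp", the parsing, and the paths are assumptions; scontrol's output format varies between releases, and /local must be xfs mounted with prjquota):

    #!/bin/bash
    # Prolog sketch (runs as root): create a per-job dir capped by an xfs project quota.
    # Uses the job id as the xfs project id.
    REQ=$(scontrol show job "$SLURM_JOB_ID" | grep -oE 'tmp:[0-9]+[KMGT]?' | head -n1 | cut -d: -f2)
    if [ -n "$REQ" ]; then
        DIR="/local/job_${SLURM_JOB_ID}"
        mkdir -p "$DIR"
        xfs_quota -x -c "project -s -p $DIR $SLURM_JOB_ID" /local
        xfs_quota -x -c "limit -p bhard=$REQ $SLURM_JOB_ID" /local
    fi
    exit 0

The matching epilog would drop the quota and remove the directory.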

[slurm-users] Per-job TMPDIR: how to lookup gres allocation in prolog?

2021-11-16 Thread Mark Dixon
Hi everyone, I'd like to configure slurm such that users can request an amount of disk space for TMPDIR... and for that request to be reserved and quota'd via commands like "sbatch --gres tmp:10G jobscript.sh". Probably reinventing someone's wheel, but I'm almost there. I have: - created a…

Re: [slurm-users] enable_configless, srun and DNS vs. hosts file

2021-11-16 Thread Mark Dixon
… (reply quoting Mark Dixon's original "enable_configless, srun and DNS vs. hosts file" message of 2021-11-10, shown below) …

[slurm-users] enable_configless, srun and DNS vs. hosts file

2021-11-10 Thread Mark Dixon
Hi, I'm using the "enable_configless" mode to avoid the need for a shared slurm.conf file, and am having similar trouble to others when running "srun", e.g.
srun: error: fwd_tree_thread: can't find address for host cn120, check slurm.conf
srun: error: Task launch for StepId=113.0 failed…
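For anyone hitting the same error: srun on the submit host has to resolve each compute node's hostname itself, so the nodes need to be in DNS or in /etc/hosts on every submit host. A minimal sketch (the addresses and second node are made up):

    # /etc/hosts on the submit host
    10.1.0.120  cn120
    10.1.0.121  cn121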

Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-25 Thread Mark Dixon
…way to continue the job while still 'fixing' the issue. That could be done in the TaskEpilog script (assuming your daemon user has permissions to do so). On 5/24/2021 8:56 AM, Mark Dixon wrote: Hi Brian, Thanks for replying. On our hardware, GPUs allocated to a job by cgroup sometimes…

Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-24 Thread Mark Dixon
…little more. On 5/24/2021 3:02 AM, Mark Dixon wrote: Hi all, Sometimes our compute nodes get into a failed state which we can only detect from inside the job environment. I can see that TaskProlog / TaskEpilog allows us to run our detection test; however, unlike Epilog and Prolog, they do not…

[slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-24 Thread Mark Dixon
Hi all, Sometimes our compute nodes get into a failed state which we can only detect from inside the job environment. I can see that TaskProlog / TaskEpilog allows us to run our detection test; however, unlike Epilog and Prolog, they do not drain a node if they exit with a non-zero exit code…
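One pattern for this (a sketch, not the thread's eventual resolution): have the in-job check write a flag file on failure, and let the root-run Epilog drain the node, since Epilog has enough privilege to call scontrol. The flag path and reason text are assumptions:

    #!/bin/bash
    # Epilog sketch: drain this node if the in-job health check left a flag file.
    FLAG=/var/run/slurm_health_failed        # written by TaskEpilog (assumed path)
    if [ -e "$FLAG" ]; then
        scontrol update NodeName="$(hostname -s)" State=DRAIN \
            Reason="in-job health check failed"
        rm -f "$FLAG"
    fi
    exit 0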

Re: [slurm-users] Fair share per partition

2020-09-17 Thread Mark Dixon
…for it, but both sets of hardware were busy. In this case, I think I can safely move back to scaling down the partition charge :) Cheers, Mark

Re: [slurm-users] Fair share per partition

2020-09-17 Thread Mark Dixon
On Thu, 17 Sep 2020, Paul Edmon wrote: So the way we handle it is that we give a blanket fairshare to everyone but then dial in our TRES charge back on a per-partition basis based on hardware. Our fairshare doc has a fuller explanation: https://docs.rc.fas.harvard.edu/kb/fairshare/ -Paul Ed…
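For reference, per-partition charging like Paul describes is normally done with TRESBillingWeights in slurm.conf (partition names and weights below are illustrative, not Harvard's actual values; other partition options are omitted):

    # slurm.conf, with PriorityType=priority/multifactor
    PartitionName=cpu TRESBillingWeights="CPU=1.0,Mem=0.25G"
    PartitionName=gpu TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=10.0"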

[slurm-users] Fair share per partition

2020-09-17 Thread Mark Dixon
…managing users, particularly as I've not figured out how to define a group of users in one place (say, a unix group) that I can then use multiple times. Is there a better way, please? Thanks, Mark

Re: [slurm-users] getting started with job_submit_lua

2020-09-16 Thread Mark Dixon
On Wed, 16 Sep 2020, Niels Carl W. Hansen wrote: If you explicitly specify the account, e.g. 'sbatch -A myaccount', then 'slurm.log_info("submit -- account %s", job_desc.account)' works. Great, thanks - that's working! Of course I have other problems…

Re: [slurm-users] getting started with job_submit_lua

2020-09-16 Thread Mark Dixon
On Wed, 16 Sep 2020, Diego Zuccato wrote: ... From the source it seems these fields are available:
account
comment
direct_set_prio
gres
job_id (always nil? maybe no JobID yet?)
job_state
licenses
max_cpus
max_nodes
min_…

[slurm-users] getting started with job_submit_lua

2020-09-15 Thread Mark Dixon
Hi all, I'm trying to get started with the lua job_submit feature and I have a really dumb question. This job_submit Lua script:

    function slurm_job_submit( job_desc, part_list, submit_uid )
      slurm.log_info("submit called lua plugin")
      for k,v in pairs(job_desc) do
        slurm.log_…
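For anyone starting in the same place, a minimal complete job_submit.lua sketch (per the replies above, many job_desc fields are nil unless the submitter set them explicitly, so test before logging):

    -- job_submit.lua sketch: log the account when set, then accept the job.
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.account ~= nil then
            slurm.log_info("submit -- account %s", job_desc.account)
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end

Enable it with JobSubmitPlugins=lua in slurm.conf and place the script next to slurm.conf.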

Re: [slurm-users] Drain a single user's jobs

2020-04-01 Thread Mark Dixon
…default qos for foo's jobs: "sacctmgr modify user foo set qos=drain defaultqos=drain". And then update the qos on all of foo's waiting jobs. I'll be using David's GrpSubmitJobs=0 suggestion instead. Thanks for everyone's help, Mark On Wed, 1 Apr 2020, Mark Dixon wrote:…
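The drain-qos recipe being described might look like this (the command forms are my reading of sacctmgr/scontrol, not taken verbatim from the thread; the job id is made up):

    # a qos that refuses new submissions
    sacctmgr add qos drain
    sacctmgr modify qos drain set GrpSubmitJobs=0
    # make it the user's only and default qos
    sacctmgr modify user foo set qos=drain defaultqos=drain
    # move each of foo's pending jobs onto it
    scontrol update jobid=1234 qos=drain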

Re: [slurm-users] Drain a single user's jobs

2020-04-01 Thread Mark Dixon
…run to completion and pending jobs won't start. Antony On Wed, 1 Apr 2020 at 10:57, Mark Dixon wrote: Hi all, I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster. I'd like to stop user foo from submitting new jobs but allow their existing jobs to run. We have…

Re: [slurm-users] Drain a single user's jobs

2020-04-01 Thread Mark Dixon
…done wrong. Best, Mark On Wed, 1 Apr 2020, mercan wrote: Hi; If you have a working job_submit.lua script, you can block new jobs from the specific user:

    if job_desc.user_name == "baduser" then
        return 2045
    end

That's all! Regards; Ahmet M. On 1.04.2020 16:22, …

Re: [slurm-users] Drain a single user's jobs

2020-04-01 Thread Mark Dixon
…we set the GrpSubmitJobs limit on an account to 0, which allowed in-flight jobs to continue but no new work to be submitted. HTH, David On Wed, Apr 1, 2020 at 5:57 AM Mark Dixon wrote: Hi all, I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster. I'd like to stop user foo…
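David's approach as commands, plus a per-user variant (the account name is a placeholder, and the MaxSubmitJobs line is my addition rather than something from the thread):

    # block new submissions for a whole account; running/pending jobs continue
    sacctmgr modify account projA set GrpSubmitJobs=0
    # per-user equivalent on the user's association
    sacctmgr modify user foo set MaxSubmitJobs=0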

[slurm-users] Drain a single user's jobs

2020-04-01 Thread Mark Dixon
Hi all, I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster. I'd like to stop user foo from submitting new jobs but allow their existing jobs to run. We have several partitions, each with its own qos and MaxSubmitJobs typically set to some value. These qos are stopping a "sa…