Hello!
Our jobs can ask for dedicated per-node disk space, e.g. "--gres=tmp:1G",
where an ephemeral directory is managed by the site prolog/epilog and
usage is capped using an xfs project quota. This works well, although we
really need to look at job_container/tmpfs.
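In case it is useful to anyone, the prolog side can be sketched roughly as below. This is an illustration rather than our production script: the mount point, the choice of project id and the way the requested size is parsed out of the job record are all assumptions.

  #!/bin/bash
  # Prolog sketch: create a per-job scratch directory and cap its size
  # with an xfs project quota. Paths and parsing are illustrative only.
  MOUNT=/local                          # assumed xfs filesystem mounted with prjquota
  JOBDIR="$MOUNT/job_$SLURM_JOB_ID"
  PROJID="$SLURM_JOB_ID"                # reuse the numeric job id as the project id
  # e.g. pull "1G" out of a "TresPerNode=gres:tmp:1G" field; the exact
  # format of this line varies between Slurm versions.
  SIZE=$(scontrol show job "$SLURM_JOB_ID" | grep -oP 'tmp:\K[0-9]+[KMGT]?' | head -n1)
  mkdir -p "$JOBDIR"
  chown "$SLURM_JOB_UID" "$JOBDIR"
  xfs_quota -x -c "project -s -p $JOBDIR $PROJID" "$MOUNT"
  xfs_quota -x -c "limit -p bhard=${SIZE:-1G} $PROJID" "$MOUNT"

The matching epilog removes the directory and clears the limit again with "limit -p bhard=0" for the same project id.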
I note that slurm alread
In addition to the other suggestions, there's this:
https://slurm.schedmd.com/faq.html#tmpfs_jobcontainer
https://slurm.schedmd.com/job_container.conf.html
I would be interested in hearing how well it works - it's so buried in the
documentation that unfortunately I didn't see it until after I r
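For reference, the configuration is pleasantly short; something like the following is the general shape (the path is made up, and the job_container.conf man page lists the parameters your version supports):

  # slurm.conf
  JobContainerType=job_container/tmpfs
  PrologFlags=Contain

  # job_container.conf
  AutoBasePath=true
  BasePath=/local/slurm_tmpfs

Each job then gets a private /tmp and /dev/shm mounted under BasePath on the node.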
On Thu, 5 May 2022, Legato, John (NIH/NHLBI) [E] wrote:
...
We are in the process of upgrading from Slurm 21.08.6 to Slurm
21.08.8-2. We’ve upgraded the controller and a few partitions worth of
nodes. We notice the nodes are losing contact with the controller but
slurmd is still up. We thought
On Thu, 5 May 2022, Ole Holm Nielsen wrote:
...
You're right, the correct order for Configless is probably the following
(spelled out as commands below):
* stop slurmctld
* edit slurm.conf etc.
* start slurmctld
* restart the slurmd nodes to pick up new slurm.conf
See also slides 29-34 in
https://slurm.schedmd.com/SLUG21/Field_Notes_5
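In command form, assuming systemd units named slurmctld/slurmd and something like pdsh to fan out to the nodes (the node list is invented), that is just:

  # on the slurmctld host
  systemctl stop slurmctld
  $EDITOR /etc/slurm/slurm.conf        # plus any other config files
  systemctl start slurmctld
  # then on the compute nodes, so they re-fetch the new config
  pdsh -w cn[001-120] systemctl restart slurmd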
On Thu, 5 May 2022, Ole Holm Nielsen wrote:
...
That is correct. Just do "scontrol reconfig" on the slurmctld server. If
all your slurmd's are truly running Configless[1], they will pick up the
new config and reconfigure without restarting.
Details are summarized in
https://wiki.fysik.dtu.dk/n
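As a quick sketch, for a plain config change that is just:

  scontrol reconfig
  # confirm enable_configless is actually set on the controller:
  scontrol show config | grep -i configless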
Hi all,
Sorry for the noise, this was down to a problem with our configless setup.
Really must start running slurmd everywhere and get rid of the
compute-only version of slurm.conf...
Cheers,
Mark
On Mon, 13 Dec 2021, Mark Dixon wrote:
Hi all,
Just wondering if anyone
Hi all,
Just wondering if anyone else had seen this.
Running slurm 21.08.2, we're seeing srun work normally if it is able to
run immediately. However, if there is a delay in the job starting, for
example while waiting for another job to finish, srun fails, e.g.
[test@foo ~]$ srun -p test --pty bash
On Wed, 17 Nov 2021, Bjørn-Helge Mevik wrote:
...
We are using basically the same setup, and have not found any other way
than running "scontrol show job ..." in the prolog (even though it is
not recommended). I have yet to see any problems arising from it, but
YMMV.
If you find a different way
Hi everyone,
I'd like to configure slurm such that users can request an amount of disk
space for TMPDIR... and for that request to be reserved and quota'd via
commands like "sbatch --gres tmp:10G jobscript.sh". Probably reinventing
someone's wheel, but I'm almost there.
I have:
- created a
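(For anyone following the same route, the general shape of the GRES definition is something like the below; the node names and sizes are invented, not my actual config.)

  # slurm.conf
  GresTypes=tmp
  NodeName=cn[001-120] Gres=tmp:800G ...

  # gres.conf on each node
  Name=tmp Count=800G

  # jobs then request it with
  sbatch --gres=tmp:10G jobscript.sh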
-----Original Message-----
From: slurm-users On Behalf Of Mark Dixon
Sent: Wednesday, November 10, 2021 10:14
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] enable_configless, srun and DNS vs. hosts file
Hi,
I'm using the "
Hi,
I'm using the "enable_configless" mode to avoid the need for a shared
slurm.conf file, and am having similar trouble to others when running
"srun", e.g.
srun: error: fwd_tree_thread: can't find address for host cn120, check
slurm.conf
srun: error: Task launch for StepId=113.0 failed
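For context, the ingredients are roughly these (the IP address is invented). The controller offers configless via

  # slurm.conf on the slurmctld host
  SlurmctldParameters=enable_configless

but srun's fan-out still has to resolve every compute node's hostname itself, so each node (and login node) needs working DNS or an /etc/hosts entry along the lines of

  10.2.0.120   cn120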
ay to
continue the job while still 'fixing' the issue. That could be done in
the TaskEpilog script (assuming your daemon user has permissions to do so).
On 5/24/2021 8:56 AM, Mark Dixon wrote:
Hi Brian,
Thanks for replying. On our hardware, GPUs allocated to a job by
cgroup sometimes
ittle more.
On 5/24/2021 3:02 AM, Mark Dixon wrote:
Hi all,
Sometimes our compute nodes get into a failed state which we can only
detect from inside the job environment.
I can see that TaskProlog / TaskEpilog allows us to run our detection
test; however, unlike Epilog and Prolog, they do n
Hi all,
Sometimes our compute nodes get into a failed state which we can only
detect from inside the job environment.
I can see that TaskProlog / TaskEpilog allows us to run our detection
test; however, unlike Epilog and Prolog, they do not drain a node if they
exit with a non-zero exit code
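One possible pattern (not tested here) is to split the detection, which has to happen inside the job environment, from the drain, which needs privileges: let TaskEpilog drop a flag file and have the root-run Epilog act on it. The health-check path and flag location below are placeholders.

  #!/bin/bash
  # TaskEpilog (runs as the job user, inside the job environment):
  # record a failure somewhere the node Epilog can see it.
  /usr/local/sbin/check_node_health || touch "/tmp/node_health_failed.$SLURM_JOB_ID"
  exit 0

  #!/bin/bash
  # Epilog (runs as root on the node): drain if the flag file exists.
  if [ -e "/tmp/node_health_failed.$SLURM_JOB_ID" ]; then
      scontrol update nodename="$SLURMD_NODENAME" state=drain reason="in-job health check failed"
      rm -f "/tmp/node_health_failed.$SLURM_JOB_ID"
  fi
  exit 0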
d for it, but both sets of hardware were busy.
In this case, I think I can safely move back to scaling down the partition
charge :)
Cheers,
Mark
--
Mark Dixon Tel: +44(0)191 33 41383
Advanced Research Computing (ARC), Durham University, UK
On Thu, 17 Sep 2020, Paul Edmon wrote:
So the way we handle it is that we give a blanket fairshare to everyone, but
then dial in our TRES chargeback on a per-partition basis according to the
hardware. Our fairshare doc has a fuller explanation:
https://docs.rc.fas.harvard.edu/kb/fairshare/
-Paul Ed
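As a concrete (made-up) illustration of that pattern, the per-partition dial is TRESBillingWeights on each partition in slurm.conf; the weights below are invented, not Harvard's:

  PartitionName=cpu Nodes=cn[001-100] TRESBillingWeights="CPU=1.0,Mem=0.25G"
  PartitionName=gpu Nodes=gn[001-010] TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=8.0"

An allocation on the more expensive hardware then accrues billing TRES faster and draws down fairshare accordingly.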
d managing users, particularly
as I've not figured out how to define a group of users in one place (say,
a unix group) that I can then use multiple times.
Is there a better way, please?
Thanks,
Mark
--
Mark Dixon Tel: +44(0)191 33 41383
Advanced Research Computing (ARC), Durham University, UK
On Wed, 16 Sep 2020, Niels Carl W. Hansen wrote:
If you explicitly specify the account, e.g. 'sbatch -A myaccount',
then 'slurm.log_info("submit -- account %s", job_desc.account)'
works.
Great, thanks - that's working!
Of course I have other problems... :
On Wed, 16 Sep 2020, Diego Zuccato wrote:
...
From the source it seems these fields are available:
account
comment
direct_set_prio
gres
job_id      (always nil? Maybe no JobID yet)
job_state
licenses
max_cpus
max_nodes
min_
Hi all,
I'm trying to get started with the lua job_submit feature and I have a
really dumb question. This job_submit Lua script:
function slurm_job_submit( job_desc, part_list, submit_uid )
   slurm.log_info("submit called lua plugin")
   for k,v in pairs(job_desc) do
      slurm.log_info("%s = %s", k, tostring(v))
   end
   return slurm.SUCCESS
end
efault qos for foo's jobs:
"sacctmgr modify user foo set qos=drain defaultqos=drain"
And then update the qos on all of foo's waiting jobs.
I'll be using David's GrpSubmitJobs=0 suggestion instead.
Thanks for everyone's help,
Mark
On Wed, 1 Apr 2020, Mark Dixon
un to completion and pending jobs won't start
Antony
On Wed, 1 Apr 2020 at 10:57, Mark Dixon wrote:
Hi all,
I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster.
I'd like to stop user foo from submitting new jobs but allow their
existing jobs to run.
We hav
done wrong.
Best,
Mark
On Wed, 1 Apr 2020, mercan wrote:
Hi;
If you have a working job_submit.lua script, you can block new jobs from
the specific user:
   if job_desc.user_name == "baduser" then
      return 2045
   end
That's all!
Regards;
Ahmet M.
On 1.04.2020 at 16:22
we set the
GrpSubmitJobs limit on an account to 0, which allowed in-flight jobs to
continue but no new work to be submitted.
HTH,
David
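For the record, that is a one-liner with sacctmgr; the account and user names here are placeholders:

  # block new submissions for everyone in account foo_acct
  sacctmgr modify account where name=foo_acct set GrpSubmitJobs=0
  # or only for user foo's association under that account
  sacctmgr modify user where name=foo account=foo_acct set GrpSubmitJobs=0
  # setting the limit to -1 later clears it again

Running jobs are unaffected; only new submissions are rejected.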
On Wed, Apr 1, 2020 at 5:57 AM Mark Dixon wrote:
Hi all,
I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster.
I'd like to stop user foo
Hi all,
I'm a slurm newbie who has inherited a working slurm 16.05.10 cluster.
I'd like to stop user foo from submitting new jobs but allow their
existing jobs to run.
We have several partitions, each with its own qos and MaxSubmitJobs
typically set to some value. These qos are stopping a "sa