Weights & Biases, but that is code-specific: https://wandb.ai/site/ You can
also use scontrol -d show job
to print out the layout of a job, including which specific GPUs were
assigned.
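For example (the job ID is a placeholder; this needs a cluster with GPU GRES configured):

```
# -d adds the detailed allocation info; the GRES=gpu(IDX:...) field on the
# Nodes= line lists the specific GPU indices assigned to the job.
scontrol -d show job 12345
```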
-Paul Edmon-
On 4/2/25 9:17 AM, Jason Simms via slurm-users wrote:
Hello all,
Apologies for the basic
could submit to both the cpu
and the requeue partition (as slurm permits multipartition submissions)
and then the gpu partition won't be blocked by anything and you can farm
the spare GPU cycles. This works well for our needs.
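A sketch of such a multi-partition submission, with partition names as in the thread:

```
# Submit to both partitions; Slurm starts the job in whichever can run it first.
sbatch --partition=cpu,requeue job.sh
```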
-Paul Edmon-
On 3/31/2025 9:39 AM, Paul Raines via slurm-u
Have you looked at this?
https://slurm.schedmd.com/slurm.conf.html#OPT_job_env Note that it will
eat up a ton of space in the database, so be warned.
-Paul Edmon-
On 3/13/25 3:36 AM, Bhaskar Chakraborty via slurm-users wrote:
Hi everyone,
I have tried my best to extract custom job
You want: https://slurm.schedmd.com/scontrol.html#OPT_hostnames
-Paul Edmon-
On 1/6/2025 2:58 PM, Davide DelVento via slurm-users wrote:
Found it, I should have asked my Puppet config, as it's mandatory in some
places :-D
It is simply
scontrol show hostname gpu[01-02],node[03-04,12-22,27-
Sadly I don't have any deeper insight into the C API for that information.
-Paul Edmon-
On 11/3/2024 2:14 AM, Bhaskar Chakraborty wrote:
Hi Paul,
Thanks for the tip. Looking at the code it seems to compare each job’s
priority
variable through its job_ptr.
I did some experiments where
partition, then you
can reorder based on that. squeue also can print out current priority.
You might also look at the --priority option:
https://slurm.schedmd.com/squeue.html#OPT_priority
-Paul Edmon-
On 10/29/24 9:33 AM, Bhaskar Chakraborty via slurm-users wrote:
Hello,
Is there any DS in
You might need to do some tuning on your backfill loop as that loop
should be the one that backfills in those lower priority jobs. I would
also look to see if those lower priority jobs will actually fit in prior
to the higher priority job running, they may not.
-Paul Edmon-
On 9/24/24 2:19
;). If one or
more numeric expressions are included, one of them must be at the end of
the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can
always be used in a comma-separated list."
-Paul Edmon-
On 9/5/24 3:24 PM, Jackson, Gary L. via slurm-users wrote:
Is ther
It's definitely working for 23.11.8, which is what we are using.
-Paul Edmon-
On 9/5/24 10:22 AM, Loris Bennett via slurm-users wrote:
Jason Simms via slurm-users writes:
Ours works fine, however, without the InteractiveStepOptions parameter.
My assumption is also that default value should
Thanks. I've made that fix.
-Paul Edmon-
On 8/28/24 5:42 PM, Davide DelVento wrote:
Thanks everybody once again and especially Paul: your job_summary
script was exactly what I needed, served on a golden plate. I just had
to modify/customize the date range and change the following line (I
tion about this. Lots of great ideas.
-Paul Edmon-
On 8/9/24 12:04 PM, Jeffrey T Frey wrote:
You'd have to do this within e.g. the system's bashrc infrastructure. The
simplest idea would be to add to e.g. /etc/profile.d/zzz-slurmstats.sh and have
some canned commands/scripts
e use
Reframe for our testing: https://github.com/fasrc/reframe-fasrc).
-Paul Edmon-
On 8/26/2024 3:28 PM, Ole Holm Nielsen via slurm-users wrote:
On 26-08-2024 20:30, Paul Edmon via slurm-users wrote:
I haven't seen any behavior like that. For reference we are running
Rocky 8.9 with MOFED 23.10.2
-Paul Edmon-
On 8/26/2024 2:23 PM, Ole Holm Nielsen via slurm-users wrote:
Hi Paul,
On 26-08-2024 15:29, Paul Edmon via slurm-users wrote:
We've had this exact hardware for years no
issue.
That said, you are free to reboot either node without loss of
connectivity. We do that all the time with no issues. As noted, though, if
you want to actually physically service the nodes, then you have to take
out both.
-Paul Edmon-
On 8/26/2024 8:51 AM, Ole Holm Nielsen via slurm-users
Containers
-Paul Edmon-
On 8/23/24 2:21 PM, wdennis--- via slurm-users wrote:
We are getting a few calls to support container workloads on our Slurm cluster;
I want to support these users' use cases, so am looking into it now.
The problem for me is, I'm not super-familiar with containe
Ah, that's even more fun. I know with Singularity you can launch MPI
applications by calling MPI outside of the container and then having it
link to the internal version:
https://docs.sylabs.io/guides/3.3/user-guide/mpi.html Not sure about
docker though.
-Paul Edmon-
On 8/12/2024 10:
hostlist, your ranks may not end up properly bound to the
specific cores they are supposed to be allocated. So definitely proceed
with caution and validate your ranks are being laid out properly, as you
will be relying on mpirun/mpiexec to bootstrap rather than the scheduler.
-Paul Edmon-
On 8
l way to do it if you need to would be the scontrol show
hostnames command against the $SLURM_JOB_NODELIST
(https://slurm.schedmd.com/scontrol.html#OPT_hostnames). That will give
you the list of hosts your job is set to run on.
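In a batch script that might look like the following sketch (it only works inside a running job, where SLURM_JOB_NODELIST is set):

```
#!/bin/bash
#SBATCH --nodes=2
# Expand the compact nodelist (e.g. node[01-02]) into one hostname per line.
scontrol show hostnames "$SLURM_JOB_NODELIST"
```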
-Paul Edmon-
On 8/12/2024 8:34 AM, Jeffrey Layton via slurm-users
as a
environmental variable.
-Paul Edmon-
On 8/9/2024 12:34 PM, Jeffrey Layton via slurm-users wrote:
Good afternoon,
I know this question has been asked a million times, but what is the
canonical way to convert the list of nodes for a job that is container
in a Slurm variable,
Yup, we have that installed already. It's been very beneficial for
overall monitoring.
-Paul Edmon-
On 8/9/2024 12:27 PM, Reid, Andrew C.E. (Fed) wrote:
Maybe a heavier lift than you had in mind, but check
out xdmod, open.xdmod.org.
It was developed by the NSF as part of th
Yeah, I was contemplating doing that so I didn't have a dependency on
the scheduler being up or down or busy.
What I was more curious about is whether anyone had any prebaked scripts for
that.
-Paul Edmon-
On 8/9/2024 12:04 PM, Jeffrey T Frey wrote:
You'd have to do this withi
curious what other sites do and if they would be willing to
share their scripts and methodology.
-Paul Edmon-
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
I think this would be a good feature request. At least to me everything
you can get in scontrol show job should be in sacct in some form.
-Paul Edmon-
On 8/7/2024 9:29 AM, Steffen Grunewald wrote:
On Wed, 2024-08-07 at 08:55:21 -0400, Slurm users wrote:
Warning on that one, it can eat up a
.
-Paul Edmon-
On 8/7/2024 8:51 AM, Juergen Salk via slurm-users wrote:
Hi Steffen,
not sure if this is what you are looking for, but with
`AccountingStoreFlags=job_env´
set in slurm.conf, the batch job environment will be stored in the
accounting database and can later be retrieved with `sacct -j
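A minimal sketch of that flow (the job ID is a placeholder, and sacct's --env-vars option is my assumption for the retrieval side; check your version's sacct man page):

```
# slurm.conf (restart slurmctld/slurmdbd after changing):
#   AccountingStoreFlags=job_env
# Later, retrieve the stored environment for a job:
sacct -j 12345 --env-vars
```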
That looks to be the case from my glance at sacct. Not everything in
scontrol show job ends up in sacct, which is a bit frustrating at times.
-Paul Edmon-
On 8/7/2024 8:08 AM, Steffen Grunewald via slurm-users wrote:
Hello everyone,
I've grepped the manual pages and crawled the
when the job ends the user's session will also end. However,
if the user has no job on that node, then they can ssh as normal to that
host without any problem.
-Paul Edmon-
On 7/8/2024 5:48 PM, Chris Taylor via slurm-users wrote:
On my Rocky9 cluster I got this to work fine also-
Added a
https://slurm.schedmd.com/upgrades.html#compatibility_window
Looks like no. You have to be within two major releases.
-Paul Edmon-
On 6/17/24 5:40 AM, ivgeokig via slurm-users wrote:
Hello!
I have a question. I have the server 19.05.3. No chance to upgrade
it. Have I any chance to
There is no way to do it in slurm. You have to do it in the mail program
you are using to send mail. In our case we use postfix and we set
smtp_generic_maps to accomplish this.
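A sketch of the postfix side (the map file path is conventional, the addresses are hypothetical):

```
# /etc/postfix/main.cf:
#   smtp_generic_maps = hash:/etc/postfix/generic
# /etc/postfix/generic, rewriting the local sender on outbound mail:
#   slurm@node01.cluster.local    hpc-alerts@example.org
# Rebuild the lookup table and reload after editing:
postmap /etc/postfix/generic && postfix reload
```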
-Paul Edmon-
On 6/7/2024 3:33 PM, Vanhorn, Mike via slurm-users wrote:
All,
When the slurm daemon is sending out
u are using a QoS to manage
this (which I am assuming you are), I would use sacctmgr.
As for a framework that does the state inspection, I'm not aware of one.
You could do it via cron and batch scripts to do the state inspection. I
don't know if someone has something more sophisticated
A friend asked me to pass this along. Figured some folks on this list
might be interested.
https://broadinstitute.avature.net/en_US/careers/JobDetail/HPC-Principal-System-Engineer/17773
-Paul Edmon-
Usually to clear jobs like this you have to reboot the node they are on.
That will then force the scheduler to clear them.
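A gentler variant than a hard power cycle (node name hypothetical):

```
# Reboot the node as soon as it is idle and return it to service afterwards.
scontrol reboot ASAP nextstate=RESUME node01
```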
-Paul Edmon-
On 4/10/2024 2:56 AM, archisman.pathak--- via slurm-users wrote:
We are running a slurm cluster with version `slurm 22.05.8`. One of our users
has reported
it to force jobs
to one side of the partition, though generally the scheduler does this
automatically.
-Paul Edmon-
On 4/9/24 6:45 AM, Cutts, Tim via slurm-users wrote:
Agree with that. Plus, of course, even if the jobs run a bit slower
by not having all the cores on a single node, they wi
would be my recommendation. This is how we
handle fairshare at FASRC: https://docs.rc.fas.harvard.edu/kb/fairshare/
As we use Classic Fairshare. You will need to enable this:
https://slurm.schedmd.com/slurm.conf.html#OPT_NO_FAIR_TREE as Fair Tree
is on by default.
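As a slurm.conf sketch (per the linked doc, NO_FAIR_TREE is a PriorityFlags value):

```
# slurm.conf: use Classic Fairshare instead of the default Fair Tree
#   PriorityType=priority/multifactor
#   PriorityFlags=NO_FAIR_TREE
```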
-Paul Edmon-
On 3/27/2024 9
utput for slurm
partition information
stdg: https://github.com/fasrc/stdg Slurm test deck generator
prometheus-slurm-exporter:
https://github.com/fasrc/prometheus-slurm-exporter Slurm exporters for
prometheus
Hopefully people find these useful. Pull requests are always appreciated.
-Paul
He's talking about recent versions of Slurm which now have this option:
https://slurm.schedmd.com/slurm.conf.html#OPT_use_interactive_step
-Paul Edmon-
On 2/28/2024 10:46 AM, Paul Raines wrote:
What do you mean "operate via the normal command line"? When
you salloc, you a
but swapped to
salloc a few years back and haven't had any issues.
-Paul Edmon-
On 2/28/2024 10:17 AM, wdennis--- via slurm-users wrote:
Hi list,
In our institution, our instructions to users who want to spawn an interactive job (for us, a bash shell) have always been to do "srun
...&qu
poses. So we haven't heavily invested in a high speed
ethernet backbone but instead invested in IB.
To invest in both seems to me to be overkill, you should focus on one or
the other unless you have the cash to spend and a good use case.
-Paul Edmon-
On 2/26/24 7:07 AM, Dan Healy via s
Are you using the job_script storage option? If so then you should be
able to get at it by doing:
sacct -B -j JOBID
https://slurm.schedmd.com/sacct.html#OPT_batch-script
-Paul Edmon-
On 2/16/2024 2:41 PM, Jason Simms via slurm-users wrote:
Hello all,
I've used the "scon
You probably want the Prolog option:
https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog along with:
https://slurm.schedmd.com/slurm.conf.html#OPT_ForceRequeueOnFail
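A slurm.conf sketch (the prolog path is hypothetical, and I'm assuming, per the linked doc, that ForceRequeueOnFail is set via PrologFlags):

```
# slurm.conf:
#   Prolog=/etc/slurm/prolog.sh
#   PrologFlags=ForceRequeueOnFail
# If the prolog exits non-zero, the job is requeued rather than killed.
```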
-Paul Edmon-
On 2/14/2024 8:38 AM, Cutts, Tim via slurm-users wrote:
Hi, I apologise if I’ve failed to find this in the
t is some
obscure option.
-Paul Edmon-
On 1/29/2024 9:25 AM, Loris Bennett wrote:
Hi,
I seem to remember that in the past, if a node was configured to be in
two partitions, the actual partition of the node was determined by the
partition associated with the jobs running on it. Moreover, at an
ry setting that default of PreemptMode=CANCEL and then set
specific PreemptModes for all your partitions. That's what we do and it
works for us.
-Paul Edmon-
On 1/12/2024 10:33 AM, Davide DelVento wrote:
Thanks Paul,
I don't understand what you mean by having a typo somewhere. I mean,
At least in the example you are showing you have PreemptType commented
out, which means it will return the default. PreemptMode Cancel should
work, I don't see anything in the documentation that indicates it
wouldn't. So I suspect you have a typo somewhere in your conf.
-Paul Edmon
will work best for the policy you want to implement.
-Paul Edmon-
On 1/9/2024 10:43 AM, Kenneth Chiu wrote:
I'm just learning about slurm. I understand that different different
partitions can be prioritized separately, and can have different max
time limits. I was wondering whether or not t
t. A
partition would be all or nothing for a node so that would not work.
-Paul Edmon-
On 12/15/23 12:16 PM, Jason Simms wrote:
Hello all,
At least at one point, I understood that it was not particularly
possible, or at least not elegant, to provide priority preempt access
to a specific GPU
We've been running for years without swap on with no issues. You may
want to set MemSpecLimit in your config to reserve memory for the OS, so
that way you don't OOM the system with user jobs:
https://slurm.schedmd.com/slurm.conf.html#OPT_MemSpecLimit
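In the node definition that might look like (values hypothetical):

```
# slurm.conf: reserve 4 GiB for the OS; jobs can use at most
# RealMemory minus MemSpecLimit.
#   NodeName=node01 CPUs=64 RealMemory=256000 MemSpecLimit=4096
```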
-Paul Edmon-
On 12/11/202
You will probably need to.
The way we handle it is that we add users when they first submit a job
via the job_submit.lua script. This way the database autopopulates with
active users.
-Paul Edmon-
On 10/3/23 9:01 AM, Davide DelVento wrote:
By increasing the slurmdbd verbosity level, I got
At least in our setup, users can see their own scripts by doing sacct -B
-j JOBID
I would make sure that the scripts are being stored and how you have
PrivateData set.
-Paul Edmon-
On 10/2/2023 10:57 AM, Davide DelVento wrote:
I deployed the job_script archival and it is working, however it
paranoia
we generally stop everything. The entire process takes about an hour start
to finish, with the longest part being the pausing of all the jobs.
-Paul Edmon-
On 9/29/2023 9:48 AM, Groner, Rob wrote:
I did already see the upgrade section of Jason's talk, but it wasn't
much abo
ssion which helps with the on disk size. Raw uncompressed our
database is about 90G. We keep 6 months of data in our active database.
-Paul Edmon-
On 9/28/2023 1:57 PM, Ryan Novosielski wrote:
Sorry for the duplicate e-mail in a short time: do you know (or
anyone) when the hashing was added
job_scripts as they are functionally the same and thus
you have many jobs pointed to the same script, but less so for job_envs.
-Paul Edmon-
On 9/28/2023 1:55 PM, Ryan Novosielski wrote:
Thank you; we’ll put in a feature request for improvements in that
area, and also thanks for the warning? I thought of
of them if they get large
is to zero out the column in the table. You can ask SchedMD for the mysql
command to do this as we had to do it here to our job_envs.
-Paul Edmon-
On 9/28/2023 1:40 PM, Davide DelVento wrote:
In my current slurm installation, (recently upgraded to slurm
v23.02.3), I only
You might also try swapping to use srun instead of mpiexec as that way
slurm can give more direction as to what cores have been allocated to
what. I've found it in the past that mpiexec will ignore what Slurm
tells it.
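In a batch script the swap is just (application name hypothetical):

```
#!/bin/bash
#SBATCH --ntasks=8
# Let Slurm launch and bind the ranks instead of mpiexec/mpirun:
srun ./my_mpi_app
```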
-Paul Edmon-
On 9/22/23 8:24 AM, Lambers, Martin wrote:
Hello,
for
I would recommend standing up an instance of XDMod as it handles most of
this for you in its summary reports.
https://open.xdmod.org/10.0/index.html
-Paul Edmon-
On 5/3/23 2:05 PM, Joseph Francisco Guzman wrote:
Good morning,
We have at least one billed account right now, where the
We do this for our Infiniband set up. What we do is that we populate
/etc/hosts with the hostname mapped to the IP we want Slurm to use.
This way you get IP traffic traversing the address you want between
nodes while not having to mess with DNS.
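A sketch of the mapping (names and addresses hypothetical); in production these lines go in /etc/hosts on every node, but a demo file stands in here so the lookup can be shown:

```shell
# Map each node's hostname to its Infiniband IP so inter-node traffic
# traverses the IB fabric.
cat > /tmp/hosts.demo <<'EOF'
10.31.0.11  node01
10.31.0.12  node02
EOF
# A resolver consulting this file returns the IB address for node01:
awk '$2 == "node01" {print $1}' /tmp/hosts.demo
```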
-Paul Edmon-
On 3/14/2023 12:19 AM, Purvesh
We have a gitlab runner that fires up a docker container that basically
starts up a mini scheduler (slurmdbd and slurmctld) to confirm that both
can start. It covers most bases but we would like to see an official
syntax checker (https://bugs.schedmd.com/show_bug.cgi?id=3435).
-Paul Edmon
The symlink method for slurm.conf is what we do as well. We have a NFS
mount from the slurm master that we host the slurm.conf on that we then
symlink slurm.conf to that NFS share.
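A sketch with illustrative paths (/tmp/nfs/slurm stands in for the NFS share exported from the Slurm master; /etc/slurm/slurm.conf is the real link target on each node):

```shell
# Each node's slurm.conf is a symlink into the share, so every node
# reads the same file.
mkdir -p /tmp/nfs/slurm /tmp/etc/slurm
echo "ClusterName=demo" > /tmp/nfs/slurm/slurm.conf
ln -sf /tmp/nfs/slurm/slurm.conf /tmp/etc/slurm/slurm.conf
# readlink shows the indirection:
readlink /tmp/etc/slurm/slurm.conf
```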
-Paul Edmon-
On 1/4/2023 1:53 PM, Brian Andrus wrote:
One of the simple ways I have dealt with different
The seff utility (in slurm-contribs) also gives good summary info.
You can also use --parsable to make things more manageable.
-Paul Edmon-
On 12/14/22 3:41 PM, Ross Dickson wrote:
I wrote a simple Python script to transpose the output of sacct from a
row into a column. See if it meets your
Yeah, our spec is based off of their spec with our own additional
features plugged in.
-Paul Edmon-
On 12/2/22 2:12 PM, David Thompson wrote:
Hi Paul, thanks for passing that along. The error I saw was coming
from the rpmbuild %check stage in the el9/fc38 builds, which your
.spec file
Yup, here is the spec we use that works for CentOS 7, Rocky 8, and Alma 8.
-Paul Edmon-
On 12/2/22 12:21 PM, David Thompson wrote:
Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8
Slurm cluster. We would like to be able to use the sbatch –prefer
option, which isn’t
It only happens for versions on the 22.05 series prior to the latest
release (22.05.5). So the 21 version isn't impacted and you should be
fine to upgrade from 21 to 22.05.5 and not see the hash_k12 issue. If
you upgrade to any prior minor version though you will hit this issue.
-Paul
the HA setup for slurmctld will protect you from the server hosting
the slurmctld getting hosed, not the entire rack going down or the
datacenter going down.
-Paul Edmon-
On 10/24/2022 4:14 AM, Ole Holm Nielsen wrote:
On 10/24/22 09:57, Diego Zuccato wrote:
Il 24/10/2022 09:32, Ole Holm
The slurmctld log will print out if hosts are out of sync with the
slurmctld slurm.conf. That said it doesn't report on cgroup consistency
changes like that. It's possible that dialing up the verbosity on the
slurmd logs may give that info but I haven't seen it in normal ope
our database is bigger than that.
-Paul Edmon-
On 9/25/22 5:18 PM, byron wrote:
Hi
Does anyone know what is the recommended amount of memory to give
slurms mariadb database server?
I seem to remember reading a simple estimate based on the size of
certain tables (or something along those
We also call scontrol in our scripts (as little as we can manage) and we
run at the scale of 1500 nodes. It hasn't really caused many issues,
but we try to limit it as much as we possibly can.
-Paul Edmon-
On 9/16/22 9:41 AM, Sebastian Potthoff wrote:
Hi Hermann,
So you both are ha
But not any 20. There are two 20.x versions, 20.02 and 20.11, and there was a
previous 19.05. So two versions ahead of 18.08 would be 20.02, not 20.11.
-Paul Edmon-
On 9/8/22 12:14 PM, Wadud Miah wrote:
The previous version was 18 and now I am trying to upgrade to 20, so I
am well within 2 major
Typically slurm only supports upgrading between 2 major versions ahead.
If you are on 18.08 you likely can only go to 20.02. Then after you
upgrade to 20.02 you can go to 20.11 or 21.08.
-Paul Edmon-
On 9/8/22 11:38 AM, Wadud Miah wrote:
hi Mick,
I have checked that all the compute nodes
I've regularly upgraded the MariaDB version without upgrading the Slurm
version with no issue. We are currently running MariaDB 10.6.7 on
CentOS 7.9 with Slurm 22.05.2. So long as you do the mysql_upgrade
after the upgrade and have a backup just in case, you should be fine.
-Paul
True. Though be aware that Slurm will by default map the environment
from login nodes to compute. That's the real thing that matters. So as
long as the environment is set up properly, any filesystems excluding the
home directory do not need to be mounted on login.
-Paul Edmon-
On 8/2
No, the node running the slurmctld does not need access to any of the
customer facing filesystems or home directories. While all the login
and client nodes do, the slurmctld does not.
-Paul Edmon-
On 8/2/2022 9:30 AM, Richard Chang wrote:
Hi,
I am new to SLURM, so please bear with me.
I
ter=6month
PurgeTXNAfter=6month
PurgeUsageAfter=6month
-Paul Edmon-
On 7/15/2022 2:08 AM, Ole Holm Nielsen wrote:
Hi Paul,
On 7/14/22 15:10, Paul Edmon wrote:
We just use the Archive function built into slurm. That has worked
fine for us for the past 6 years. We keep 6 months of data in the
acti
22.05 so that it is more efficient but getting from
here to there is the trick.
For details see the bug report we filed:
https://bugs.schedmd.com/show_bug.cgi?id=14514
-Paul Edmon-
On 7/14/2022 2:34 PM, Timony, Mick wrote:
What I can tell you is that we have never had a problem
cripts and envs.
-Paul Edmon-
On 7/14/2022 12:55 PM, Timony, Mick wrote:
Hi Paul
If you have 6 years worth of data and you want to prune down to 2
years, I recommend going month by month rather than doing it in
one go. When we initially started archiving data several years
back
archive one month at a time which allowed it to get done in a
reasonable amount of time.
The archived data can be pulled into a different slurm database, which
is what we do for importing historic data into our XDMod instance.
-Paul Edmon-
On 7/13/2022 4:55 PM, Timony, Mick wrote:
Hi Slurm
sorts of problems.
-Paul Edmon-
On 5/17/22 2:50 PM, Ole Holm Nielsen wrote:
Hi,
You can upgrade from 19.05 to 20.11 in one step (2 major releases),
skipping 20.02. When that is completed, it is recommended to upgrade
again from 20.11 to 21.08.8 in order to get the current major version.
The
I think it should be, but you should be able to run a test and find out.
-Paul Edmon-
On 5/17/22 12:13 PM, byron wrote:
Sorry, I should have been clearer. I understand that with regards to
slurmd / slurmctld you can skip a major release without impacting
running jobs etc. My questions was
s they can hand out if
you are bootstrapping to a newer release.
-Paul Edmon-
On 5/17/22 11:42 AM, byron wrote:
Thanks Brian for the speedy responce.
Am I not correct in thinking that if I just go from 19.05 to 20.11
then there is the advantage that I can upgrade slurmd and slurmctld in
one
They fixed this in newer versions of Slurm. We had the same issue with
older versions, so we had to run with the config_override option on to
keep the logs quiet. They changed the way logging was done in the more
recent releases and it's not as chatty.
-Paul Edmon-
On 5/12/22 7:35 AM, Per
We upgraded from 21.08.6 to 21.08.8-1 yesterday morning but overnight we
saw the communications issues described by Tim W. We upgraded to
21.08.8-2 this morning and that did the trick to resolve all the
communications problems we were having.
-Paul Edmon-
On 5/6/2022 4:38 AM, Ole Holm
them when you absolutely have no other workaround, then you should be fine.
-Paul Edmon-
On 5/3/2022 3:46 AM, taleinterve...@sjtu.edu.cn wrote:
Hi, all:
We need to detect some problem at job end timepoint, so we write some
detection script in slurm epilog, which should drain the node if chec
tting hard limits for each user.
-Paul Edmon-
On 4/12/2022 8:55 AM, Chagai Nota wrote:
Hi Loris
Thanks for your answer.
I tired to configure it and I didn't get desired results.
This is my configuration:
PriorityType=priority/multifactor
PriorityDecayHalfLife=0
PriorityUsageRe
I think you could do this by clever use of a partition level QoS but I
don't have an obvious way of doing this.
-Paul Edmon-
On 3/22/2022 11:40 AM, Russell Jones wrote:
Hi all,
For various reasons, we need to limit a partition to being able to run
max 1 job at a time. Not 1 job per
older versions of MPI):
https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/NEWS What we've
recommended to users who have hit this was to swap over to using srun
instead of mpirun and the situation clears up.
-Paul Edmon-
On 2/10/2022 8:59 AM, Ward Poelmans wrote:
Hi Paul,
On 10/02/20
, the specified memory will only be unavailable for user
allocations.
These will restrict specific memory and cores for system use. This is
probably the best way to go rather than spoofing your config.
-Paul Edmon-
On 1/7/2022 2:36 AM, Rémi Palancher wrote:
Le jeudi 6 janvier 2022 à 22:39,
You can actually spoof the number of cores and RAM on a node by using
the config_override option. I've used that before for testing
purposes. Mind you, core binding and other features like that will not
work if you start spoofing the number of cores and RAM, so use with caution.
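A slurm.conf sketch (testing only; the node definition is hypothetical, and the spelling SlurmdParameters=config_overrides is my assumption for recent releases; older ones used FastSchedule=2, so check your version's docs):

```
# slurm.conf: accept the configured values even if they do not match
# the hardware. Core binding will not work with spoofed counts.
#   SlurmdParameters=config_overrides
#   NodeName=testnode CPUs=128 RealMemory=512000
```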
-Paul
Just out of curiosity, is there a reason you aren't just doing a
mysqldump of the extant DB and then reimporting it?
I'm not aware of a way to dump just the qos settings for import other than:
sacctmgr show qos
-Paul Edmon-
On 12/17/2021 10:24 AM, Williams, Jenny Avis wrote:
Sac
ably
ping SchedMD as to any limitations they are aware of. Usually they are
pretty good about being comprehensive in their docs so they would have
probably mentioned it if there was one.
-Paul Edmon-
On 12/13/2021 5:07 AM, Loris Bennett wrote:
Hi Paul,
Am I right in assuming that there are g
is writing your sql into the database.
So you could set up a full mirror and then read the old archives into
that. You just want to make sure that mirror has archiving/purging
turned off so it won't rearchive the data you restored.
-Paul Edmon-
On 12/10/2021 1:28 PM, Ransom, Geoff
e dump and reimport will take a while
(for me it was about 4 hours start to finish on my test system).
-Paul Edmon-
On 12/2/2021 1:06 PM, Baer, Troy wrote:
My site has just updated to Slurm 21.08 and we are looking at moving to the
built-in job script capture capability, so I'm curiou
also have
all our internode IP comms going over our IB fabric and it works fine.
-Paul Edmon-
On 12/7/2021 11:05 AM, David Baker wrote:
Hello,
These days we have now enabled topology aware scheduling on our Slurm
cluster. One part of the cluster consists of two racks of AMD compute
no
I would check that you have MariaDB-shared installed on the host you
build on prior to your build. They changed the way the packaging is done
in MariaDB, and Slurm needs to detect the files in MariaDB-shared to
actually trigger the configure step to build the mysql libs.
-Paul Edmon-
On 12/3
*PreemptMode* for this partition. It can
be set to OFF to disable preemption and gang scheduling for this
partition. See also *PriorityTier* and the above description of the
cluster-wide *PreemptMode* parameter for further details.
This is at least how we manage that.
-Paul Edmon-
On
g all the jobs and scheduling this is somewhat
mitigated, though jobs will still exit due to timeout.
-Paul Edmon-
On 10/25/2021 4:47 AM, Alan Orth wrote:
Dear Jurgen and Paul,
This is an interesting strategy, thanks for sharing. So if I read the
scontrol man page correctly, `scontrol su
Yup, we follow the same process for when we do Slurm upgrades, this
looks analogous to our process.
-Paul Edmon-
On 10/19/2021 3:06 PM, Juergen Salk wrote:
Dear all,
we are planning to perform some maintenance work on our Lustre file system
which may or may not harm running jobs. Although
then have it reject any
changes that cause failure. It's not perfect but it works. A real
syntax checker would be better.
-Paul Edmon-
On 10/12/2021 4:08 PM, bbenede...@goodyear.com wrote:
Is there any sort of syntax checker that we could run our slurm.conf file
through before com
ernal to
an account/group/lab? What solutions have people used for this?
-Paul Edmon-
I think you can accomplish this by setting a Partition QoS and defining it
to hook into the same QoS for all of them. I believe that would force it
to share the same pool.
That said, I don't know if that would work properly; it's worth a test.
That is my first guess though.
-Paul Edmon-
O
it's the sum total of all the TRES a Group could run in a
partition at one time.
-Paul Edmon-
On 8/2/2021 12:05 PM, Adrian Sevcenco wrote:
On 8/2/21 6:26 PM, Paul Edmon wrote:
Probably more like
MaxTRESPerJob=cpu=8
i see, thanks!!
i'm still searching for the definition of GrpTRES :)
T
Probably more like
MaxTRESPerJob=cpu=8
You would need to specify how much TRES you need for each job in the
normal tres format.
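One hedged way to wire that up (the QoS and partition names are hypothetical; this needs a running slurmdbd):

```
# Create the QoS and cap every job at 8 cores:
sacctmgr add qos cpu8
sacctmgr modify qos cpu8 set MaxTRESPerJob=cpu=8
# Then attach it in slurm.conf:
#   PartitionName=normal ... QOS=cpu8
```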
-Paul Edmon-
On 8/2/2021 11:24 AM, Adrian Sevcenco wrote:
On 8/2/21 5:44 PM, Paul Edmon wrote:
You can set up a Partition based QoS that can set this limit
You can set up a Partition based QoS that can set this limit:
https://slurm.schedmd.com/resource_limits.html See the MaxTRESPerJob limit.
-Paul Edmon-
On 8/2/2021 10:40 AM, Adrian Sevcenco wrote:
Hi! Is there a way to declare that jobs can request up to 8 cores?
Or is it allowed by default
Not in the current version of Slurm. In the next major version long
term storage of job scripts will be available.
-Paul Edmon-
On 7/16/2021 2:16 PM, David Henkemeyer wrote:
If I execute a bunch of sbatch commands, can I use sacct (or something
else) to show me the original sbatch command