Hiya,
On 4/15/25 7:03 pm, lyz--- via slurm-users wrote:
Hi, Chris. Thank you for continuing to pay attention to this issue.
I followed your instruction, and this is the output:
[root@head1 ~]# systemctl cat slurmd | fgrep Delegate
Delegate=yes
That looks good to me, thanks for sharing that!
On 4/15/25 6:57 pm, lyz--- via slurm-users wrote:
Hi, Sean. It's the latest slurm version.
[root@head1 ~]# sinfo --version
slurm 22.05.3
That's quite old (and no longer supported); the oldest still-supported
version is 23.11.10, and 24.11.4 came out recently.
What does the cgroup.conf file o
On 4/15/25 12:55 pm, Sean Crosby via slurm-users wrote:
What version of Slurm are you running and what's the contents of your
gres.conf file?
Also what does this say?
systemctl cat slurmd | fgrep Delegate
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 4/14/25 6:27 am, lyz--- via slurm-users wrote:
This command is intended to limit user 'lyz' to using a maximum of 2 GPUs. However, when the user
submits a job using srun, specifying CUDA 0, 1, 2, and 3 in the job script, or
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3", the job still utili
Hi Steven,
On 4/9/25 5:00 pm, Steven Jones via slurm-users wrote:
Apr 10 10:28:52 vuwunicohpcdbp1.ods.vuw.ac.nz slurmdbd[2413]: slurmdbd:
fatal: This host not configured to run SlurmDBD ((vuwunicohpcdbp1 or
vuwunicohp>
^^^ that's the critical error message, and it's reporting that because
s
On 3/4/25 5:23 pm, Steven Jones via slurm-users wrote:
However mysql -u slurm -p works just fine so it seems to be a config
error for slurmdbd
Try:
mysql -h 127.0.0.1 -u slurm -p
IIRC without that it'll try a UNIX domain socket and not try and connect
via TCP/IP.
--
Chris Samuel : h
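For context, the storage settings this touches live in slurmdbd.conf; a minimal sketch (host, credentials and database name are placeholders) of pointing slurmdbd at TCP rather than the UNIX socket it uses for "localhost":
  # slurmdbd.conf (illustrative values only)
  StorageType=accounting_storage/mysql
  StorageHost=127.0.0.1   # forces a TCP connection instead of the "localhost" socket
  StoragePort=3306
  StorageUser=slurm
  StoragePass=CHANGEME
  StorageLoc=slurm_acct_db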
On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:
I observed similar symptoms when we had issues with the shared Lustre
file system. When the file system couldn't complete an I/O operation,
the process in Slurm remained in the CG state until the file system
became responsive again. An a
On 2/3/25 2:33 pm, Steven Jones via slurm-users wrote:
Just built 4 x rocky9 nodes and I do not get that error (but I get
another I know how to fix, I think) so holistically I am thinking the
version difference is too large.
Oh I think I missed this - when you say version difference do you m
On 11/27/24 11:38 am, Kent L. Hanson via slurm-users wrote:
I have restarted the slurmctld and slurmd services several times. I
hashed the slurm.conf files. They are the same. I ran “sinfo -a” as root
with the same result.
Are your nodes in the `FUTURE` state perhaps? What does this show?
si
On 10/28/24 10:56 am, Bhaskar Chakraborty via slurm-users wrote:
Is there an option in Slurm to launch a custom script at the time of job
submission through sbatch or salloc? The script should run with the submitting
user's permissions in the submission area.
I think you are after the cli_filter functionality w
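For anyone searching later, a minimal sketch of wiring that up (assuming the Lua variant of the plugin):
  # slurm.conf
  CliFilterPlugins=cli_filter/lua
  # The plugin then uses a cli_filter.lua placed alongside slurm.conf; it runs
  # client-side, in the submitting user's context, before the request reaches slurmctld.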
Hi Ole,
On 10/22/24 11:04 am, Ole Holm Nielsen via slurm-users wrote:
Some time ago it was recommended that UnkillableStepTimeout values above
127 (or 256?) should not be used, see https://support.schedmd.com/
show_bug.cgi?id=11103. I don't know if this restriction is still valid
with recent
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:
It seems like there's an issue with the termination process on these nodes. Any
thoughts on what could be causing this?
That usually means processes wedged in the kernel for some reason, in an
uninterruptible sleep state. You can define
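The relevant knobs are in slurm.conf; a minimal sketch (script path and timeout are illustrative):
  UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh   # run when a step can't be killed
  UnkillableStepTimeout=120                                  # seconds to wait before declaring it unkillable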
On 8/15/24 7:04 am, jpuerto--- via slurm-users wrote:
I am referring to the REST API. We have had it installed for a few years and have
recently upgraded it so that we can use v0.0.40. But this most recent version is missing
the "get_user_environment" field which existed in previous versions.
G'day Sid,
On 7/31/24 5:02 pm, Sid Young via slurm-users wrote:
I've been waiting for nodes to become idle before upgrading them, however
some jobs take a long time. If I try to remove all the packages I assume
that kills the slurmstepd program and with it the job.
Are you looking to do a Slurm
On 6/21/24 3:50 am, Arnuld via slurm-users wrote:
I have 3500+ GPU cores available. You mean each GPU job requires at
least one CPU? Can't we run a job with just GPU without any CPUs?
No, Slurm has to launch the batch script on compute node cores and it
then has the job of launching the users
On 6/17/24 7:24 am, Bjørn-Helge Mevik via slurm-users wrote:
Also, server must be newer than client.
This is the major issue for the OP - the version rule is:
slurmdbd >= slurmctld >= slurmd and clients
and no more than the permitted skew in versions.
Plus, of course, you have to deal with
On 5/22/24 3:33 pm, Brian Andrus via slurm-users wrote:
A simple example is when you have nodes with and without GPUs.
You can build slurmd packages without GPU support for those nodes and with it
for the ones that have them.
FWIW we have both GPU and non-GPU nodes but we use the same RPMs we
build on both
Hi Jeff!
On 5/15/24 10:35 am, Jeffrey Layton via slurm-users wrote:
I have an Ubuntu 22.04 server where I installed Slurm from the Ubuntu
packages. I now want to install pyxis but it says I need the Slurm
sources. In Ubuntu 22.04, is there a package that has the source code?
How to download t
On 5/6/24 3:19 pm, Nuno Teixeira via slurm-users wrote:
Fixed with:
[...]
Thanks and sorry for the noise as I really missed this detail :)
So glad it helped! Best of luck with this work.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 5/6/24 6:38 am, Nuno Teixeira via slurm-users wrote:
Any clues about "elf_aarch64" and "aarch64elf" mismatch?
As I mentioned I think this is coming from the FreeBSD patching that's
being done to the upstream Slurm sources, specifically it looks like
elf_aarch64 is being injected here:
/
On 5/4/24 4:24 am, Nuno Teixeira via slurm-users wrote:
Any clues?
> ld: error: unknown emulation: elf_aarch64
All I can think is that your ld doesn't like elf_aarch64; from the log
you're posting it looks like that's being injected from the FreeBSD ports
system. Looking at the man page for ld on
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote:
In our case, that node has been removed from the cluster and cannot be
added back right now ( is being used for some other work ). What can we
do in such a case?
Mark the node as "DOWN" in Slurm, this is what we do when we get job
On 3/3/24 23:04, John Joseph via slurm-users wrote:
Is SWAP a mandatory requirement?
All our compute nodes are diskless, so no swap on them.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Hi Robert,
On 2/23/24 17:38, Robert Kudyba via slurm-users wrote:
We switched over from using systemctl for tmp.mount and change to zram,
e.g.,
modprobe zram
echo 20GB > /sys/block/zram0/disksize
mkfs.xfs /dev/zram0
mount -o discard /dev/zram0 /tmp
[...]
> [2024-02-23T20:26:15.881] [530.exter
On 1/10/24 19:39, Drucker, Daniel wrote:
What am I misunderstanding about how sacct filtering works here? I would
have expected the second command to show the exact same results as the
first.
You need to specify --end NOW for this to work as expected. From the man
page:
WITHOUT --jobs AN
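As a hedged example of the suggestion above (user name and start date are made up):
  sacct -u someuser -S 2024-01-01 -E now --format=JobID,JobName,State,Start,End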
On 11/24/23 06:16, Heckes, Frank wrote:
My colleagues are using these toolchains on the Jülich cluster (especially
Juwels). My question is whether these eb files can be shared? I would
be interested especially in the ones using NVHPC as core module.
If Jülich developed that toolchain then I think
On 10/29/23 03:13, John Joseph wrote:
I'd like to know what the maximum scaled-up instance of SLURM is so far.
Cori (which we retired mid-year) had ~12,000 compute nodes in case that
helps.
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 10/24/23 12:39, Tim Schneider wrote:
Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME
", the node goes in "mix@" state (not drain), but no new jobs get
scheduled until the node reboots. Essentially I get draining behavior,
even though the node's state is not "drain". N
On 10/16/23 08:22, Groner, Rob wrote:
It is my understanding that it is a different issue than pmix.
That's my understanding too. The PMIx issue wasn't in Slurm, it was in
the PMIx code that Slurm was linked to. This CVE is for Slurm itself.
--
Chris Samuel : http://www.csamuel.org/ : B
On 10/11/23 07:27, Cristian Huza wrote:
I recall there was a built in tool named seff (slurm efficiency), not
sure if it is still maintained
"seff" is in the Slurm sources in the contribs/seff directory, if you're
building RPMs from them then it's in the "slurm-contribs" RPM.
--
Chris Samue
On 10/13/23 10:10, Angel de Vicente wrote:
But, in any case, I would still be interested in a site factor plugin
example, because I might revisit this in the future.
I don't know if you saw, but there is a skeleton example in the Slurm
sources:
src/plugins/site_factor/none
Not sure if that
On 7/14/23 1:10 pm, Wilson, Steven M wrote:
It's not so much whether a job may or may not access the GPU but rather
which GPU(s) is(are) included in $CUDA_VISIBLE_DEVICES. That is what
controls what our CUDA jobs can see and therefore use (within any
cgroups constraints, of course). In my case
On 8/2/23 2:30 pm, Sandor wrote:
I am looking to track accounting and job data. Slurm requires the use of
MySQL or MariaDB. Has anyone created the needed tables within PostgreSQL
and then had slurmdbd write to it? Any problems?
From memory (and confirmed by git) support for Postgres was removed
On 7/14/23 10:20 am, Wilson, Steven M wrote:
I upgraded Slurm to 23.02.3 but I'm still running into the same problem.
Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still
being made available to jobs so we end up with compute jobs being run on
GPUs which should only be used
On 6/6/23 1:33 pm, Heinz, Michael wrote:
I've gone through the man pages for slurm.conf but I can't find anything about
how to define who the admins are? Is there still a way to do this with slurm or
has the ability been removed?
Looks like that was disabled over 3 years ago.
commit dd111a5
On 5/25/23 4:16 pm, Markuske, William wrote:
I have a badly behaving user that I need to speak with and want to
temporarily disable their ability to submit jobs. I know I can change
their account settings to stop them. Is there another way to set a block
on a specific username that I can lift
On 5/24/23 11:39 am, Fulton, Ben wrote:
Hi,
Hi Ben,
The release notes for 23.02 say “Added usage gathering for gpu/nvml
(Nvidia) and gpu/rsmi (AMD) plugins”.
How would I go about enabling this?
I can only comment on the nvidia side (as those are the GPUs we have)
but for that you need S
On 5/23/23 10:33 am, Pritchard Jr., Howard wrote:
Thanks Christopher,
No worries!
This doesn't seem to be related to Open MPI at all except that for our 5.0.0
and newer one has to use PMix to talk to the job launcher.
I built MPICH 4.1 on Perlmutter using the --with-pmix option and see a si
Hi Tommi, Howard,
On 5/22/23 12:16 am, Tommi Tervo wrote:
23.02.2 contains a PMIx permission regression; it may be worth checking if
that's the case?
I confirmed I could replicate the UNPACK-INADEQUATE-SPACE messages
Howard is seeing on a test system, so I tried that patch on that same
system with
Hi Lawrence,
On 5/17/23 3:26 pm, Sorrillo, Lawrence wrote:
Here is the error I get:
slurmctld: fatal: Can not recover assoc_usage state, incompatible
version, got 9728 need >= 8704 <= 9216,
The slurm version is: 20.11.9
That error seems to appear when slurmctld is loading usage data from
On 3/7/23 6:46 am, Groner, Rob wrote:
Our global settings are PreemptMode=SUSPEND,GANG and
PreemptType=preempt/partition_prio. We have a high priority partition
that nothing should ever preempt, and an open partition that is always
preemptable. In between is a burst partition. It can be pr
On 2/10/23 11:06 am, Analabha Roy wrote:
I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in
my cluster.
If you're looking to try checkpointing MPI applications you may want to
experiment with the MANA ("MPI-Agnostic, Network-Agnostic MPI") plugin
for DMTCP here: https:/
On 1/19/23 5:01 am, Stefan Staeglich wrote:
Hi,
Hiya,
I'm wondering where the UnkillableStepProgram is actually executed. According
to Mike it has to be available on every one of the compute nodes. This makes
sense only if it is executed there.
That's right, it's only executed on compute nodes
On 11/2/22 4:45 pm, Juergen Salk wrote:
However, instead of using `srun --pty bash` for launching interactive jobs, it
is now recommended to use `salloc` and have
`LaunchParameters=use_interactive_step`
set in slurm.conf.
+1 on that, this is what we've been using since it landed.
--
Chris Sa
On 10/31/22 5:46 am, Davide DelVento wrote:
Thanks for helping me find workarounds.
No worries!
My only other thought is that you might be able to use node features &
job constraints to communicate this without the user realising.
I am not sure I understand this approach.
I was just tryi
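A rough sketch of the features/constraints idea (the feature name and node are made up; normally the feature would be set persistently in slurm.conf rather than via scontrol):
  # tag the node with a feature
  scontrol update NodeName=node01 AvailableFeatures=bigscratch ActiveFeatures=bigscratch
  # jobs then opt in with a constraint
  sbatch --constraint=bigscratch job.sh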
On 8/3/22 11:47 am, Benjamin Arntzen wrote:
At risk of being a heretic, why not something like Ansible to handle this?
Nothing heretical about that, but for us the reason is that `scontrol
reboot ASAP` is integrated nicely into the scheduling of jobs, we have
health checks and node epilogs t
On 8/3/22 8:37 am, Phil Chiu wrote:
Therefore my problem is this - "Reboot all nodes, permitting N nodes to
be rebooting simultaneously."
I think currently the only way to do that would be to have a script that
does:
* issue the `scontrol reboot ASAP nextstate=resume [...]` for 3 nodes
* wa
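A rough sketch of the kind of wrapper script described above (node list and batch size are made up, and the state matching is deliberately loose):
  #!/bin/bash
  # Issue "scontrol reboot ASAP" for a few nodes at a time and wait for each
  # batch to come back before starting the next one.
  batch_size=3
  batch=""
  count=0
  for node in $(scontrol show hostnames "node[01-12]"); do
      scontrol reboot ASAP nextstate=resume "$node"
      batch="${batch:+$batch,}$node"
      count=$((count + 1))
      if [ "$count" -ge "$batch_size" ]; then
          # wait until nothing in this batch still shows a boot/down/drain state
          while sinfo -h -n "$batch" -o '%T' | grep -Eqi 'boot|down|drain'; do
              sleep 60
          done
          batch=""
          count=0
      fi
  done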
On 7/18/22 3:45 pm, gphipps wrote:
Every so often one of our users accidentally writes a “fork-bomb”
that submits thousands of sbatch and srun requests per second. It is a
giant DDOS attack on our scheduler. Is there a way of rate limiting
these requests before they reach the daemon?
Yes
On 6/3/22 11:39 am, Ransom, Geoffrey M. wrote:
Adding “--export=NONE” to the job avoids the problem, but I’m not seeing
a way to change this default behavior for the whole cluster.
There's an SBATCH_EXPORT environment variable that you could set for
users to force that (at $JOB-1 we used to d
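A minimal sketch of the environment-variable approach (the profile.d path is illustrative):
  # e.g. /etc/profile.d/slurm_export.sh on the login nodes
  export SBATCH_EXPORT=NONE    # same effect as passing --export=NONE to sbatch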
On 5/29/22 3:09 pm, byron wrote:
This is the first time I've done an upgrade of slurm and I had been
hoping to do a rolling upgrade as opposed to waiting for all the jobs to
finish on all the compute nodes and then switching across but I dont see
how I can do it with this setup. Does any on
On 5/17/22 12:00 pm, Paul Edmon wrote:
Database upgrades can also take a while if your database is large.
Definitely recommend backing up prior to the upgrade, as well as running
slurmdbd -Dv rather than the systemd daemon, as if the upgrade takes a
long time systemd will kill it preemptively due to unre
On 5/5/22 7:08 am, Mark Dixon wrote:
I'm confused how this is supposed to be achieved in a configless
setting, as slurmctld isn't running to distribute the updated files to
slurmd.
That's exactly what happens with configless mode: slurmds retrieve
their config from the slurmctld, and will g
On 5/5/22 5:17 am, Steven Varga wrote:
Thank you for the quick reply! I know I am pushing my luck here: is it
possible to modify slurm: src/common/[read_conf.c, node_conf.c]
src/slurmctld/[read_config.c, ...] such that the state can be maintained
dynamically? -- or cheaper to write a job manag
On 5/4/22 7:26 pm, Steven Varga wrote:
I am wondering what is the best way to update node changes, such as
addition and removal of nodes to SLURM. The excerpts below suggest a
full restart, can someone confirm this?
You are correct, you need to restart slurmctld and slurmd daemons at
present
On 2/8/22 11:41 pm, Alexander Block wrote:
I'm just discussing a familiar case with SchedMD right now (ticket
13309). But it seems that it is not possible with Slurm to submit jobs
that request features/configuration that are not available at the moment
of submission.
Does --hold not allow t
On 2/8/22 2:26 pm, z1...@arcor.de wrote:
These jobs should be accepted, if a suitable node will be active soon.
For example, these jobs could be in PartitionConfig.
From memory if you submit jobs with the `--hold` option then you should
find they are successfully accepted - I've used that in
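For example (job script name is made up):
  sbatch --hold job.sh        # accepted immediately, but held in the pending state
  scontrol release <jobid>    # release it once the matching node is available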
On 1/31/22 9:25 pm, Brian Andrus wrote:
touch /etc/nologin
That will prevent new logins.
It's also useful that if you put a message in /etc/nologin then users
who are trying to login will get that message before being denied.
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org
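For example, with an illustrative message:
  echo "Logins disabled for maintenance, back at 14:00" > /etc/nologin
  # ...and remove the file afterwards to re-enable logins
  rm /etc/nologin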
On 1/31/22 9:00 pm, Christopher Samuel wrote:
That would basically be the way
Thinking further on this a better way would be to mark your partitions
down, as it's likely you've got fewer partitions than compute nodes.
All the best,
Chris
--
Chris Samuel : http://www.c
On 1/31/22 4:41 pm, Sid Young wrote:
I need to replace a faulty DIMM chip in our login node so I need to stop
new jobs being kicked off while letting the old ones end.
I thought I would just set all nodes to drain to stop new jobs from
being kicked off...
That would basically be the way, bu
On 1/16/22 7:41 pm, Nicolas Greneche wrote:
I added a new compute node in the config file, so NodeName becomes:
When adding a node you need to restart slurmctld and all the slurmd's as
they (currently) can only rebuild their internal structures for this at
that time. This is meant to be addressed
On 12/1/21 5:51 am, Gestió Servidors wrote:
I can’t synchronize beforehand with “ntpdate” because when I run “ntpdate -s
my_NTP_server”, I only received the message “ntpdate: no server suitable for
synchronization found”…
Yeah, you'll need to make sure your NTP infrastructure is working first.
There
On 12/1/21 3:27 pm, Brian Andrus wrote:
If you truly want something like this, you could have a wrapper script
look at available nodes, pick a random one and set the job to use that
node.
Alternatively you could have a cron job that adjusted nodes' `weight`
periodically to change which ones S
On 11/22/21 8:28 pm, Jeherul Islam wrote:
Is there any way to configure Slurm so that the high-priority job waits
for a certain amount of time (say 24 hours) before it preempts the other
job?
Not quite, but you can set PreemptExemptTime which says how long a job
must have run for before it can
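A minimal slurm.conf sketch for the 24-hour case mentioned above:
  PreemptExemptTime=24:00:00   # jobs must have run this long before they can be preempted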
On 11/16/21 8:04 am, Arthur Toussaint wrote:
I've seen people having those kinds of problems, but no one seems to be
able to solve it and keep the cgroups
Debian Bullseye switched to cgroups v2 by default which Slurm doesn't
support yet, you'll need to switch back to the v1 cgroups. The release
On 11/16/21 7:07 am, Jaep Emmanuel wrote:
> root@ecpsc10:~# scontrol show node ecpsc10
[...]
>State=DOWN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
[...]
Reason=Node unexpectedly rebooted [slurm@2021-11-16T14:41:04]
This is why the node isn't considered available, as o
On 8/7/21 11:47 pm, Adrian Sevcenco wrote:
yes, the jobs that are running have a file-saving part if they are
killed, and that saving, depending on the target, can get stuck ...
I have to think of a way to take a process snapshot when this happens ...
Slurm does let you request a signal a cer
Hi Andrea,
On 7/9/21 3:50 am, Andrea Carotti wrote:
ProctrackType=proctrack/pgid
I suspect this is the cause of your problems, my bet is that it is
incorrectly identifying the users login processes as being part of the
job and thinking it needs to tidy them up in addition to any processes
On 7/1/21 7:08 am, Brian Andrus wrote:
I have a partition where one of the nodes has a node-locked license.
That license is not used by everyone that uses the partition.
This might be a case for using a reservation on that node with the
MaxStartDelay flag to set the maximum amount of time (in
On 7/1/21 3:26 pm, Sid Young wrote:
I have exactly the same issue with a user who needs the reported cores
to reflect the requested cores. If you find a solution that works please
share. :)
The number of CPUs in the system vs the number of CPUs you can access
are very different things. You c
On 6/4/21 11:04 am, Ahmad Khalifa wrote:
Because there are failing GPUs that I'm trying to avoid.
Could you remove them from your gres.conf and adjust slurm.conf to match?
If you're using cgroups enforcement for devices (ConstrainDevices=yes in
cgroup.conf) then that should render them inacc
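A rough sketch of what that combination might look like (device paths, node name and counts are illustrative):
  # gres.conf - list only the healthy GPUs
  Name=gpu File=/dev/nvidia[0-1]
  # slurm.conf - the node's Gres count must match
  NodeName=gpu01 Gres=gpu:2 ...
  # cgroup.conf - hide device files a job did not request
  ConstrainDevices=yes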
On 5/27/21 12:26 pm, Prentice Bisbal wrote:
Given the lack of traffic on the mailing list and lack of releases, I'm
beginning to think that both of these projects are all but abandoned.
They're definitely actively working on it - I've given them a heads up
on this to let them know how it's bei
On 5/24/21 3:02 am, Mark Dixon wrote:
Does anyone have advice on automatically draining a node in this
situation, please?
We do some health checks via a node epilog set with the "Epilog"
setting, including queueing node reboots with "scontrol reboot".
All the best,
Chris
--
Chris Samuel
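An illustrative fragment of what such an epilog check might look like (the health-check script path is made up):
  #!/bin/bash
  if ! /usr/local/sbin/node_health_check; then
      # drain the node so no new jobs land on it
      scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN Reason="health check failed"
      # or queue a reboot that waits for running jobs to finish:
      # scontrol reboot ASAP nextstate=resume "$SLURMD_NODENAME"
  fi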
On 5/19/21 1:41 pm, Tim Carlson wrote:
but I still don't understand how with "shared=exclusive" srun gives one
result and sbatch gives another.
I can't either, but I can reproduce it with Slurm 20.11.7. :-/
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 5/19/21 9:15 pm, Herc Silverstein wrote:
Does anyone have an idea of what might be going on?
To add to the other suggestions, I would say that checking the slurmctld
and slurmd logs to see what it is saying is wrong is a good place to start.
Best of luck,
Chris
--
Chris Samuel : http
On 5/14/21 1:45 am, Diego Zuccato wrote:
Usage reported in Percentage of Total
Cluster  TRES Name  Allocated  Down  PLND Dow  Idle  Reserved  Reported
-------- ---------- ---------- ----- --------- ----- --------- ---------
On 5/14/21 1:45 am, Diego Zuccato wrote:
It just doesn't recognize 'ALL'. It works if I specify the resources.
That's odd, what does this say?
sreport --version
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
On 5/13/21 3:08 pm, Sid Young wrote:
Hi All,
Hiya,
Is there a way to define an effective "usage rate" of an HPC cluster
using the data captured in the Slurm database?
Primarily I want to see if it can be helpful in presenting to the
business a case for buying more hardware for the HPC :)
Hi Robert,
On 4/16/21 12:39 pm, Robert Peck wrote:
Please can anyone suggest how to instruct SLURM not to massacre ALL my
jobs because ONE (or a few) node(s) fails?
You will also probably want this for your srun: --kill-on-bad-exit=0
What does the scontrol command below show?
scontrol show
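i.e. something along these lines (application name is made up):
  srun --kill-on-bad-exit=0 ./my_application   # don't kill the remaining tasks when one task fails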
On 4/7/21 11:48 am, Administração de Sistemas do Centro de
Bioinformática wrote:
Unfortunately, I still don't know how to use any other value for
PartitionName.
We've got about 20 different partitions on our large Cray system, with a
variety of names (our submit filter system directs jobs to
On 2/9/21 5:08 pm, Paul Edmon wrote:
1. Being on the latest release: A lot of work has gone into improving
RPC throughput, if you aren't running the latest 20.11 release I highly
recommend upgrading. 20.02 also was pretty good at this.
We've not gone to 20.11 on production systems yet, but I
On 1/27/21 9:28 pm, Chandler wrote:
Hi list, we have a new cluster setup with Bright cluster manager.
Looking into a support contract there, but trying to get community
support in the mean time. I'm sure things were working when the cluster
was delivered, but I provisioned an additional node
On 1/24/21 8:39 am, Paul Raines wrote:
I think you have identified the issue here or are very close. My
gres.conf on
the rtx-04 node for example is:
AutoDetect=nvml
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0 Cores=0-15
[...]
Ah - you are doing both autodiscovery here and also specifyin
On 1/26/21 12:10 pm, Ole Holm Nielsen wrote:
What I don't understand is, is it actually *required* to make the NVIDIA
libraries available to Slurm? I didn't do that, and I'm not aware of
any problems with our GPU nodes so far. Of course, our GPU nodes have
the libraries installed and the /de
On 12/18/20 4:45 am, Tina Friedrich wrote:
Yeah, I had that problem as well (trying to set up a partition that
didn't have any nodes - they're not here yet).
You can define nodes in Slurm that don't exist yet with State=FUTURE,
that means slurmctld basically ignores them until you change that
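A minimal slurm.conf sketch (names and specs are made up):
  NodeName=node[101-116] CPUs=64 RealMemory=256000 State=FUTURE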
On 12/14/20 11:20 pm, Alpha Experiment wrote:
It is called using the following submission script:
#!/bin/bash
#SBATCH --partition=full
#SBATCH --job-name="Large"
source testenv1/bin/activate
python3 multithread_example.py
You're not asking for a number of cores, so you'll likely only be
getti
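For example, the script above with an explicit (illustrative) core request added:
  #!/bin/bash
  #SBATCH --partition=full
  #SBATCH --job-name="Large"
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=8
  source testenv1/bin/activate
  python3 multithread_example.py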
Hi Drew,
On 12/4/20 11:32 am, Mullen, Drew wrote:
Error: Package: slurm-20.02.4-1.amzn2.x86_64 (/slurm-20.02.4-1.amzn2.x86_64)
Requires: libnvidia-ml.so.1()(64bit
That looks like it's fixed in 20.02.5 (the current release is 20.02.6):
---
Hi Kevin,
On 11/4/20 6:00 pm, Kevin Buckley wrote:
In looking at the SlurmCtlD log we see pairs of lines as follows
update_node: node nid00245 reason set to: slurm.conf
update_node: node nid00245 state set to DRAINED
I'd go looking in your healthcheck scripts, I took a quick look at the
Hi Navin,
On 11/4/20 10:14 pm, navin srivastava wrote:
I have already built a new server with Slurm 20.2 and the latest DB. My
question is: shall I do a mysqldump into this server from the existing
server running Slurm version 17.11.8?
This won't work - you must upgrade your 17.11 datab
On 10/28/20 6:27 am, Diego Zuccato wrote:
Strangely the core file seems corrupted (maybe because it's from a
4-node job and they all try to write to the same file?):
You can set a pattern for core file names to prevent that, usually the
PID is in the name, but you can put the hostname in the
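For example, via the kernel's core_pattern (the exact pattern is up to you):
  # %e = executable, %h = hostname, %p = PID
  echo 'core.%e.%h.%p' > /proc/sys/kernel/core_pattern
  # or persistently via sysctl: kernel.core_pattern = core.%e.%h.%p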
Hi Paul,
On 10/23/20 10:13 am, Paul Raines wrote:
Any clues as to why pam_slurm_adopt thinks there is no job?
Do you have PrologFlags=Contain in your slurm.conf?
Contain
At job allocation time, use the ProcTrack plugin to create a job
container on all allocated compute nodes. This co
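i.e. the setting in question in slurm.conf:
  PrologFlags=Contain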
On 10/21/20 6:32 pm, Kevin Buckley wrote:
If you install SLES 15 SP1 from the Q2 ISOs so that you have Munge but
not the Slurm 18 that comes on the media, and then try to "rpmbuild -ta"
against a vanilla Slurm 20.02.5 tarball, you should get the error I did.
Ah, yes, that looks like it was a p
On 10/22/20 12:20 pm, Burian, John wrote:
This doesn't help you now, but Slurm 20.11 is expected to have "magnetic
reservations," which are reservations that will adopt jobs that don't specify a
reservation but otherwise meet the restrictions of the reservation:
Magnetic reservations are in
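For reference, a hedged sketch of creating one on 20.11 or later (reservation name, user and node are made up):
  scontrol create reservation ReservationName=maint_magnet users=alice \
      starttime=now duration=120 nodes=node01 flags=magnetic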
On 10/20/20 12:49 am, Kevin Buckley wrote:
only have, as listed before, Munge 0.5.13.
I guess the question is (going back to your initial post):
> error: Failed build dependencies:
>munge-libs is needed by slurm-20.02.5-1.x86_64
Had you installed libmunge2 before trying this build?
On 10/19/20 7:15 pm, Kevin Buckley wrote:
[...]
Just out of interest though, when you built yours on CLE7.0 UP01, what
provided the munge: the vanilla SLES munge, or a Cray munge?
It's cray-munge for CLE7 UP01.
Thanks for the explanation of what you've been running through!
I forgot I do ha
Hi Sajesh,
On 10/8/20 4:18 pm, Sajesh Singh wrote:
Thank you for the tip. That works as expected.
No worries, glad it's useful. Do be aware that the core bindings for the
GPUs would likely need to be adjusted for your hardware!
Best of luck,
Chris
--
Chris Samuel : http://www.csamuel
On 10/8/20 3:48 pm, Sajesh Singh wrote:
Thank you. Looks like the fix is indeed the missing file
/etc/slurm/cgroup_allowed_devices_file.conf
No, you don't want that, that will allow all access to GPUs whether
people have requested them or not.
What you want is in gres.conf and looks lik
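A hedged sketch of the sort of gres.conf entries meant here (device paths, GPU type and core ranges are illustrative):
  Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-17
  Name=gpu Type=v100 File=/dev/nvidia1 Cores=18-35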
Hi Sajesh,
On 10/8/20 11:57 am, Sajesh Singh wrote:
debug: common_gres_set_env: unable to set env vars, no device files
configured
I suspect the clue is here - what does your gres.conf look like?
Does it list the devices in /dev for the GPUs?
All the best,
Chris
--
Chris Samuel : http:/
On 8/14/20 6:17 am, Stefan Staeglich wrote:
what's the current status of the checkpointing support in SLURM?
There isn't any these days, there used to be support for BLCR but that's
been dropped as BLCR is no more.
I know from talking with SchedMD they are of the opinion that any
current c
On 8/6/20 10:13 am, Jason Simms wrote:
Later this month, I will have to bring down, patch, and reboot all nodes
in our cluster for maintenance. The two options available to set nodes
into a maintenance mode seem to be either: 1) creating a system-wide
reservation, or 2) setting all nodes into
On 7/26/20 12:21 pm, Paul Raines wrote:
Thank you so much. This also explains my GPU CUDA_VISIBLE_DEVICES missing
problem in my previous post.
I've missed that, but yes, that would do it.
As a new SLURM admin, I am a bit surprised at this default behavior.
Seems like a way for users to game