At HMS we do the same as Paul's cluster and specify the groups we want to have
access to all our compute nodes. We allow two groups, representing our DevOps
team and our Research Computing consultants, to have access, and then
corresponding sudo rules for each group allow different command sets.
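As a rough illustration of the sudo side (the group names and command lists
below are made up for the example, not our actual rules), the per-group rules
can live in a sudoers drop-in like this:

    # /etc/sudoers.d/slurm-admins (hypothetical example)
    Cmnd_Alias SLURM_FULL = /usr/bin/scontrol, /usr/bin/sacctmgr, /usr/bin/systemctl restart slurmd
    Cmnd_Alias SLURM_VIEW = /usr/bin/scontrol show *, /usr/bin/sinfo
    %devops          ALL=(root) SLURM_FULL
    %rc-consultants  ALL=(root) SLURM_VIEW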
Hi Paul,
There could be multiple reasons why the job isn't running, from the user's QOS
to your cluster hitting MaxJobCount. This page might help:
https://slurm.schedmd.com/high_throughput.html
The output of the following command might help:
scontrol show job 465072
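In particular, the JobState and Reason fields in that output usually point at
the cause. For example (the values below are illustrative, not from your job):

    $ scontrol show job 465072 | grep JobState
       JobState=PENDING Reason=QOSMaxJobsPerUserLimit Dependency=(null)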
Regards
--
Mick Timony
We set SlurmdTimeout=600. The docs say not to go any higher than 65533 seconds:
https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
The FAQ has info about SlurmdTimeout also. The worst thing that could happen is
that it will take longer to mark nodes as down:
>A node is set DOWN when the s
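For reference, it's a single slurm.conf setting; a minimal sketch with the
value we use (the same file should be on the controller and all nodes):

    # slurm.conf
    SlurmdTimeout=600

A scontrol reconfigure, or a daemon restart, should pick the change up.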
You could enable debug logging on your slurm controllers to see if that
provides some more useful info. I'd also check your firewall settings to make
sure you're not blocking traffic that you shouldn't be; iptables -F will flush
your local Linux firewall rules.
I'd also triple check the UID on all the nodes.
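A couple of concrete checks along those lines (the host names below are
placeholders):

    # Bump slurmctld verbosity at run time; drop back to "info" when done
    scontrol setdebug debug2

    # Verify the slurm and munge accounts have the same UID everywhere
    for h in node01 node02 node03; do
        ssh "$h" 'hostname; id slurm; id munge'
    done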
This ticket with SchedMD implies it's a munged issue:
https://bugs.schedmd.com/show_bug.cgi?id=1293
Is the munge daemon running on all systems? If it is, are all servers running a
network time daemon such as chronyd or ntpd, and is the time in sync on all hosts?
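A quick way to test both at once (replace node01 with one of your hosts):
encode a credential locally and decode it on the remote side; a key mismatch or
clock skew shows up right away.

    # Cross-host munge test
    munge -n | ssh node01 unmunge

    # Check time sync
    chronyc tracking      # or: ntpq -p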
Regards
--Mick
SchedMD has docs about how to do this at:
https://slurm.schedmd.com/slurm.conf.html#SECTION_LOGGING
Our config at HMS looks like this:
/var/log/slurm/slurmctld.log {
    create 0640 slurm root
    daily
    dateext
    nocompress
    notifempty
    rotate 10
    sharedscripts
    postrotate
        # SIGUSR2 tells slurmctld to reopen its log file after rotation
        /bin/pkill -x --signal SIGUSR2 slurmctld
    endscript
}
Hi Patrick,
You may want to review the release notes for 19.05 and any intermediate
versions:
https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES
https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES
I'd also check the slurmd.log on the compute nodes; it's usually a good place
to start. Also look out for any "unpack SLURM_PERSIST_INIT" messages in the
daemon logs.
Regards,
Wadud.
________
From: slurm-users on behalf of Timony, Mick
Sent: 08 September 2022 16:24
To: Slurm User Community List
Subject: Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9
This thread on the forums may help:
https://groups.google.com/g/slurm-users/c/YB55Ru9rvD4
It looks like you have something on your network with an older version of slurm
installed. I'd check the Slurm version installed on your compute nodes and
controllers.
The recommended approach to upgradi
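For what it's worth, the documented order is slurmdbd first, then slurmctld,
then the slurmds on the compute nodes, without jumping more than two major
releases at a time. A rough sketch assuming RPM packages and systemd (package
names vary by distro and how Slurm was built):

    systemctl stop slurmdbd
    yum upgrade slurm-slurmdbd
    systemctl start slurmdbd

    systemctl stop slurmctld
    yum upgrade slurm slurm-slurmctld
    systemctl start slurmctld

    # then roll the compute nodes, e.g. per node:
    #   yum upgrade slurm slurm-slurmd && systemctl restart slurmd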
When I see odd behaviour I've found it sometimes related to either NTP issues
(the time is off) or munge errors:
* Is NTP running and is the time accurate?
* Look for munge errors:
* /var/log/munge/munged.log
* sudo systemctl status munge
If it's a munge error, usually restarting munge (and then slurmd) on the node clears it up.
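A sketch of that sequence on the affected node:

    sudo systemctl restart munge
    sudo systemctl restart slurmd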
That's great advice. Thank you Ole.
--Mick
From: slurm-users on behalf of Ole Holm Nielsen
Sent: Friday, July 15, 2022 2:04 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] SlurmDB Archive settings?
On 7/14/22 18:49, Timony, Mick
What I can tell you is that we have never had a problem re-importing data that
was dumped from older versions back into a current-version database.
So the import using sacctmgr must do the conversion from the older formats to
the newer formats and handle the schema changes.
That's the
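For the mechanics of the re-import, it's a one-liner; the file path below is a
placeholder for whatever slurmdbd wrote to your ArchiveDir:

    sacctmgr archive load file=/slurm_archive/job_archive_file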
Hi Paul
If you have 6 years' worth of data and you want to prune down to 2 years, I
recommend going month by month rather than doing it in one go. When we
initially started archiving data several years back, our first pass at archiving
(the database had 2 years of data in it at that time) took forever
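A sketch of the relevant slurmdbd.conf settings (the values are examples rather
than a recommendation); for the month-by-month approach, start the purge
windows near the age of your oldest data and walk them down toward 24month,
restarting slurmdbd between steps:

    ArchiveDir=/slurm_archive
    ArchiveJobs=yes
    ArchiveSteps=yes
    ArchiveEvents=yes
    PurgeJobAfter=24month
    PurgeStepAfter=24month
    PurgeEventAfter=12month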
Hi Ole,
Which database server and version do you run, MySQL or MariaDB? What's
your Slurm version?
MariaDB 5.5.68 and a patched version of Slurm 21.08.7
Did you already make appropriate database purges to reduce the size? I
have some notes in my Wiki page
https://wiki.fysik.dtu.dk/niflheim/Slu
Hi Slurm Users,
Currently we don't archive our SlurmDB and have 6 years' worth of data in it.
We are looking to start archiving our database as it is starting to get
rather large, and we have decided to keep 2 years' worth of data. I'm wondering
what approaches or scripts other groups use
I have a large compute node with 10 RTX8000 cards at a remote colo.
One of the cards on it is acting up, "falling off the bus" once a day,
requiring a full power cycle to reset.
I want jobs to avoid that card as well as the card it is NVLINK'ed to.
So I modified gres.conf on that node as follows:
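A rough sketch of what that kind of edit can look like; the node name, GPU
type, and device numbers below are made-up placeholders, not the actual config.
The idea is to enumerate only the healthy devices so Slurm never hands out the
bad card or its NVLink partner (the node's Gres= count in slurm.conf has to
drop to match):

    # gres.conf on the affected node: expose 8 of the 10 GPUs
    NodeName=gpunode01 Name=gpu Type=rtx8000 File=/dev/nvidia[0-5]
    NodeName=gpunode01 Name=gpu Type=rtx8000 File=/dev/nvidia[8-9]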
Hi,
I've been considering purchasing new NVIDIA RTX6000 or RTX8000 GPUs to add to
our existing GPU partitions on our Slurm cluster.
The RTX6000 has 24GB of on-board memory and the RTX8000 has 48GB; both of these
are single-precision cards. Besides the additional 24GB of memory th