[slurm-users] Re: Temporarily bypassing pam_slurm_adopt.so

2024-07-09 Thread Timony, Mick via slurm-users
At HMS we do the same as Paul's cluster and specify the groups we want to have access to all our compute nodes. We allow two groups, representing our DevOps team and our Research Computing consultants, to have access, with corresponding sudo rules for each group to allow different command sets…
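A minimal sketch of this kind of setup, assuming pam_access.so is used to short-circuit pam_slurm_adopt.so for admin groups (the group names below are placeholders, not HMS's actual groups):

    # /etc/pam.d/sshd -- account section; order matters,
    # pam_access runs before pam_slurm_adopt
    account    sufficient   pam_access.so
    account    required     pam_slurm_adopt.so

    # /etc/security/access.conf -- let these groups in everywhere,
    # everyone else falls through to pam_slurm_adopt
    +:(devops):ALL
    +:(rc-consultants):ALL
    -:ALL:ALL

Because pam_access is marked sufficient, a denial there is not fatal: non-admin users simply continue on to pam_slurm_adopt, which only admits them if they have a job running on the node.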

[slurm-users] Re: Job submitted to multiple partitions not running when any partition is full

2024-07-09 Thread Timony, Mick via slurm-users
Hi Paul, There could be multiple reasons why the job isn't running, from the user's QOS to your cluster hitting MaxJobCount. This page might help: https://slurm.schedmd.com/high_throughput.html The output of the following command might also help: scontrol show job 465072 Regards -- Mick Timony…
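For this kind of question, the pending Reason and any QOS limits are usually the first things to look at; a few commands that expose them (the job ID is the one from this thread, the QOS fields are just examples):

    squeue -j 465072 -o "%i %P %T %r"                 # job, partition, state, pending reason
    scontrol show job 465072 | grep -E "Reason|Partition|QOS"
    sacctmgr show qos format=Name,MaxJobsPU,MaxSubmitPU,GrpTRES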

[slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds

2024-02-12 Thread Timony, Mick via slurm-users
We set SlurmdTimeout=600. The docs say not to go any higher than 65533 seconds: https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout The FAQ has info about SlurmdTimeout also. The worst thing that could happen is that it will take longer to set nodes as being down: >A node is set DOWN when the s…
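For reference, a sketch of what that looks like in slurm.conf (the value is the one from this thread; applying it afterwards with scontrol reconfigure, or a slurmctld restart, is the usual follow-up step):

    # slurm.conf
    SlurmdTimeout=600    # seconds the controller waits for slurmd before marking a node DOWN
    # then, on the controller:
    #   scontrol reconfigure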

Re: [slurm-users] DBD_SEND_MULT_MSG - invalid uid error

2024-01-09 Thread Timony, Mick
You could enable debug logging on your Slurm controllers to see if that provides some more useful info. I'd also check your firewall settings to make sure you're not blocking some traffic that you shouldn't; iptables -F will clear your local Linux firewall. I'd also triple-check the UID on all…
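A few commands along those lines (the log level, host names and user name are illustrative; pdsh is an assumption, any parallel shell does the same job):

    scontrol setdebug debug3                    # raise slurmctld log verbosity temporarily
    scontrol setdebug info                      # restore the normal level afterwards
    pdsh -w ctl01,db01,node[01-04] 'id auser'   # compare the UID each host resolves for the user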

Re: [slurm-users] DBD_SEND_MULT_MSG - invalid uid error

2024-01-08 Thread Timony, Mick
This ticket with SchedMD implies it's a munged issue: https://bugs.schedmd.com/show_bug.cgi?id=1293 Is the munge daemon running on all systems? If it is, are all servers running a network time daemon such as chronyd or ntpd, and is the time in sync on all hosts? Regards --Mick
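Some quick checks for the time-sync side of that, whichever daemon is in use:

    chronyc tracking        # chronyd: offset from the reference clock
    ntpq -p                 # ntpd: peer status and offsets
    timedatectl status      # look for "System clock synchronized: yes"
    systemctl status munge  # and confirm munged itself is running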

Re: [slurm-users] Correct way to do logrotation

2023-10-17 Thread Timony, Mick
SchedMD has docs about how to do this at: https://slurm.schedmd.com/slurm.conf.html#SECTION_LOGGING Our config at HMS looks like this:
    /var/log/slurm/slurmctld.log {
        create 0640 slurm root
        daily
        dateext
        nocompress
        notifempty
        rotate 10
        sharedscripts
        postrotate
            /bin/pkill -x…
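The postrotate line is cut off in this preview; a common way that stanza is finished (inferred, not quoted from the HMS config) is to send SIGUSR2 so slurmctld re-opens its log file rather than being restarted:

    postrotate
        /bin/pkill -x --signal SIGUSR2 slurmctld
    endscript

slurmctld and slurmd re-open their log files on SIGUSR2, which avoids the full reconfigure a SIGHUP would trigger.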

Re: [slurm-users] Nodes stay drained no matter what I do

2023-08-24 Thread Timony, Mick
Hi Patrick, You may want to review the release notes for 19.05 and any intermediate versions: https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES I'd also check the slurmd.log on the compute nodes. It's usually…
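For drained nodes in general, the drain reason and the resume command are worth having to hand (the node name below is a placeholder):

    sinfo -R                                        # drained/down nodes with their Reason
    scontrol show node node01 | grep -i reason
    scontrol update NodeName=node01 State=RESUME    # once the underlying issue is fixed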

Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9

2022-09-08 Thread Timony, Mick
…unpack SLURM_PERSIST_INIT message. Regards, Wadud.

Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9

2022-09-08 Thread Timony, Mick
This thread on the forums may help: https://groups.google.com/g/slurm-users/c/YB55Ru9rvD4 It looks like you have something on your network with an older version of Slurm installed. I'd check the Slurm version installed on your compute nodes and controllers. The recommended approach to upgrading…
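A quick way to survey the installed versions (pdsh is an assumption; any parallel shell, or scontrol show node, which reports each node's Version= field, works just as well):

    slurmctld -V ; slurmdbd -V                                  # on the controller / database hosts
    pdsh -a 'slurmd -V' | sort | uniq -c                        # spot nodes running an older release
    scontrol show node | grep -o 'Version=[0-9.]*' | sort | uniq -c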

Re: [slurm-users] Problems with cgroupsv2

2022-08-16 Thread Timony, Mick
When I see odd behaviour I've found it's sometimes related to either NTP issues (the time is off) or munge errors:
* Is NTP running and is the time accurate?
* Look for munge errors:
  * /var/log/munge/munged.log
  * sudo systemctl status munge
If it's a munge error, usually restarting…
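The standard munge round-trip test is also worth running between hosts (node01 is a placeholder):

    munge -n | unmunge                # local encode/decode check
    munge -n | ssh node01 unmunge     # fails if keys differ or clocks are too far apart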

Re: [slurm-users] SlurmDB Archive settings?

2022-07-15 Thread Timony, Mick
That's great advice. Thank you Ole. --Mick

Re: [slurm-users] SlurmDB Archive settings?

2022-07-14 Thread Timony, Mick
What I can tell you is that we have never had a problem re-importing data that was dumped from older versions back into a current-version database. So the import using sacctmgr must do the conversion from the older formats to the newer formats and handle the schema changes. That's the…

Re: [slurm-users] SlurmDB Archive settings?

2022-07-14 Thread Timony, Mick
Hi Paul, If you have 6 years' worth of data and you want to prune down to 2 years, I recommend going month by month rather than doing it in one go. When we initially started archiving data several years back, our first pass at archiving (which at that time covered 2 years of data) took forever…
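A sketch of what the month-by-month approach can look like in slurmdbd.conf (the path and starting values are illustrative, not HMS's actual settings); step the purge window down a little at a time, restarting slurmdbd between changes:

    # slurmdbd.conf
    ArchiveDir=/var/spool/slurm/archive
    ArchiveJobs=yes
    ArchiveSteps=yes
    ArchiveEvents=yes
    PurgeJobAfter=71months     # lower this gradually until it reaches 24months
    PurgeStepAfter=71months
    PurgeEventAfter=12months

Each purge pass then only has to archive and delete roughly a month of records at a time, which keeps the database locks and runtime short.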

Re: [slurm-users] SlurmDB Archive settings?

2022-07-14 Thread Timony, Mick
Hi Ole,
> Which database server and version do you run, MySQL or MariaDB? What's your Slurm version?
MariaDB 5.5.68 and a patched version of Slurm 21.08.7.
> Did you already make appropriate database purges to reduce the size? I have some notes in my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slu…

[slurm-users] SlurmDB Archive settings?

2022-07-13 Thread Timony, Mick
Hi Slurm Users, Currently we don't archive our SlurmDB and have 6 years' worth of data in it. We are looking to start archiving our database as it is starting to get rather large, and we have decided to keep 2 years' worth of data. I'm wondering what approaches or scripts other groups use…

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-31 Thread Timony, Mick
I have a large compute node with 10 RTX8000 cards at a remote colo. One of the cards on it is acting up, "falling off the bus" once a day and requiring a full power cycle to reset. I want jobs to avoid that card as well as the card it is NVLINK'ed to, so I modified gres.conf on that node as follows:…
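The gres.conf the post refers to is cut off here; a sketch of the usual way to do this (device indices and the node name are assumptions) is to list only the healthy GPUs and lower the advertised count to match:

    # gres.conf on the affected node -- nvidia8 is the bad card, nvidia9 its NVLink peer
    Name=gpu Type=rtx8000 File=/dev/nvidia[0-7]

    # slurm.conf -- the node's Gres count must match the reduced list (other node parameters omitted)
    NodeName=gpu-node01 Gres=gpu:rtx8000:8

After changing both files, the node's slurmd typically needs a restart and the controller a scontrol reconfigure (or restart) to pick up the new GPU count.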

[slurm-users] Nvidia virtual GPU (vGPU) and Slurm?

2020-07-01 Thread Timony, Mick
Hi, I've been considering purchasing new NVIDIA RTX6000 or RTX8000 GPUs to add to our existing GPU partitions on our Slurm cluster. The RTX6000 has 24GB of on-board memory and the RTX8000 has 48GB; both of these are single-precision cards. Besides the additional 24GB of memory, the…