At HMS we do the same as Paul's cluster and specify the groups we want to have
access to all our compute nodes. We allow two groups, representing our DevOps
team and our Research Computing consultants, to have access, and then
corresponding sudo rules for each group allow different command sets.
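As a rough illustration of the sudo side (the group names and command lists
below are made up for the example, not our actual rules), the per-group rules
can live in a sudoers drop-in like this:

    # /etc/sudoers.d/slurm-admins (hypothetical example)
    Cmnd_Alias SLURM_FULL = /usr/bin/scontrol, /usr/bin/sacctmgr, /usr/bin/systemctl restart slurmd
    Cmnd_Alias SLURM_VIEW = /usr/bin/scontrol show *, /usr/bin/sinfo
    %devops          ALL=(root) SLURM_FULL
    %rc-consultants  ALL=(root) SLURM_VIEW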
Hi Paul,
There could be multiple reasons why the job isn't running, from the user's QOS
to your cluster hitting MaxJobCount. This page might help:
https://slurm.schedmd.com/high_throughput.html
The output of the following command might help:
scontrol show job 465072
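In particular, the JobState and Reason fields in that output usually point at
the cause. For example (the values below are illustrative, not from your job):

    $ scontrol show job 465072 | grep JobState
       JobState=PENDING Reason=QOSMaxJobsPerUserLimit Dependency=(null)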
Regards
--
Mick Timony
We set SlurmdTimeout=600. The docs say not to go any higher than 65533 seconds:
https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout
The FAQ has info about SlurmdTimeout also. The worst thing that could happen is
that it will take longer to mark nodes as down:
>A node is set DOWN when the s
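For reference, it's a single slurm.conf setting; a minimal sketch with the
value we use (the same file should be on the controller and all nodes):

    # slurm.conf
    SlurmdTimeout=600

A scontrol reconfigure, or a daemon restart, should pick the change up.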
You could enable debug logging on your slurm controllers to see if that
provides some more useful info. I'd also check your firewall settings to make
sure you're not blocking traffic that you shouldn't be; iptables -F will flush
your local Linux firewall rules.
I'd also triple check the UID on all the nodes.
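A couple of concrete checks along those lines (the host names below are
placeholders):

    # Bump slurmctld verbosity at run time; drop back to "info" when done
    scontrol setdebug debug2

    # Verify the slurm and munge accounts have the same UID everywhere
    for h in node01 node02 node03; do
        ssh "$h" 'hostname; id slurm; id munge'
    done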
This ticket with SchedMD implies it's a munged issue:
https://bugs.schedmd.com/show_bug.cgi?id=1293
Is the munge daemon running on all systems? If it is, are all servers running a
network time daemon such as chronyd or ntpd, and is the time in sync on all hosts?
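A quick way to test both at once (replace node01 with one of your hosts):
encode a credential locally and decode it on the remote side; a key mismatch or
clock skew shows up right away.

    # Cross-host munge test
    munge -n | ssh node01 unmunge

    # Check time sync
    chronyc tracking      # or: ntpq -p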
Regards
--Mick
SchedMD has docs about how to do this at:
https://slurm.schedmd.com/slurm.conf.html#SECTION_LOGGING
Our config at HMS looks like this:
/var/log/slurm/slurmctld.log {
    create 0640 slurm root
    daily
    dateext
    nocompress
    notifempty
    rotate 10
    sharedscripts
    postrotate
        # SIGUSR2 tells slurmctld to reopen its log file after rotation
        /bin/pkill -x --signal SIGUSR2 slurmctld
    endscript
}
Hi Patrick,
You may want to review the release notes for 19.05 and any intermediate
versions:
https://github.com/SchedMD/slurm/blob/slurm-19-05-5-1/RELEASE_NOTES
https://github.com/SchedMD/slurm/blob/slurm-18-08-9-1/RELEASE_NOTES
I'd also check the slurmd.log on the compute nodes; it's usually a good place
to start. Also look out for any "unpack SLURM_PERSIST_INIT" messages in the
daemon logs.
Regards,
Wadud.
________
From: slurm-users on behalf of Timony, Mick
Sent: 08 September 2022 16:24
To: Slurm User Community List
Subject: Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9
This thread on the forums may help:
https://groups.google.com/g/slurm-users/c/YB55Ru9rvD4
It looks like you have something on your network with an older version of slurm
installed. I'd check the Slurm version installed on your compute nodes and
controllers.
The recommended approach to upgradi
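For what it's worth, the documented order is slurmdbd first, then slurmctld,
then the slurmds on the compute nodes, without jumping more than two major
releases at a time. A rough sketch assuming RPM packages and systemd (package
names vary by distro and how Slurm was built):

    systemctl stop slurmdbd
    yum upgrade slurm-slurmdbd
    systemctl start slurmdbd

    systemctl stop slurmctld
    yum upgrade slurm slurm-slurmctld
    systemctl start slurmctld

    # then roll the compute nodes, e.g. per node:
    #   yum upgrade slurm slurm-slurmd && systemctl restart slurmd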
When I see odd behaviour I've found it sometimes related to either NTP issues
(the time is off) or munge errors:
* Is NTP running and is the time accurate?
* Look for munge errors:
* /var/log/munge/munged.log
* sudo systemctl status munge
If it's a munge error, usually restarting munge (and then slurmd) on the node clears it up.
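A sketch of that sequence on the affected node:

    sudo systemctl restart munge
    sudo systemctl restart slurmd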
That's great advice. Thank you Ole.
--Mick
From: slurm-users on behalf of Ole Holm Nielsen
Sent: Friday, July 15, 2022 2:04 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] SlurmDB Archive settings?
On 7/14/22 18:49, Timony, Mick
What I can tell you is that we have never had a problem re-importing data that
was dumped from older versions back into a current-version database.
So the import using sacctmgr must do the conversion from the older formats to
the newer formats and handle the schema changes.
That's the
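For the mechanics of the re-import, it's a one-liner; the file path below is a
placeholder for whatever slurmdbd wrote to your ArchiveDir:

    sacctmgr archive load file=/slurm_archive/job_archive_file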
Hi Paul
If you have 6 years' worth of data and you want to prune down to 2 years, I
recommend going month by month rather than doing it in one go. When we
initially started archiving data several years back, our first pass at archiving
(the database had 2 years of data in it at that time) took forever
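A sketch of the relevant slurmdbd.conf settings (the values are examples rather
than a recommendation); for the month-by-month approach, start the purge
windows near the age of your oldest data and walk them down toward 24month,
restarting slurmdbd between steps:

    ArchiveDir=/slurm_archive
    ArchiveJobs=yes
    ArchiveSteps=yes
    ArchiveEvents=yes
    PurgeJobAfter=24month
    PurgeStepAfter=24month
    PurgeEventAfter=12month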
Hi Ole,
Which database server and version do you run, MySQL or MariaDB? What's
your Slurm version?
MariaDB 5.5.68 and a patched version of Slurm 21.08.7
Did you already make appropriate database purges to reduce the size? I
have some notes in my Wiki page
https://wiki.fysik.dtu.dk/niflheim/Slu
Hi Slurm Users,
Currently we don't archive our SlurmDB and have 6 years' worth of data in it.
We are looking to start archiving our database as it is starting to get
rather large, and we have decided to keep 2 years' worth of data. I'm wondering
what approaches or scripts other groups use
I have a large compute node with 10 RTX8000 cards at a remote colo.
One of the cards on it is acting up, "falling off the bus" once a day,
requiring a full power cycle to reset.
I want jobs to avoid that card as well as the card it is NVLINK'ed to.
So I modified gres.conf on that node as follows:
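A rough sketch of what that kind of edit can look like; the node name, GPU
type, and device numbers below are made-up placeholders, not the actual config.
The idea is to enumerate only the healthy devices so Slurm never hands out the
bad card or its NVLink partner (the node's Gres= count in slurm.conf has to
drop to match):

    # gres.conf on the affected node: expose 8 of the 10 GPUs
    NodeName=gpunode01 Name=gpu Type=rtx8000 File=/dev/nvidia[0-5]
    NodeName=gpunode01 Name=gpu Type=rtx8000 File=/dev/nvidia[8-9]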
Hi,
I've been considering purchasing new NVIDIA RTX6000 or RTX8000 GPUs to add to
our existing GPU partitions on our Slurm cluster.
The RTX6000 has 24GB of on-board memory and the RTX8000 has 48GB; both of these
are single-precision cards. Besides the additional 24GB of memory th