Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-02 Thread Stephan Roth
On 02.02.22 18:32, Michael Di Domenico wrote: On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote: The problem is to identify the cards physically from the information we have, like what's reported with nvidia-smi or available in /proc/driver/nvidia/gpus/*/information. The serial number isn't shown...
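For reference, nvidia-smi's query mode can list the identifiers it does have for each card in one shot (standard query properties, though as noted above the serial field is empty on some models):

    nvidia-smi --query-gpu=index,name,serial,uuid,pci.bus_id --format=csv

The PCI bus ID can then be matched to a physical slot even when no serial number is reported.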

[slurm-users] job_container/tmpfs mounts a private /tmp but the permission is root 700.Normal user can not read or write.

2022-02-02 Thread William Zhang
Hi, My Slurm version is 20.11.5. I use job_container/tmpfs to set up a private /tmp, but the permission is 700. A normal user cannot read or write. drwx------ 2 root root 6 Jan 31 01:32 tmp I think the permission should be 70...
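For comparison, a minimal job_container/tmpfs setup generally needs only the following (the BasePath value is a placeholder and must exist on every compute node; on early 20.11.x releases the per-job /tmp was created by the plugin itself, so a correct config alone may not cure a root-only 0700 directory):

    # slurm.conf
    JobContainerType=job_container/tmpfs
    PrologFlags=Contain

    # job_container.conf
    BasePath=/var/spool/slurm/tmpfs    # placeholder path, site-specific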

Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-02 Thread Brian Andrus
I actually just did that path for a children's hospital. It was fairly straightforward. Running jobs were not affected. You do need to go 17 -> 18 -> 19 -> 20 -> 21, because there were changes in the DB schema. If you plan on bringing everything to a stop (no running jobs), you should be good...
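As a rough sketch of one such hop (the package installation step and the database name are assumptions; slurm_acct_db is only the default accounting database name, and a real upgrade should follow the version-specific release notes):

    systemctl stop slurmctld slurmdbd
    mysqldump --single-transaction slurm_acct_db > slurm_acct_db.$(date +%F).sql   # back up accounting data first
    # ...install the next major release's packages here...
    slurmdbd -D -vvv    # run in the foreground so the schema conversion can be watched; stop it once it finishes
    systemctl start slurmdbd slurmctld

Repeat for each intermediate major release on the chosen path.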

Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-02 Thread Brian Haymore
Are you running slurmdbd in your current setup? If you are, then the upgrade path there might have additional considerations, moving this far in versions. -- Brian D. Haymore, University of Utah Center for High Performance Computing
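A quick way to confirm whether slurmdbd is in the picture and what the daemons currently report (standard commands; output fields vary slightly by version):

    scontrol show config | grep -i AccountingStorageType
    slurmdbd -V
    sacctmgr show cluster format=Cluster,ControlHost,RPC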

[slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-02 Thread Nathan Smith
The "Upgrades" section of the quick-start guide [0] warns: > Slurm permits upgrades to a new major release from the past two major > releases, which happen every nine months (e.g. 20.02.x or 20.11.x to > 21.08.x) without loss of jobs or other state information. State > information from older ver

[slurm-users] Fwd: Using PreemptExemptTime

2022-02-02 Thread Phil Kauffman
Does anyone have a working example using PreemptExemptTime? My goal is to make a higher-priority job wait 24 hours before actually preempting a lower-priority job. Put another way, any job is entitled to 24 hours of run time before being preempted. The preempted job should be suspended, ideally. If r...
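A minimal slurm.conf sketch of a suspend-based setup with a 24-hour exemption might look like the following (partition names, node lists and priority tiers are placeholders, not a tested configuration):

    PreemptType=preempt/partition_prio
    PreemptMode=SUSPEND,GANG
    PreemptExemptTime=24:00:00
    PartitionName=low  Nodes=ALL Default=YES PriorityTier=1
    PartitionName=high Nodes=ALL PriorityTier=2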

Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis ?

2022-02-02 Thread Jeremy Fix
Hello, Thank you for your suggestion, and thanks also to Tina. To answer your question, there is no TreeWidth entry in the slurm.conf. But it seems we figured out the issue, and I'm so sorry we did not think about it: we already had a pool of 48 nodes on the master, but their slurm.conf...
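For anyone chasing the same symptom, a quick way to spot nodes carrying a stale slurm.conf (assuming passwordless ssh and the default /etc/slurm/slurm.conf path; adjust for your install):

    md5sum /etc/slurm/slurm.conf          # checksum on the controller
    for n in $(sinfo -N -h -o %n | sort -u); do
        printf '%s: ' "$n"
        ssh -o BatchMode=yes "$n" md5sum /etc/slurm/slurm.conf
    done

Any node whose checksum differs from the controller's is a candidate for the idle/down cycling described above.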

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-02 Thread Michael Di Domenico
On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote: > The problem is to identify the cards physically from the information we have, like what's reported with nvidia-smi or available in /proc/driver/nvidia/gpus/*/information. The serial number isn't shown for every type of GPU and I'm not sure...
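On the original question of ignoring particular GPUs: one common approach is to list only the devices Slurm should manage in gres.conf and size the Gres count in slurm.conf to match (device paths, node name and GPU type below are placeholders for a node where /dev/nvidia2 is the card to be skipped):

    # gres.conf on the node
    Name=gpu Type=v100 File=/dev/nvidia0
    Name=gpu Type=v100 File=/dev/nvidia1
    Name=gpu Type=v100 File=/dev/nvidia3

    # slurm.conf
    NodeName=gpunode01 Gres=gpu:v100:3 ...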

Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis ?

2022-02-02 Thread Stephen Cousins
Hi Jeremy, What is the value of TreeWidth in your slurm.conf? If there is no entry, then I recommend setting it to a value a bit larger than the number of nodes you have in your cluster, and then restarting slurmctld. Best, Steve On Wed, Feb 2, 2022 at 12:59 AM Jeremy Fix wrote: > Hi, > A fol...
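For concreteness, the suggestion amounts to a single slurm.conf change (400 is only an illustrative value for a cluster of just over 300 nodes):

    TreeWidth=400

followed by a restart of slurmctld, with the same slurm.conf distributed to the compute nodes.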

Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis ?

2022-02-02 Thread Tina Friedrich
Hi Jeremy, I haven't got anything very intelligent to contribute toward solving your problem. However, what I can tell you is that we run our production cluster with one SLURM master running on a virtual machine handling just over 300 nodes. We have never seen the sort of problem you have, other than...