Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis?

2022-02-01 Thread Jeremy Fix
Hi, a follow-up. I thought some of the nodes were ok, but that's not the case; this morning, another pool of consecutive compute nodes (why consecutive, by the way? they are always consecutively failing) are idle*. And some of the nodes which were drained came back to life in idle and now again swit
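
[For reference, a minimal sketch of how to see what slurmctld recorded for the affected nodes; the node name is a placeholder:

    # list down/drained nodes together with the recorded reason
    sinfo -R
    # show the full state of one affected node, including State and Reason
    scontrol show node kyle01 | grep -Ei 'state|reason'
]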

Re: [slurm-users] how to allocate high priority to low cpu and memory jobs

2022-02-01 Thread z148x
There aren't many modifications in my slurm.conf. Since priority/multifactor with PriorityWeightTRES is already active, it should be possible to use QOS. Could you give a configuration example? For example, jobs could occupy 1-128GB, i.e. is a categorization of ...16, 32, 64, 128 necessary? Two categori
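
[One possible sketch of a QOS-based answer; the QOS name, weight, and limits below are assumptions for illustration, not a recommendation:

    # slurm.conf: let QOS dominate the multifactor priority
    PriorityType=priority/multifactor
    PriorityWeightQOS=100000

    # create a high-priority QOS restricted to small jobs
    sacctmgr add qos small_fast Priority=100 MaxTRESPerJob=cpu=16,mem=32G

    # users then submit small jobs with:
    sbatch --qos=small_fast job.sh
]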

Re: [slurm-users] Fairshare within a single Account (Project)

2022-02-01 Thread Renfro, Michael
At least from our experience, the default user share within an account is 1, so they'd all stay at the same share within that account. Except for the one faculty member who wanted a much higher share than the students within their account, I've never otherwise had to modify shares for any users. So add
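
[For example, raising one user's share within an account can be done with sacctmgr; the user/account names and the value here are placeholders:

    # give one user 10x the default share of 1 inside their account
    sacctmgr modify user where name=prof1 account=lab1 set fairshare=10
]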

Re: [slurm-users] Fairshare within a single Account (Project)

2022-02-01 Thread Tomislav Maric
Thanks for the help! Is it possible to use Fair Tree (https://slurm.schedmd.com/fair_tree.html) to ensure that all users always have equal fairshare? On this account, we have users coming and going relatively often, and having fairshare adjusted automatically would simplify the administration.
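
[A minimal sketch, assuming Fair Tree is the fairshare algorithm in use; the account/user names are placeholders:

    # slurm.conf
    PriorityType=priority/multifactor
    PriorityFlags=FAIR_TREE

    # users added to the account inherit the default share of 1,
    # so siblings under the same account stay equal without manual tuning
    sacctmgr add user name=newuser account=proj1
]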

Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis?

2022-02-01 Thread Jeremy Fix
Brian, Bjørn, thank you for your answers. From every compute node, I checked that I could nslookup some other compute nodes as well as the slurm master for their hostnames; that worked. In the meantime we identified other issues. Apparently, that solved the problem for part of the nodes (kyle
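
[That per-node check can be scripted, e.g. as below; the host names are placeholders:

    # run on each compute node: verify forward lookup of peers and the master
    for h in kyle01 kyle02 slurm-master; do
        nslookup "$h" > /dev/null || echo "lookup FAILED for $h"
    done
]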

Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis?

2022-02-01 Thread Brian Andrus
That looks like a DNS issue. Verify all your nodes are able to resolve the names of each other. Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on the nodes (including head/login nodes) to ensure they all match. Brian Andrus
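
[One way to verify those files really match everywhere; this sketch assumes passwordless ssh from the head node:

    # compare checksums of the relevant files across all nodes;
    # any file with more than one distinct hash is out of sync
    for n in $(sinfo -h -N -o '%n' | sort -u); do
        ssh "$n" md5sum /etc/resolv.conf /etc/hosts /etc/slurm/slurm.conf
    done | sort | uniq -c
]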

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-01 Thread Paul Raines
First, thanks Tim for the nvidia-smi 'drain' pointer. That works, but I was still confused why what I did did not work. Esben's reference explains it, though I think the default behavior is very weird in this case. I would think SLURM itself should default things to CUDA_DEVICE_ORDER=PCI_BUS_ID
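
[If you want that default cluster-wide, one possible approach (a sketch; the script path is an assumption) is a TaskProlog script, which sets task environment variables by printing "export" lines:

    # slurm.conf
    TaskProlog=/etc/slurm/task_prolog.sh

    # /etc/slurm/task_prolog.sh
    #!/bin/sh
    # make CUDA enumerate devices in PCI bus order, matching Slurm/cgroups
    echo "export CUDA_DEVICE_ORDER=PCI_BUS_ID"
]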

[slurm-users] job_container/tmpfs mounts a private /tmp but the permission is 700

2022-02-01 Thread 张 宇超
Hi, my Slurm version is 20.11.5. I use job_container/tmpfs to set up a private /tmp, but the permission is 700, so normal users can not read or write: drwx------ 2 root root 6 Jan 31 01:32 tmp slurm.conf JobContainerType=jo
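
[For reference, a minimal job_container/tmpfs setup looks roughly like this; the BasePath value is an assumption. Note that the 700 root-owned directory visible on the host is not the view a job gets inside its own mount namespace:

    # slurm.conf
    JobContainerType=job_container/tmpfs
    PrologFlags=contain

    # /etc/slurm/job_container.conf
    BasePath=/var/tmp/slurm
]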

Re: [slurm-users] Compute nodes cycling from idle to down on a regular basis?

2022-02-01 Thread Bjørn-Helge Mevik
This might not apply to your setup, but historically when we've seen similar behaviour, it was often due to the affected compute nodes missing from /etc/hosts on some *other* compute nodes. -- B/H
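
[A quick way to test exactly that from the head node; the node name is a placeholder and ssh access is assumed:

    # check that every compute node can resolve the affected node's name
    for n in $(sinfo -h -N -o '%n' | sort -u); do
        ssh "$n" "getent hosts kyle01 > /dev/null || echo 'kyle01 missing on' \$(hostname)"
    done
]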

[slurm-users] Creating groups of nodes with exclusive access to a resource within a partition.

2022-02-01 Thread Rich Cardwell
Hi, I am wondering if this is possible with Slurm. I have an application where I want to create groups of nodes (group size would be between 1 and n servers) which have exclusive access to a shared resource, and then on that group of nodes allow a configurable number of jobs to run. For example
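
[Depending on what the shared resource is, cluster-wide Licenses may approximate this; the name and count below are assumptions, and licenses are counted globally rather than per node group, so per-group limits would still need e.g. one partition per group:

    # slurm.conf: a countable shared resource with 4 units
    Licenses=shared_res:4

    # jobs that need the resource request a unit:
    sbatch -L shared_res:1 job.sh
]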

[slurm-users] ActiveFeatures job submission

2022-02-01 Thread Alexander Block
Hello experts, I hope someone out there has some experience with the "ActiveFeatures" and "AvailableFeatures" in the node configuration and can give some advice. We have configured 4 nodes with certain features, e.g. "NodeName=thin1 Arch=x86_64 CoresPerSocket=24    CPUAlloc=0 CPUTot=96
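
[For reference, a sketch of how active features interact with job submission; the feature name is a placeholder, and changing ActiveFeatures independently of AvailableFeatures may require a node_features plugin:

    # set a node's active feature set (a subset of its AvailableFeatures)
    scontrol update NodeName=thin1 ActiveFeatures=highmem

    # jobs are matched against the active features
    sbatch --constraint=highmem job.sh
]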

[slurm-users] Compute nodes cycling from idle to down on a regular basis?

2022-02-01 Thread Jeremy Fix
Hello everyone, we are facing a weird issue. On a regular basis, some compute nodes go from *idle* -> *idle** -> *down* and loop back to idle on their own. Slurm manages several nodes, and this state cycle appears only for some pools of nodes. We get a trace on the compute node such as: [2022
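
[Since idle* means slurmctld has lost contact with the node's slurmd, a first diagnostic sketch on an affected node could be (the log path is an assumption and depends on SlurmdLogFile):

    # on an affected compute node
    systemctl status slurmd            # is the daemon running?
    scontrol ping                      # can this node reach slurmctld?
    grep -i error /var/log/slurmd.log  # look for communication errors
]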