Hi,
A follow-up. I thought some of the nodes were OK, but that's not the case.
This morning, another pool of consecutive compute nodes is idle* (why
consecutive, by the way? they always fail consecutively). And now some
of the nodes which were drained came back to life in idle and now again
swit
There aren't many mods in my slurm conf.
Since priority/multifactor with PriorityWeightTRES is already
active, it should be possible to use QOS.
Could you give a configuration example?
For example, jobs could occupy 1-128 GB, i.e. would a categorization into ...16,
32, 64, 128 be necessary?
Two categori
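(For reference, a minimal sketch of what a QOS-based memory categorization could look like; the QOS names, memory limits and priorities below are assumptions, not a recommendation:)
# slurm.conf: let QOS contribute to job priority
PriorityType=priority/multifactor
PriorityWeightQOS=10000
# one QOS per memory class (MaxTRESPerJob memory in MB); run as a Slurm admin
sacctmgr add qos mem16 MaxTRESPerJob=mem=16384 Priority=100
sacctmgr add qos mem128 MaxTRESPerJob=mem=131072 Priority=10
# users then pick the matching class at submit time
sbatch --qos=mem16 --mem=12G job.sh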
At least in our experience, the default user share within an account is 1, so
they'd all stay at the same share within that account. Except for one
faculty member who wanted a much higher share than the students within their
account, I've never had to modify shares for any users. So add
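(If you ever do need to raise one user's share, something like the following works; the user, account and value are placeholders:)
# give one user a larger share than the default of 1 within the account
sacctmgr modify user where name=somefaculty account=someaccount set fairshare=10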
Thanks for the help!
Is it possible to use FairTree
(https://slurm.schedmd.com/fair_tree.html) to ensure that all users
always have equal fairshare? On this account, we have users coming and
going relatively often, and having fairshare adjusted automatically would
simplify the administration.
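(Since new associations default to Fairshare=1, users added to the account should all start out equal; a quick way to verify, with the account name as a placeholder:)
# show the fairshare tree for one account, including all of its users
sshare -A someaccount -a -l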
Brian, Bjorn, thank you for your answers.
- From every compute node, I checked that I could nslookup some other compute
nodes as well as the slurm master by their hostnames; that worked.
In the meantime, we identified other issues. Apparently, that solved the
problem for part of the nodes (kyle
That looks like a DNS issue.
Verify that all your nodes are able to resolve each other's names.
Check /etc/resolv.conf, /etc/hosts and /etc/slurm/slurm.conf on the
nodes (including head/login nodes) to ensure they all match.
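(A quick way to check that from every node, assuming the Slurm client commands are in the PATH there:)
# run on each node: try to resolve every node name known to Slurm
for h in $(scontrol show hostnames "$(sinfo -h -o '%N')"); do
  getent hosts "$h" > /dev/null || echo "cannot resolve: $h"
done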
Brian Andrus
On 2/1/2022 1:37 AM, Jeremy Fix wrote:
Hello everyone,
First, thanks Tim for the nvidia-smi 'drain' pointer. That works,
but I'm still confused about why what I did did not work.
But Esben's reference explains it, though I think the default
behavior is very weird in this case. I would think Slurm itself
should default things to CUDA_DEVICE_ORDER=PCI_BUS_ID.
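(One way to set that cluster-wide, as a sketch: a TaskProlog script, whose "export" lines are added to the job environment; the script path is an assumption.)
# slurm.conf
TaskProlog=/etc/slurm/task_prolog.sh
# contents of /etc/slurm/task_prolog.sh (make it executable on the compute nodes):
#!/bin/bash
# lines printed as "export NAME=value" are added to the job's environment
echo "export CUDA_DEVICE_ORDER=PCI_BUS_ID"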
Hi,
My Slurm version is 20.11.5.
I use job_container/tmpfs to set up a private /tmp, but its permissions are
700, so a normal user cannot read or write it:
drwx------ 2 root root 6 Jan 31 01:32 tmp
slurm.conf
JobContainerType=jo
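(For comparison, the minimal layout we'd expect for that plugin, sketched with an assumed BasePath; it doesn't by itself explain the 0700 /tmp:)
# slurm.conf
JobContainerType=job_container/tmpfs
PrologFlags=Contain
# job_container.conf (on every compute node)
AutoBasePath=true
BasePath=/var/spool/slurm/tmpfs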
This might not apply to your setup, but historically when we've seen
similar behaviour, it was often due to the affected compute nodes
missing from /etc/hosts on some *other* compute nodes.
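(If ClusterShell or pdsh is available, a quick consistency check across nodes, with the node range as a placeholder:)
# identical files give identical checksums and are grouped together by -b
clush -w node[01-64] -b 'md5sum /etc/hosts /etc/slurm/slurm.conf'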
--
B/H
Hi,
I am wondering if this is possible with Slurm: I have an application where I want
to create groups of nodes (a group would be between 1 and n servers) which
have exclusive access to a shared resource, and then allow a configurable
number of jobs to run on that group of nodes.
For example
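(One possible way to model this, sketched with made-up names: one partition per node group, plus a partition QOS that caps how many jobs may run on that group at once:)
# slurm.conf: a partition per node group
PartitionName=groupA Nodes=node[01-04] QOS=groupa_limit State=UP
# the QOS attached to the partition; GrpJobs caps concurrently running jobs in it
sacctmgr add qos groupa_limit GrpJobs=8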
Hello experts,
I hope someone out there has some experience with the
"ActiveFeatures" and "AvailableFeatures" in the node configuration and
can give some advice.
We have configured 4 nodes with certain features, e.g.
"NodeName=thin1 Arch=x86_64 CoresPerSocket=24
CPUAlloc=0 CPUTot=96
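(For context, a sketch with placeholder names: features are declared statically in slurm.conf, requested with --constraint, and changed at runtime with scontrol; ActiveFeatures can normally only differ from AvailableFeatures when a node_features plugin manages them.)
# slurm.conf: static feature list for a node
NodeName=thin1 CPUs=96 Feature=intel,bigmem
# jobs select nodes by feature
sbatch --constraint=bigmem job.sh
# runtime changes go through scontrol
scontrol update NodeName=thin1 AvailableFeatures=intel,bigmem ActiveFeatures=intel,bigmem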
Hello everyone,
We are facing a weird issue. On a regular basis, some compute nodes go
from *idle* -> *idle** -> *down* and loop back to idle on their own. Slurm
manages several nodes, and this state cycle appears only for some
pools of nodes.
We get a trace on the compute node such as:
[2022
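(When a node cycles through idle* like that, the reason recorded by slurmctld is usually the first thing to check; the node name below is a placeholder:)
# list down/drained nodes together with the recorded reason
sinfo -R
# state and reason for one node
scontrol show node some-node | grep -iE 'State|Reason'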