Hello,
I am setting up a very simple configuration: one node running slurmd and
another one running slurmctld.
On the slurmctld machine I run:
srun -v -p debug bash -i
and get this output:
srun: defined options
srun:
srun: partition : debug
sr
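For reference, the kind of slurm.conf I mean is the minimal two-host layout sketched below (the hostnames and CPU count are placeholders, not my real values):

# controller runs slurmctld, the other host runs slurmd
SlurmctldHost=ctl-host
NodeName=compute-host CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=compute-host Default=YES MaxTime=INFINITE State=UP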
I've moved two nodes to a different controller. The nodes are wired and
the controller is networked via wifi. I had to open up ports 6817 and 6818
between the wired and wireless sides of our network to get any connectivity.
Now when I do
srun -N2 hostname
the jobs show connection timeouts on t
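For context, the two ports I opened correspond to the daemon listeners in slurm.conf; srun itself also listens on ephemeral ports on the submit host, which can be pinned with SrunPortRange (the values below are the stock defaults plus an illustrative range, not necessarily what I have set):

SlurmctldPort=6817          # slurmctld on the controller
SlurmdPort=6818             # slurmd on each compute node
#SrunPortRange=60001-63000  # example range for srun's own listeners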
Hi,
I have a node that is registered in two separate partitions: "ML" and "shared".
I'm running 19.05.5-1.
Everything in batch works well; I have users submitting into the shared partition
with QoS "scavenger", which is preemptable by "normal" QoS submissions. The
default QoS is set up to be "scav
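For reference, the preemption piece is set up roughly like the sketch below (not a verbatim copy of my config; REQUEUE is only one possible PreemptMode):

# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE

# accounting: jobs in the "normal" QoS may preempt "scavenger" jobs
sacctmgr modify qos normal set preempt=scavenger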
I moved two nodes to another controller and the two nodes will not come out
of the drain state now. I've rebooted the hosts but they are still stuck
in the drain state. There is nothing in the location given for saving
state (StateSaveLocation), so I can't understand why a reboot doesn't clear this.
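For what it's worth, I understand the drain reason and a manual resume should be reachable with scontrol, roughly as below (node name is a placeholder), but I'd still like to know why the state survives a reboot:

scontrol show node <nodename> | grep -i reason
scontrol update NodeName=<nodename> State=RESUME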
Here's the node
Hi James,
Just for a slightly different take, 2-3 minutes seems a bit long for an
epilog script. Do you need to run all of those checks after every job?
Also, you describe it as running health checks; why not run those checks
via the HealthCheckProgram every HealthCheckInterval (e.g. 1 hour)?
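Something along these lines in slurm.conf is what I have in mind (the script path and values are only examples):

HealthCheckProgram=/usr/sbin/nhc   # e.g. LBNL Node Health Check
HealthCheckInterval=3600           # run hourly
HealthCheckNodeState=ANY           # or IDLE, to stay clear of running jobs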
Or
Dear Slurm Users,
I have a question regarding the energy consumption of a node during
application execution.
Following is the output from a job executed on one node.
$ scontrol show job 13513617
JobId=13513617 JobName=dgemm
...
Jo
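For completeness, the kind of accounting query I have in mind for per-job energy is the sacct call below (this assumes an acct_gather_energy plugin is collecting data; the job ID is the one above):

$ sacct -j 13513617 --format=JobID,JobName,Elapsed,ConsumedEnergy,ConsumedEnergyRaw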
Strange. I set up a "limited" QoS with a limit of cpu=560 and applied it to my queue.
[root@head ~]# sacctmgr show qos
Name  Priority  GraceTime  Preempt  PreemptMode  Flags  UsageThres  UsageFactor  GrpTRES  GrpTRESMins  GrpTRESRunMin  GrpJobs  GrpSubmit  G
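For reference, the setup was roughly the sketch below (I'm assuming GrpTRES is where the cpu=560 cap belongs, and the partition name is a placeholder):

# create the QoS with a 560-CPU group limit
sacctmgr add qos limited set GrpTRES=cpu=560

# attach it to the partition in slurm.conf
PartitionName=mypart QOS=limited Nodes=<nodelist>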