[slurm-users] problem running slurm

2020-02-06 Thread Hector Yuen
Hello, I am setting up a very simple configuration: one node running slurmd and another one running slurmctld. On the slurmctld machine I run: srun -v -p debug bash -i and get this output: srun: defined options srun: srun: partition : debug sr
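A setup like this generally needs an identical slurm.conf on both machines, a shared munge key, and hostnames each host can resolve. A minimal sketch, assuming placeholder hostnames "ctl-host" and "node01":

  # /etc/slurm/slurm.conf (same file on both hosts; hostnames and CPU count are placeholders)
  ClusterName=test
  SlurmctldHost=ctl-host                 # machine running slurmctld
  AuthType=auth/munge                    # both hosts need the same munge key
  NodeName=node01 CPUs=4 State=UNKNOWN   # machine running slurmd
  PartitionName=debug Nodes=node01 Default=YES MaxTime=INFINITE State=UP

If srun then hangs right after the verbose option dump shown above, name resolution or firewalling between the two hosts is the usual suspect.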

[slurm-users] Which ports does slurm use?

2020-02-06 Thread Dean Schulze
I've moved two nodes to a different controller. The nodes are wired and the controller is networked via wifi. I had to open up ports 6817 and 6818 between the wired and wireless sides of our network to get any connectivity. Now when I do srun -N2 hostname the jobs show connection timeouts on t
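For reference, the ports involved are configurable in slurm.conf; 6817 and 6818 are only the defaults, and srun itself listens on additional ports for traffic coming back from the compute nodes:

  # slurm.conf (values shown are the defaults; SrunPortRange is an example range)
  SlurmctldPort=6817            # client commands and slurmd -> slurmctld
  SlurmdPort=6818               # slurmctld and srun -> slurmd on each node
  SrunPortRange=60001-63000     # ports srun listens on for traffic back from slurmd

The last point matters when srun runs on a host behind a firewall: the compute nodes must be able to connect back to the submitting host, or interactive jobs time out even though sbatch works.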

[slurm-users] srun fails for certain partitions, but works with sbatch

2020-02-06 Thread Li, Yee Ting
Hi, I have a node that is registered in two separate partitions: "ML" and "shared". I'm running 19-05-5-1. Everything in batch works well; I have users submitting into the shared partition with QoS "scavenger", which is pre-emptable by "normal" QoS submissions. The default QoS is set up to be "scav
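For context, a QoS-preemption setup along the lines described usually combines slurm.conf and sacctmgr settings; a rough sketch using the QoS names from the post (the exact values are assumptions):

  # slurm.conf
  PreemptType=preempt/qos
  PreemptMode=REQUEUE           # preempted batch jobs get requeued

  # sacctmgr: jobs under "normal" may preempt jobs under "scavenger"
  sacctmgr add qos scavenger
  sacctmgr modify qos normal set Preempt=scavenger

Note that REQUEUE only applies cleanly to batch jobs; interactive srun steps cannot be requeued, which is one reason srun and sbatch can behave differently under preemption.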

[slurm-users] Nodes stuck in drain state and sending Invalid Argument every second

2020-02-06 Thread Dean Schulze
I moved two nodes to another controller and the two nodes will not come out of the drain state now. I've rebooted the hosts, but they are still stuck in the drain state. There is nothing in the location given for saving state, so I can't understand why a reboot doesn't clear this. Here's the node
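For reference, the drain flag and its reason are kept by the controller (in StateSaveLocation on the slurmctld host), not on the compute node, which is why rebooting the node doesn't clear it. The usual way to inspect and clear it (the node name is a placeholder):

  scontrol show node node01 | grep -i reason    # why the node was drained
  scontrol update NodeName=node01 State=RESUME  # clear DRAIN once the cause is resolved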

Re: [slurm-users] SLURM starts new job before CG finishes

2020-02-06 Thread Paddy Doyle
Hi James, Just for a slightly different take, 2-3 minutes seems a bit long for an epilog script. Do you need to run all of those checks after every job? Also, you describe it as running health checks; why not run those checks via the HealthCheckProgram every HealthCheckInterval (e.g. 1 hour)? Or
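The alternative Paddy mentions maps to plain slurm.conf parameters; a sketch with example values (the script path is hypothetical):

  # slurm.conf: run node health checks out-of-band instead of in the epilog
  HealthCheckProgram=/usr/sbin/nhc      # e.g. LBNL Node Health Check
  HealthCheckInterval=3600              # seconds between runs (hourly)
  HealthCheckNodeState=ANY              # run on idle and allocated nodes alike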

[slurm-users] Query regarding energy from TRES in scontrol output

2020-02-06 Thread Ravi Reddy Manumachu
Dear Slurm Users, I have a question regarding the energy consumption of a node during application execution. Following is the output from a job executed on one node. $ scontrol show job 13513617 JobId=13513617 JobName=dgemm ... Jo
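For comparison, the per-job energy figures come from whichever acct_gather_energy plugin is configured, and can also be queried from accounting after the job finishes; a sketch (the RAPL plugin and sampling interval are examples, the job id is the one from the post):

  # slurm.conf
  AcctGatherEnergyType=acct_gather_energy/rapl
  AcctGatherNodeFreq=30        # energy sampling interval in seconds

  # query accounted energy for the finished job
  sacct -j 13513617 --format=JobID,Elapsed,ConsumedEnergy,ConsumedEnergyRaw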

Re: [slurm-users] Limits to partitions for users groups

2020-02-06 Thread Рачко Антон Сергеевич
Strange. I set up a "limited" QoS with a limit of cpu=560 and applied it to my queue. [root@head ~]# sacctmgr show qos \ Name Priority GraceTime Preempt PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit G
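For comparison, a common way to impose and check such a cap (the QoS name is from the post; the partition name is a placeholder):

  # cap the QoS at 560 CPUs summed over all of its running jobs
  sacctmgr modify qos limited set GrpTRES=cpu=560

  # attach it to the queue in slurm.conf so the limit applies there
  # PartitionName=work Nodes=... QOS=limited ...

  # narrower, more readable output than the full table above
  sacctmgr show qos format=Name,Priority,GrpTRES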