Hi Chris,
thanks for the detailed feedback. This is slurm 18.08.5, see also
https://github.com/psteinb/docker-centos7-slurm/blob/7bdb89161febacfd2dbbcb3c5684336fb73d7608/Dockerfile#L9
Best,
Peter
Hi Chris,
thanks for following up on this thread.
First of all, you will want to use cgroups to ensure that processes that do
not request GPUs cannot access them.
We had a feeling that cgroups might be the better approach. Could you point us
to documentation that suggests cgroups are a requirement for this?
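In the meantime, this is my rough understanding of what such a cgroup-based
setup would look like. The excerpt below is my own sketch pieced together from
the docs (not something we run yet), so please correct me if I got it wrong:

  # slurm.conf (excerpt)
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/cgroup

  # cgroup.conf
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainDevices=yes   # hide GPU device files from jobs that did not request them

If I read the gres documentation correctly, ConstrainDevices=yes only takes
effect for GPUs that are listed with File= entries in gres.conf.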
Just to follow up, I filed a medium-severity bug report with SchedMD on this:
https://bugs.schedmd.com/show_bug.cgi?id=6763
Best,
Peter
On 3/25/19 10:30 AM, Peter Steinbach wrote:
Dear all,
Using these config files,
https://github.com/psteinb/docker-centos7-slurm/blob/7bdb89161febacfd2dbbcb3c5684336fb73d7608/gres.conf
https://github.com/psteinb/docker-centos7-slurm/blob/7bdb89161febacfd2dbbcb3c5684336fb73d7608/slurm.conf
I observed a weird behavior of the '--gres-flags=disable-binding' option.
After more tests, the situation clears a bit.
If "Cores=0,1" (etc.) is present in the `gres.conf` file, then one can
inject gres jobs on a single core only by using
`--gres-flags=disable-binding` while a non-gres job is running on the same node.
If "Cores=0,1" is NOT present in `gres.conf`, then any
Interestingly enough, if I add Cores=0-1 and Cores=2-3 to the gres.conf
file, everything stops working again. :/ Should I send around scontrol
outputs? And yes, I made sure to set the --mem flag for the job
submission this time.
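To make the two cases concrete, this is roughly what I am toggling between
(node name c1 and the /dev/nvidia* paths are placeholders, adjust to taste):

  # gres.conf, variant 1: explicit core binding
  NodeName=c1 Name=gpu File=/dev/nvidia0 Cores=0-1
  NodeName=c1 Name=gpu File=/dev/nvidia1 Cores=2-3

  # gres.conf, variant 2: no core binding
  NodeName=c1 Name=gpu File=/dev/nvidia0
  NodeName=c1 Name=gpu File=/dev/nvidia1

  # the kind of gres job I try to squeeze in next to a running non-gres job
  sbatch -n1 --mem=500M --gres=gpu:1 --gres-flags=disable-binding --wrap "sleep 600"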
Best,
Peter
Hi Philippe,
thanks for spotting this. This indeed appears to solve the first issue.
Now I can try to make the GPUs available and play with pinning etc.
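For the pinning part, my plan is simply to check from inside a job which
devices and cores it ends up with, along these lines (assuming the gres/gpu
plugin sets CUDA_VISIBLE_DEVICES as usual and taskset is installed on the node):

  # print the GPU(s) and the CPU affinity a 1-task, 1-GPU job actually gets
  srun -n1 --gres=gpu:1 bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; taskset -cp $$'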
Superb - if you happen to be at ISC, let me know. I'd buy you a
coffee/beer! ;)
Peter
Hi Chris,
I changed the initial state a bit (the number of cores per node was
misconfigured):
https://raw.githubusercontent.com/psteinb/docker-centos7-slurm/18.08.5-with-gres/slurm.conf
But that doesn't change things. Initially, I see this:
# sinfo -N -l
Wed Mar 20 09:03:26 2019
NODELIST NO
Hi Benson,
As you can perhaps see from our slurm.conf, we have task affinity and similar
switches turned off. Along the same lines, I also removed the core binding of the
GPUs. That is why I am quite surprised that slurm doesn't allow new jobs in.
I am aware of the PCIe bandwidth implications of a GP
I've read through the parameters. I am not sure if any of those would
help in our situation. What suggestions would you make? Note that it's not
the scheduler policy that appears to hinder us. It's about how slurm
keeps track of the generic resource and (potentially) binds it to
available cores. Th
Dear Eli,
thanks for your reply. The slurm.conf file I shared earlier already lists this
parameter. We use
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
See also:
https://github.com/psteinb/docker-centos7-slurm/blob/18.08.5-with-gres/slurm.conf#L60
I'll check if that makes a difference.
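One additional thing I will double-check on our side (my own suspicion, not
something you suggested): with CR_Core_Memory, a job that does not ask for
memory can, as far as I understand, be allocated the node's entire memory if
no default is configured, which then blocks further jobs and looks deceptively
like a gres problem. So roughly:

  # slurm.conf (excerpt)
  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory
  DefMemPerCPU=2048   # per-core default so that jobs without --mem do not
                      # implicitly reserve the whole node

  # and explicit requests on the submission side
  sbatch --mem=1G ...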
Hi,
we are struggling with a slurm 18.08.5 installation of ours. We are in a
situation where our GPU nodes have a considerable number of cores but
"only" 2 GPUs inside. While people run jobs using the GPUs, non-GPU jobs
can enter just fine. However, we found out the hard way that the inverse