I have not had a chance to look at your code, but I find it intriguing,
although I am not sure about use cases. Do you do anything to lock out
other jobs from the affected node?
E.g., you submit a job with an unsatisfiable constraint foo.
The tool scanning the cluster detects a job queued with foo cons
Hi All,
I have developed a first solution to this issue that I brought up back in early
July. I don't think it is complete enough to be the final solution for everyone,
but it does work, and I think it's a good starting place to showcase the value
of this feature and iterate for improvement. I wa
Hey Stijn,
Thank you very much for the advice!
Answers to your questions:
Q: Are you using rdma-core with Mellanox OFED?
A: Only Mellanox OFED, no rdma-core.
Q: And do you have any uverbs_write error messages in dmesg on the hosts?
A: Yes, I have!
I have set: 'UCX_TLS=tcp,self,sm' on the slurmd'
Hello Tina,
Thank you for the suggestions and responses!!!
As of right now, it seems to be working after removing "CPUs=" altogether
from gres.conf. The original thought process was to have 4 CPUs set aside
to always go to the GPU; I am not so sure that is necessary as long as the CPU
partition can