This came up around 12/17, I think, and as I recall the fixes were added to the src repo then; however, they weren't added to any of the 17.x releases.
On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand <rpwieg...@gmail.com> wrote:
> I dug into the logs on both the slurmctld side and the slurmd side.
> For the record, I have debug2 set for both and
> DebugFlags=CPU_BIND,Gres.
>
> I cannot see much that is terribly relevant in the logs. There's a
> known parameter error reported with the memory cgroup specifications,
> but I don't think that is germane.
>
> When I set "--gres=gpu:1", the slurmd log does have encouraging lines
> such as:
>
> [2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device
> /dev/nvidia0 for job
> [2018-05-02T08:47:04.916] [203.0] debug: Not allowing access to
> device /dev/nvidia1 for job
>
> However, I can still "see" both devices from nvidia-smi, and I can
> still access both if I manually unset CUDA_VISIBLE_DEVICES.
>
> When I do *not* specify --gres at all, there is no reference to gres,
> gpu, nvidia, or anything similar in any log at all. And, of course, I
> have full access to both GPUs.
>
> I am happy to attach the snippets of the relevant logs, if someone
> more knowledgeable wants to pore through them. I can also set the
> debug level higher, if you think that would help.
>
> Assuming upgrading will solve our problem, in the meantime: Is there
> a way to ensure that the *default* request always has "--gres=gpu:1"?
> That is, this situation is doubly bad for us: not only is there
> *a way* around the resource management of the device, but the
> *DEFAULT* behavior, if a user issues an srun/sbatch without
> specifying a Gres, is to go around the resource manager.
>
> On Tue, May 1, 2018 at 8:29 PM, Christopher Samuel <ch...@csamuel.org> wrote:
> > On 02/05/18 10:15, R. Paul Wiegand wrote:
> >
> >> Yes, I am sure they are all the same. Typically, I just scontrol
> >> reconfig; however, I have also tried restarting all daemons.
> >
> > Understood. Any diagnostics in the slurmd logs when trying to start
> > a GPU job on the node?
> >
> >> We are moving to 7.4 in a few weeks during our downtime. We had a
> >> QDR -> OFED version constraint -> Lustre client version constraint
> >> issue that delayed our upgrade.
> >
> > I feel your pain. BTW, RHEL 7.5 is out now, so you'll need that if
> > you need current security fixes.
> >
> >> Should I just wait and test after the upgrade?
> >
> > Well, 17.11.6 will be out by then, and that will include a fix for
> > a deadlock that some sites hit occasionally, so that will be worth
> > throwing into the mix too. Do read the RELEASE_NOTES carefully,
> > though, especially if you're using slurmdbd!
> >
> > All the best,
> > Chris
> > --
> > Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
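
On the question quoted above about forcing a default --gres: one option (untested here, and version-dependent) is a job_submit/lua plugin that injects gpu:1 whenever a job arrives with no GRES request. A minimal sketch, assuming a 17.x-era Slurm where the Lua field is job_desc.gres (newer releases expose tres_per_node instead) and JobSubmitPlugins=lua is set in slurm.conf, with the script installed as job_submit.lua in the Slurm configuration directory:

    -- job_submit.lua: add a default GRES when the job requested none
    -- (sketch only; field names differ across Slurm versions)

    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.gres == nil or job_desc.gres == '' then
            -- No --gres on the srun/sbatch line: default to one GPU
            job_desc.gres = "gpu:1"
            slurm.log_info("job_submit/lua: defaulting gres to gpu:1 for uid %d",
                           submit_uid)
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end

This only changes the default request; the cgroup device constraint (ConstrainDevices=yes in cgroup.conf, plus the gres.conf entries for /dev/nvidia*) is still what actually fences the GPUs once the fixed release is in place.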