Yes. The x11 also worked for us outside of slurm. Well, good luck finding your issue.
On Tue, Jun 12, 2018, 1:09 AM Christopher Benjamin Coffey <chris.cof...@nau.edu> wrote:

> Hi Hadrian,
>
> Thank you, unfortunately that is not the issue. We can connect to the
> nodes outside of Slurm and have the X11 stuff work properly.
>
> Best,
> Chris
>
> --
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>
>
> On 6/7/18, 6:49 PM, "slurm-users on behalf of Hadrian Djohari"
> <slurm-users-boun...@lists.schedmd.com on behalf of hx...@case.edu> wrote:
>
> Hi,
>
> I do not remember whether we had the same error message. But if the
> user's known_hosts file has an old entry for the node he is trying to
> connect to, the X11 forwarding won't connect properly. Once the stale
> known_hosts entry has been deleted, X11 connects just fine.
>
> Hadrian
>
> On Thu, Jun 7, 2018 at 6:26 PM, Christopher Benjamin Coffey
> <chris.cof...@nau.edu> wrote:
>
> Hi,
>
> I've compiled Slurm 17.11.7 with X11 support. We can ssh to a node from
> the login node and get xeyes to work, etc. However, "srun --x11 xeyes"
> results in:
>
> [cbc@wind ~ ]$ srun --x11 --reservation=root_58 xeyes
> X11 connection rejected because of wrong authentication.
> Error: Can't open display: localhost:60.0
> srun: error: cn100: task 0: Exited with exit code 1
>
> On the node, slurmd.log says:
>
> [2018-06-07T15:04:29.932] _run_prolog: run job script took usec=1
> [2018-06-07T15:04:29.932] _run_prolog: prolog with lock for job 11806306 ran for 0 seconds
> [2018-06-07T15:04:29.957] [11806306.extern] task/cgroup: /slurm/uid_3301/job_11806306: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
> [2018-06-07T15:04:29.957] [11806306.extern] task/cgroup: /slurm/uid_3301/job_11806306/step_extern: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
> [2018-06-07T15:04:30.138] [11806306.extern] X11 forwarding established on DISPLAY=cn100:60.0
> [2018-06-07T15:04:30.239] launch task 11806306.0 request from 3301.3302@172.16.3.21 (port 32453)
> [2018-06-07T15:04:30.240] lllp_distribution jobid [11806306] implicit auto binding: cores,one_thread, dist 1
> [2018-06-07T15:04:30.240] _task_layout_lllp_cyclic
> [2018-06-07T15:04:30.240] _lllp_generate_cpu_bind jobid [11806306]: mask_cpu,one_thread, 0x0000001
> [2018-06-07T15:04:30.268] [11806306.0] task/cgroup: /slurm/uid_3301/job_11806306: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
> [2018-06-07T15:04:30.268] [11806306.0] task/cgroup: /slurm/uid_3301/job_11806306/step_0: alloc=1000MB mem.limit=1000MB memsw.limit=1000MB
> [2018-06-07T15:04:30.303] [11806306.0] task_p_pre_launch: Using sched_affinity for tasks
> [2018-06-07T15:04:30.310] [11806306.extern] error: _handle_channel: remote disconnected
> [2018-06-07T15:04:30.310] [11806306.extern] error: _handle_channel: exiting thread
> [2018-06-07T15:04:30.376] [11806306.0] done with job
> [2018-06-07T15:04:30.413] [11806306.extern] x11 forwarding shutdown complete
> [2018-06-07T15:04:30.443] [11806306.extern] _oom_event_monitor: oom-kill event count: 1
> [2018-06-07T15:04:30.508] [11806306.extern] done with job
>
> It seems close: srun and the node agree on which port to use, but slurmd
> reports the node name and display (DISPLAY=cn100:60.0) while srun tries
> to connect via localhost on the same display (localhost:60.0). Maybe I
> have an ssh setting wrong somewhere? I believe I've tried all the
> relevant combinations in ssh_config and sshd_config.
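For reference, a minimal sketch of the OpenSSH checks usually tried for this kind of failure, covering both the stale-known_hosts case mentioned above and the ssh_config/sshd_config question. The node name cn100 is taken from the log excerpt; whether any of these settings is actually the culprit here is not confirmed:

    # Drop a possibly stale host key for the node from ~/.ssh/known_hosts
    ssh-keygen -R cn100

    # Verify plain ssh X11 forwarding to the node with verbose output
    ssh -vX cn100 xeyes

    # Settings usually involved on the node side (/etc/ssh/sshd_config):
    #   X11Forwarding yes
    #   X11UseLocalhost yes      # sshd binds its proxy display to loopback
    #   XAuthLocation /usr/bin/xauth
    #
    # ...and on the client side (/etc/ssh/ssh_config or ~/.ssh/config):
    #   ForwardX11 yes
    #   ForwardX11Trusted yes

Note that the `ssh -vX` test only exercises OpenSSH's own forwarding, not Slurm's built-in X11 path, so it passing (as reported above) while srun --x11 fails is consistent with the symptom described.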
> No issues with /home either; it's a shared filesystem that each node
> mounts, and we even tried no_root_squash so that root can write to the
> .Xauthority file, as some folks have suggested.
>
> Also, xauth list shows that no magic cookie was written for host cn100:
>
> [cbc@wind ~ ]$ xauth list
> wind.hpc.nau.edu/unix:14  MIT-MAGIC-COOKIE-1  ac4a0f1bfe9589806f81dd45306ee33d
>
> Is something preventing root from writing the magic cookie? The file is
> definitely writable:
>
> [root@cn100 ~]# touch /home/cbc/.Xauthority
> [root@cn100 ~]#
>
> Anyone have any ideas? Thanks!
>
> Best,
> Chris
>
> --
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>
>
> --
> Hadrian Djohari
> Manager of Research Computing Services, [U]Tech
> Case Western Reserve University
> (W): 216-368-0395
> (M): 216-798-7490
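As a debugging aid, it can also help to check by hand whether a cookie for the job's display ever lands in ~/.Xauthority, and what happens if one is added manually. The display names cn100:60 / localhost:60 and the login-node display wind.hpc.nau.edu/unix:14 below are taken from the output above; this is only a sketch of a manual test under those assumptions, not a confirmed fix:

    # On the login node: grab the cookie for the real X display
    COOKIE=$(xauth list wind.hpc.nau.edu/unix:14 | awk '{print $3}')

    # On the compute node (shared /home, so the same ~/.Xauthority):
    # the extern step should have added an entry for cn100:60, and an
    # empty result here matches the symptom described above
    xauth list cn100:60

    # Register the cookie under both display names the client might use,
    # then retry; if the client now connects, only the cookie was missing
    xauth add cn100:60 MIT-MAGIC-COOKIE-1 "$COOKIE"
    xauth add localhost:60 MIT-MAGIC-COOKIE-1 "$COOKIE"
    DISPLAY=localhost:60.0 xeyes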