[slurm-users] Slurm / OpenHPC socket timeout errors

2018-11-23 Thread Kenneth Roberts
Hi - I have the following on a new cluster with OpenHPC & Slurm built off the latest recipe and packages from OpenHPC (built this week). One master node and 4 compute nodes. NodeName=c[1-4] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN With simple test scripts, sbatch prod
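As a rough illustration of what the node definition above describes, the sketch below expands the `c[1-4]` range and multiplies out the per-node processor count. The helper names `expand_hostlist` and `cpus_per_node` are hypothetical, and the expansion handles only a single `prefix[lo-hi]` range; it is not Slurm's own hostlist parser.

```python
import re
from typing import List


def expand_hostlist(expr: str) -> List[str]:
    """Expand a simple 'prefix[lo-hi]' expression into host names.

    Hypothetical helper: only one bracketed numeric range is supported.
    Anything else is returned unchanged as a single-element list.
    """
    m = re.fullmatch(r"(?P<prefix>[^\[\]]+)\[(?P<lo>\d+)-(?P<hi>\d+)\]", expr)
    if m is None:
        return [expr]
    lo, hi = int(m.group("lo")), int(m.group("hi"))
    return [f"{m.group('prefix')}{i}" for i in range(lo, hi + 1)]


def cpus_per_node(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Logical processors implied by the three topology values."""
    return sockets * cores_per_socket * threads_per_core


if __name__ == "__main__":
    nodes = expand_hostlist("c[1-4]")
    per_node = cpus_per_node(sockets=2, cores_per_socket=10, threads_per_core=1)
    print(nodes, per_node, len(nodes) * per_node)
```

With the values quoted in the message, this gives four nodes of 20 logical processors each.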

Re: [slurm-users] About x11 support

2018-11-23 Thread Mark Hahn
Sadly that's exactly what I'm saying. Your $DISPLAY variable is : followed by a number and that's what I'm saying that Slurm forbids, though I'm not clear why. The code checks like this: I think it makes sense. Traditionally, DISPLAY=:0 means "the X server on the machine where the client is
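To make the distinction between `:0` and `hostname:number` concrete, here is a minimal sketch that splits a DISPLAY value into its parts and reports whether it names only a local X server. It is not Slurm's actual check; `parse_display` and `is_local_only` are hypothetical names, and only the plain `host:number[.screen]` form is handled (no IPv6 literals or socket paths).

```python
import re
from typing import NamedTuple, Optional


class Display(NamedTuple):
    host: str               # empty string: the X server on the local machine
    number: int             # display number
    screen: Optional[int]   # optional screen number after the dot


_DISPLAY_RE = re.compile(r"^(?P<host>[^:]*):(?P<number>\d+)(?:\.(?P<screen>\d+))?$")


def parse_display(value: str) -> Display:
    """Parse 'host:number[.screen]'; raise ValueError on anything else."""
    m = _DISPLAY_RE.match(value)
    if m is None:
        raise ValueError(f"unrecognised DISPLAY value: {value!r}")
    screen = m.group("screen")
    return Display(
        host=m.group("host"),
        number=int(m.group("number")),
        screen=int(screen) if screen is not None else None,
    )


def is_local_only(value: str) -> bool:
    """True when DISPLAY has no host part, e.g. ':0' or ':1'."""
    return parse_display(value).host == ""


if __name__ == "__main__":
    for value in (":0", ":1", "head.mydomain.com:1", "localhost:10.0"):
        print(value, parse_display(value), is_local_only(value))
```

A value with an empty host part only makes sense on the machine that runs the X server, which is why it cannot be passed unchanged to a process started on a compute node.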

Re: [slurm-users] About x11 support

2018-11-23 Thread Chris Samuel
Hi Mahmood, On Saturday, 24 November 2018 6:52:54 AM AEDT Mahmood Naderan wrote: > >I suspect if you do: > >echo $DISPLAY > >it will say something like :0 and Slurm doesn't allow that at present. > > Actually that is not applicable here. Please see below > > [mahmood@rocks7 ~]$ echo $DISPLAY >

Re: [slurm-users] About x11 support

2018-11-23 Thread Mahmood Naderan
>I suspect if you do: >echo $DISPLAY >it will say something like :0 and Slurm doesn't allow that at present. Actually that is not applicable here. Please see below [mahmood@rocks7 ~]$ echo $DISPLAY :1 [mahmood@rocks7 ~]$ srun --x11 --nodelist=compute-0-3 -n 1 -c 6 --mem=8G -A y8 -p RUBY xclock

Re: [slurm-users] Over-subscription for a GRES type

2018-11-23 Thread Paul Browne
Ah, of course, that makes sense, thanks. I guess if we're constraining the devices into job specific cgroups then the Slurmd on the node may know what device is assigned to what job and be able to interrogate resource usage from that but there's no mechanism to do it anything other than that. On F

Re: [slurm-users] Over-subscription for a GRES type

2018-11-23 Thread Mark Hahn
We have a use-case in that the GRES being tracked on a particular partition are GPU cards, but aren't being used by applications that would require them exclusively (lightweight direct rendering rather than GP-GPU/CUDA the issue is that slurm/kernel can't arbitrate resources on the GPU, so overs

[slurm-users] Over-subscription for a GRES type

2018-11-23 Thread Paul Browne
Hello slurm-users, This may be a silly question, but I was curious if the concept of over-subscription on a GRES has come up before, or is currently possible in recent SLURM releases? We have a use-case in that the GRES being tracked on a particular partition are GPU cards, but aren't being

[slurm-users] how to find out why a job won't run?

2018-11-23 Thread Steven Dick
I'm looking for a tool that will tell me why a specific job in the queue is still waiting to run. squeue doesn't give enough detail. If the job is held up on QOS, it's pretty obvious. But if it's resources, it's difficult to tell. If a job is not running because of resources, how can I identify

Re: [slurm-users] About x11 support

2018-11-23 Thread Chris Samuel
On Friday, 23 November 2018 7:34:42 PM AEDT Mahmood Naderan wrote: > Now, the question is, why the following error happens when we know that x11 > support had been enabled during the compilation. > > [mahmood@rocks7 ~]$ srun --x11 --nodelist=compute-0-5 -n 1 -c 6 --mem=8G -A > y8 -p RUBY xclock >

Re: [slurm-users] new user; ExitCode reporting

2018-11-23 Thread Chris Samuel
On Friday, 23 November 2018 10:21:09 PM AEDT Matthew Goulden wrote: > I've spent some time reading through the (excellent, frankly) documentation > for sbatch and job_exit_code and while learning a great deal nothing has > explained this anomaly. I suspect Slurm is trying to be helpful, as exit c

Re: [slurm-users] About x11 support

2018-11-23 Thread Mahmood Naderan
>You would need to manipulate the xauth and DISPLAY settings to make them in a different form (hostname:number or IP:number). This is not hard >when you know the trick... Can you give me a keyword for that to search? I cannot understand what is going to be done. Regards, Mahmood

Re: [slurm-users] new user; ExitCode reporting

2018-11-23 Thread mercan
Hi; As far as I know exit code 141 and 13 are the same. Signal + 128 gives exit code: https://slurm-dev.schedmd.narkive.com/MYGH56EW/job-exit-codes Ahmet M. On 23.11.2018 14:36, Matthew Goulden wrote: A confirmation re-run yielded the same outcome but the correct outcome was available
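The "signal + 128" rule mentioned above can be sketched as follows. This is an illustration only: `decode_exit_status` and `parse_slurm_exitcode` are hypothetical helper names, and a status above 128 is merely assumed to come from a signal, since a program can also return such a value deliberately.

```python
import signal
from typing import Optional, Tuple


def decode_exit_status(status: int) -> Tuple[str, int, Optional[str]]:
    """Interpret a shell-style exit status.

    Returns ("signal", signum, name) when status > 128, assuming the
    shell convention status = 128 + signal number; otherwise
    ("exit", status, None). The name is None if the platform does not
    define that signal number.
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = None
        return ("signal", signum, name)
    return ("exit", status, None)


def parse_slurm_exitcode(field: str) -> Tuple[int, int]:
    """Split an 'exit:signal' pair such as '141:0' into two integers."""
    exit_part, signal_part = field.split(":", 1)
    return int(exit_part), int(signal_part)


if __name__ == "__main__":
    exit_code, sig = parse_slurm_exitcode("141:0")
    print(exit_code, sig, decode_exit_status(exit_code))
```

For the `ExitCode=141:0` value reported in this thread, the first field decodes to signal 13, which is SIGPIPE on Linux.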

Re: [slurm-users] new user; ExitCode reporting

2018-11-23 Thread Matthew Goulden
A confirmation re-run yielded the same outcome but the correct outcome was available using $ scontrol show job 197 JobState=FAILED Reason=NonZeroExitCode Dependency=(null) Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=141:0 sacct still reports as before $ sacct -j 197 JobID

[slurm-users] new user; ExitCode reporting

2018-11-23 Thread Matthew Goulden
Hi All, New user migrating from uge/sge, I'm baffled by the ExitCode recording into slurmdb; not sure if this is a 'new user' issue or a bug, so exposing it here first. Running simple sbatch scripts with these relevant headers: #!/bin/bash #SBATCH --mail-user @ #SBATCH --mail-type END #SBATCH -

Re: [slurm-users] About x11 support

2018-11-23 Thread Williams, Gareth (IM&T, Clayton)
Sorry, not at a computer. I think the option may be --export. You would need to manipulate the xauth and DISPLAY settings to make them in a different form (hostname:number or IP:number). This is not hard when you know the trick...

Re: [slurm-users] About x11 support

2018-11-23 Thread Mahmood Naderan
>Then if you run something like: >srun --var=DISPLAY xterm There is no such option when I see the manual page https://slurm.schedmd.com/srun.html Should I write "srun --var=:1" ? Regards, Mahmood On Fri, Nov 23, 2018 at 1:09 PM Williams, Gareth (IM&T, Clayton) wrote: > In the vncviewer s

Re: [slurm-users] About x11 support

2018-11-23 Thread Mahmood Naderan
When I connect through vncviewer on my Windows machine, I connect through IP:5901. So, the display is 1, and I can confirm that when I open a terminal and write "echo $DISPLAY", which returns ":1". Also, when another user connects through IP:5914, he sees that "echo $DISPLAY" returns ":14".
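The port-to-display relationship described here (5901 gives `:1`, 5914 gives `:14`) follows the usual VNC convention of port = 5900 + display number. A minimal sketch, assuming that convention holds on this cluster; the function names are hypothetical:

```python
VNC_BASE_PORT = 5900  # conventional base: display N listens on 5900 + N


def display_for_port(port: int) -> str:
    """Return the local DISPLAY string for a VNC TCP port."""
    if port < VNC_BASE_PORT:
        raise ValueError(f"port {port} is below the VNC base port")
    return f":{port - VNC_BASE_PORT}"


def port_for_display(display: str) -> int:
    """Return the VNC TCP port for a DISPLAY such as ':14' or 'host:1.0'."""
    # Take what follows the last colon, then drop any '.screen' suffix.
    number = int(display.rsplit(":", 1)[1].split(".")[0])
    return VNC_BASE_PORT + number


if __name__ == "__main__":
    print(display_for_port(5901), display_for_port(5914), port_for_display(":14"))
```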

Re: [slurm-users] About x11 support

2018-11-23 Thread Williams, Gareth (IM&T, Clayton)
In the vncviewer session, what is DISPLAY set to? I guess it will be something like head.mydomain.com:1 and you can run X applications that don't need many resources. Then if you run something like: srun --var=DISPLAY xterm Or sbatch with this script: #!/bin/bash xterm You should get an xterm in

Re: [slurm-users] About x11 support

2018-11-23 Thread Mahmood Naderan
Hi Gareth, Thanks for the info. My cluster is not a big one and I have configured it in the following way. 1- A frontend which runs Rocks 7 (based on CentOS 7) with GNOME. Users log in to this node *only* via vncviewer. 2- While a user is connected to his GNOME desktop, he opens a terminal and may r