Re: [slurm-users] Regarding multiple Slurm servers on one machine

2021-07-27 Thread Valerio Bellizzomi
If you use qemu-kvm, beware: with its default user-mode networking, qemu-kvm does not let the virtual machines and the host communicate directly, so your Slurm servers must all be virtual machines. On Wed, 2021-07-28 at 13:55 +1000, Sid Young wrote: > Why not spin them up as Virtual machines... then you could build real > (separate) cl

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Valerio Bellizzomi
On Wed, 2021-06-02 at 22:11 -0700, Ahmad Khalifa wrote: > How to send a job to a particular gpu card using its ID > (0,1,2...etc)? If your GPUs are CUDA I can't help, but if you have OpenCL GPUs then your program can enumerate them with a call to clGetDeviceIDs() and pick the GPU by number. Starting
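For illustration, a minimal C sketch of selecting a device by ordinal (not from the original thread; it assumes a single OpenCL platform, at most 16 GPUs, and omits most error checking):

  /* Pick an OpenCL GPU by ordinal (0, 1, 2, ...). The ordinal is only
     stable while the driver enumerates devices in the same order; see
     the cl_khr_pci_bus_info post below for a sturdier selection. */
  #define CL_TARGET_OPENCL_VERSION 300
  #include <stdio.h>
  #include <CL/cl.h>

  cl_device_id pick_gpu(cl_uint index)
  {
      cl_platform_id platform;
      cl_device_id devices[16];
      cl_uint ndev = 0;

      clGetPlatformIDs(1, &platform, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, devices, &ndev);
      if (index >= ndev) {
          fprintf(stderr, "no GPU %u (only %u found)\n", index, ndev);
          return NULL;
      }
      return devices[index];
  }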

[slurm-users] Selecting OpenCL GPU reliably

2021-05-06 Thread Valerio Bellizzomi
It is now possible for programs to select a GPU precisely and reliably by first querying OpenCL with the clGetDeviceInfo() function, with the param_name parameter set to CL_DEVICE_PCI_BUS_INFO_KHR (defined by the cl_khr_pci_bus_info extension). The extension is available starting from OpenCL 3.0.7. References: - https://github.c
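A sketch of that query, assuming OpenCL headers that ship the extension definitions (CL/cl_ext.h):

  #define CL_TARGET_OPENCL_VERSION 300
  #include <stdio.h>
  #include <CL/cl.h>
  #include <CL/cl_ext.h>

  /* Prints the device's PCI address; returns -1 if the driver does
     not support the cl_khr_pci_bus_info extension. */
  int print_pci_address(cl_device_id dev)
  {
      cl_device_pci_bus_info_khr info;
      if (clGetDeviceInfo(dev, CL_DEVICE_PCI_BUS_INFO_KHR,
                          sizeof(info), &info, NULL) != CL_SUCCESS)
          return -1;
      printf("PCI %04x:%02x:%02x.%x\n", info.pci_domain,
             info.pci_bus, info.pci_device, info.pci_function);
      return 0;
  }

Because the PCI address identifies the physical card, matching it against lspci output works no matter how the runtime orders its device list.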

Re: [slurm-users] CUDA vs OpenCL

2021-05-06 Thread Valerio Bellizzomi
y serial number using the rocm-smi interface; this approach is much more reliable than using device ordinals: https://rocmdocs.amd.com/en/latest/ROCm_System_Managment/ROCm-SMI-CLI.html?highlight=showuniqueid > -----Original Message----- > From: slurm-users On Behalf > Of Valerio Bellizz
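The unique-ID query referred to here, as documented on the linked page:

  rocm-smi --showuniqueid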

Re: [slurm-users] CUDA vs OpenCL

2021-05-06 Thread Valerio Bellizzomi
pen source > components or layers. > > Gareth > > -----Original Message----- > From: slurm-users On Behalf > Of Valerio Bellizzomi > Sent: Thursday, 6 May 2021 5:21 PM > To: slurm-users@lists.schedmd.com > Subject: Re: [slurm-users] CUDA vs OpenCL > > On Wed, 2021-0

Re: [slurm-users] CUDA vs OpenCL

2021-05-06 Thread Valerio Bellizzomi
On Wed, 2021-04-28 at 10:56 +0200, Valerio Bellizzomi wrote: > Greetings, > I see here https://slurm.schedmd.com/gres.html#GPU_Management that > CUDA_VISIBLE_DEVICES is available for NVIDIA GPUs; what about OpenCL > GPUs? > > Is there an OPENCL_VISIBLE_DEVICES?

[slurm-users] CUDA vs OpenCL

2021-04-28 Thread Valerio Bellizzomi
Greetings, I see here https://slurm.schedmd.com/gres.html#GPU_Management that CUDA_VISIBLE_DEVICES is available for NVIDIA GPUs; what about OpenCL GPUs? Is there an OPENCL_VISIBLE_DEVICES? -- Valerio Bellizzomi https://www.selroc.systems http://www.selnet.org
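For context: per the linked gres.html page, Slurm sets CUDA_VISIBLE_DEVICES itself for jobs that request NVIDIA GPUs, so what a job actually receives can be checked with something like

  srun --gres=gpu:1 env | grep CUDA_VISIBLE_DEVICES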

Re: [slurm-users] Using hyperthreaded processors

2020-11-06 Thread Valerio Bellizzomi
On Fri, 2020-11-06 at 13:00 +0100, Diego Zuccato wrote: > On 04/11/20 19:12, Brian Andrus wrote: > > > One thing you will start finding in HPC is that, by its goal, > > hyperthreading is usually a poor fit. > Depends on many factors, but our tests confirm it can do much good! > > > If you a

Re: [slurm-users] spawning a new terminal for each srun

2019-06-30 Thread Valerio Bellizzomi
On Sun, 2019-06-30 at 18:15 -0700, Chris Samuel wrote: > On Saturday, 29 June 2019 10:33:50 AM PDT Valerio Bellizzomi wrote: > > > no, I am using the option --unbuffered to watch the output in a terminal > > window. > > I don't think this is a Slurm issue, you

Re: [slurm-users] spawning a new terminal for each srun

2019-06-29 Thread Valerio Bellizzomi
n to a location only accessible from the > compute node running your job? You might be able to ssh from the submit host > to the compute node (or maybe from your local computer to the compute node). > > > On Jun 29, 2019, at 10:07 AM, Valerio Bellizzomi wrote:

Re: [slurm-users] spawning a new terminal for each srun

2019-06-29 Thread Valerio Bellizzomi
On Sat, 2019-06-29 at 07:57 -0700, Brian Andrus wrote: > I believe you are referring to an interactive terminal window. > > You can do that with srun --pty bash > > Windows themselves are not handled by slurm at all. To have multiple > windows is a function of your workstation. You would need mu
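One minimal way to get a window per job on an X11 workstation, assuming xterm is installed (a sketch, not from the original thread):

  xterm -e srun --pty bash &
  xterm -e srun --pty bash &

Each xterm launches its own srun, so every interactive shell lands in its own window.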

Re: [slurm-users] spawning a new terminal for each srun

2019-06-29 Thread Valerio Bellizzomi
On Sat, 2019-06-29 at 16:48 +0200, Valerio Bellizzomi wrote: > On Sat, 2019-06-29 at 07:36 -0700, Brian Andrus wrote: > > A little more detail about what you are trying to do would help. > > > > multiple srun statements with --pty options will spawn multiple > > termin

Re: [slurm-users] spawning a new terminal for each srun

2019-06-29 Thread Valerio Bellizzomi
it will create a terminal within a terminal. > > So, I would ask: what are you trying to do and we may be able to advise > the best way to accomplish it. > > Brian Andrus > > On 6/29/2019 12:53 AM, Valerio Bellizzomi wrote: > > How is this normally done?

[slurm-users] spawning a new terminal for each srun

2019-06-29 Thread Valerio Bellizzomi
How is this normally done?

Re: [slurm-users] getting closer

2019-06-29 Thread Valerio Bellizzomi
On Fri, 2019-06-28 at 09:39 +0200, Ole Holm Nielsen wrote: > On 6/28/19 9:18 AM, Valerio Bellizzomi wrote: > > On Fri, 2019-06-28 at 08:51 +0200, Valerio Bellizzomi wrote: > >> On Thu, 2019-06-27 at 18:35 +0200, Valerio Bellizzomi wrote: > >>> The nodes are now commun

Re: [slurm-users] getting closer

2019-06-28 Thread Valerio Bellizzomi
On Fri, 2019-06-28 at 09:39 +0200, Ole Holm Nielsen wrote: > On 6/28/19 9:18 AM, Valerio Bellizzomi wrote: > > On Fri, 2019-06-28 at 08:51 +0200, Valerio Bellizzomi wrote: > >> On Thu, 2019-06-27 at 18:35 +0200, Valerio Bellizzomi wrote: > >>> The nodes are now commun

Re: [slurm-users] getting closer

2019-06-28 Thread Valerio Bellizzomi
On Fri, 2019-06-28 at 08:51 +0200, Valerio Bellizzomi wrote: > On Thu, 2019-06-27 at 18:35 +0200, Valerio Bellizzomi wrote: > > The nodes are now communicating however when I run the command > > > > srun -w compute02 /bin/ls > > > > it remains stuck and there i

Re: [slurm-users] getting closer

2019-06-27 Thread Valerio Bellizzomi
On Thu, 2019-06-27 at 18:35 +0200, Valerio Bellizzomi wrote: > The nodes are now communicating however when I run the command > > srun -w compute02 /bin/ls > > it remains stuck and there is no output on the submit machine. > > on the compute02 there is a Communicat

[slurm-users] getting closer

2019-06-27 Thread Valerio Bellizzomi
The nodes are now communicating; however, when I run the command srun -w compute02 /bin/ls it remains stuck and there is no output on the submit machine. On compute02 there is a Communication error and Timeout. The network ports 6817 and 6818 are open.
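A likely culprit in this situation is a firewall: besides 6817/6818, srun opens dynamic ports on the submit host for job I/O, which the compute node must be able to reach. Those ports can be pinned to a fixed, firewall-friendly range in slurm.conf (the range below is only an example):

  # slurm.conf
  SrunPortRange=60001-63000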

Re: [slurm-users] gpu count

2019-06-27 Thread Valerio Bellizzomi
> On 19-06-27 15:33, Valerio Bellizzomi wrote: > > hello, my node has 2 gpus so I have specified gres=gpus:2 but the > > scontrol show node displays this: > > > > State=IDLE+DRAIN > > Reason=gres/gpus count too low (1 < 2)

Re: [slurm-users] gpu count

2019-06-27 Thread Valerio Bellizzomi
On Thu, 2019-06-27 at 15:33 +0200, Valerio Bellizzomi wrote: > hello, my node has 2 gpus so I have specified gres=gpus:2 but the > scontrol show node displays this: > > State=IDLE+DRAIN > Reason=gres/gpus count too low (1 < 2) Also, the node is repeating a debug message: deb

[slurm-users] gpu count

2019-06-27 Thread Valerio Bellizzomi
hello, my node has 2 gpus so I have specified gres=gpus:2 but the scontrol show node displays this: State=IDLE+DRAIN Reason=gres/gpus count too low (1 < 2)
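For comparison, a minimal matching pair of definitions (the node name and device paths are placeholders, and the GRES name must be spelled identically everywhere; the stock name is "gpu" rather than "gpus"):

  # slurm.conf
  GresTypes=gpus
  NodeName=compute02 Gres=gpus:2 ...

  # gres.conf on the node, one File= line per card
  Name=gpus File=/dev/nvidia0
  Name=gpus File=/dev/nvidia1

A drain reason of "count too low (1 < 2)" usually means gres.conf on the node defines only one of the two devices.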

Re: [slurm-users] What does this error mean?

2019-06-26 Thread Valerio Bellizzomi
On Wed, 2019-06-26 at 08:23 +0200, Marcus Wagner wrote: > Have you restarted munge on all hosts? Now it works, thanks. > > On 6/25/19 4:38 PM, Valerio Bellizzomi wrote: > > On Tue, 2019-06-25 at 16:32 +0200, Valerio Bellizzomi wrote: > >> On Tue, 2019-06-25 at 08:48 -040
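For anyone hitting the same error, the usual sanity check after restarting munge everywhere (compute02 is a placeholder node name):

  systemctl restart munge               # on every node
  munge -n | unmunge                    # local decode should report Success
  munge -n | ssh compute02 unmunge      # proves both hosts share one munge.key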

Re: [slurm-users] What does this error mean?

2019-06-25 Thread Valerio Bellizzomi
On Tue, 2019-06-25 at 16:32 +0200, Valerio Bellizzomi wrote: > On Tue, 2019-06-25 at 08:48 -0400, Eli V wrote: > > My first guess would be that the host is not listed as one of the two > > controllers in the slurm.conf. Also, keep in mind munge, and thus > > slurm is very

Re: [slurm-users] What does this error mean?

2019-06-25 Thread Valerio Bellizzomi
slurmd on the compute node refuses to connect to the controller with this error: Protocol authentication error > > > On Tue, Jun 25, 2019 at 1:50 AM Valerio Bellizzomi wrote: > > > > I have installed slurmctld on Debian Testing, trying to start the daemon > > by hand:

[slurm-users] What does this error mean?

2019-06-24 Thread Valerio Bellizzomi
I have installed slurmctld on Debian Testing, trying to start the daemon by hand: # /usr/sbin/slurmctld -D -v -f /etc/slurm-llnl/slurm.conf slurmctld: error: High latency for 1000 calls to gettimeofday(): 2072 microseconds slurmctld: pidfile not locked, assuming no running daemon slurmctld: slu

[slurm-users] Recurring error

2018-04-17 Thread Valerio Bellizzomi
Hello, I have a recurring error in the log of slurmctld: [2018-04-10T19:32:40.145] error: _unpack_ret_list: message type 24949, record 0 of 56214 [2018-04-10T19:32:40.145] error: invalid type trying to be freed 24949 [2018-04-10T19:32:40.145] error: unpacking header [2018-04-10T19:32:40.145] erro