Hi Gareth, thank you for your answer,
I have thought about that too, but I think the --block option that I use
with ray start is supposed to sleep indefinitely precisely so that this
does not happen. However, maybe it is not taken into account because I
use '&' at the end of the ray start command in the install_worker.sh
script, so that the call is non-blocking when I launch
install_worker.sh with srun in the parent script?
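If that is the case, maybe adding a 'wait' at the end of install_worker.sh
would keep the srun step alive even with the '&'? Something like this
(untested sketch, using the variable names from the sbatch script below):
```
# untested sketch: keep the '&' on ray start but make install_worker.sh itself
# block, so the srun step does not exit and tear down the ray worker process
ray start --block --address=$ip_head --redis-password=$redis_password &
wait  # wait for the backgrounded ray process instead of returning immediately
```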
Yann
"Williams, Gareth (IM&T, Black Mountain)" <gareth.willi...@csiro.au> a écrit :
Hi Yann,
The remaining problem may be that the ray processes are not waited
on. I'm not sure, but I hope this gets you looking in the right place.
You may need to sleep indefinitely in the scripts that run the
worker ray processes; then, when the master has finished making them
work, cancel the workers and exit the main script. If you just
exit the main script, computecanada will probably clean up for you
automatically - but it is polite to clean up after yourself.
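Roughly something like this, as an untested sketch built on the variable
names from your script:
```
# untested sketch: start the workers as backgrounded srun steps, do the real
# work, then explicitly stop the worker steps before the batch script exits
srun --export=ALL --nodes=1 --ntasks=1 -w $node2 \
     ray start --block --address=$ip_head --redis-password=$redis_password &

python -u trainer.py $redis_password 15  # the actual work

kill $(jobs -p) 2>/dev/null  # polite cleanup of the backgrounded worker steps
wait
```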
Gareth
-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf
Of Yann Bouteiller
Sent: Monday, 18 November 2019 1:49 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] How to use a python virtualenv with srun?
Hello Brian, thank you for your answer.
Actually, you are not allowed to install things in your home directory
on computecanada, which is why you need to install everything in a
virtualenv with pip install. Also, you have to install each
virtualenv in $SLURM_TMPDIR, which is the local drive of the node,
because everything else is slow, so I don't think I can share homes.

Actually, I succeeded in installing different virtualenvs on
different nodes, using a script for each worker that creates a local
virtualenv, installs ray in it, and connects to the ray server
running in the virtualenv of the head node (I mean the primary node,
yes). I just call these scripts with srun. However, for some reason,
the workers seem to connect fine to the server but are detected as
dead after a while: https://groups.google.com/forum/#!topic/ray-dev/INB_zVS5PWY
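Each worker script looks roughly like this (a simplified sketch rather than
the exact script):
```
#!/bin/bash
# simplified sketch of the per-worker script (install_worker.sh):
# build a virtualenv on the node-local disk, install ray in it,
# and join the cluster running on the head node
module load python/3.7.4
virtualenv $SLURM_TMPDIR/venv
source $SLURM_TMPDIR/venv/bin/activate
pip install ray
ray start --block --address=$ip_head --redis-password=$redis_password &  # backgrounded with '&'
```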
Yann
Brian Andrus <toomuc...@gmail.com> wrote:
I suspect when you say "head node" you mean the primary node from the
nodes you were allocated.
Normally, when you use pip as a user, it installs into your home
directory. Are you certain all your nodes share the same homes?
If they are merely synced, that would not be the same. Not actually
sharing homes could be the cause.
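A quick way to check, if you want (just a suggestion), is to create a file
in your home from the batch script and look for it from every allocated node:
```
# rough check: is $HOME the same filesystem on every allocated node?
touch $HOME/.shared_home_test
srun --ntasks-per-node=1 ls -l $HOME/.shared_home_test
rm $HOME/.shared_home_test
```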
Brian Andrus
On 11/17/2019 11:24 AM, Yann Bouteiller wrote:
Hello,
I am trying to do this on computecanada, which is managed by slurm:
https://ray.readthedocs.io/en/latest/deploying-on-slurm.html
However, on computecanada, you cannot install things on nodes before
the job has started, and you can only install things in a python
virtualenv once the job has started.
I can do:
```
module load python/3.7.4
source venv/bin/activate
pip install ray
```
in the bash script before calling everything else, but apparently
this only creates/activates the virtualenv and installs ray on the
head node, not on the remote nodes, so calling
```
srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
```
will succeed, but later calling
```
for (( i=1; i<=$worker_num; i++ ))
do
  node2=${nodes_array[$i]}
  srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers
  sleep 5
done
```
will produce the following error:
```
slurmstepd: error: execve(): ray: No such file or directory
srun: error: cdr768: task 0: Exited with exit code 2
srun: Terminating job step 31218604.3
[2]+  Exit 2   srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password
```
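I suppose this means the ray executable installed in the head node's
virtualenv is not visible on the other nodes; something like the following
(just my guess at a check) should show whether the workers can find it at all:
```
# check whether the 'ray' command is on the PATH seen by a worker node
srun --export=ALL --nodes=1 --ntasks=1 -w $node2 which ray
```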
How can I tackle this issue, please? I am a beginner with slurm, so I
am not sure what the problem is here. Here is my whole sbatch
script:
```
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=1000M
#SBATCH --nodes=3
#SBATCH --tasks-per-node 1
worker_num=2 # Must be one less than the total number of nodes
nodes=$(scontrol show hostnames $SLURM_JOB_NODELIST) # Getting the node names
nodes_array=( $nodes )
module load python/3.7.4
source venv/bin/activate
pip install ray
node1=${nodes_array[0]}
ip_prefix=$(srun --nodes=1 --ntasks=1 -w $node1 hostname --ip-address) # Making address
suffix=':6379'
ip_head=$ip_prefix$suffix
redis_password=$(uuidgen)
export ip_head # Exporting for later access by trainer.py
srun --nodes=1 --ntasks=1 -w $node1 ray start --block --head --redis-port=6379 --redis-password=$redis_password & # Starting the head
sleep 5
for (( i=1; i<=$worker_num; i++ ))
do
  node2=${nodes_array[$i]}
  srun --export=ALL --nodes=1 --ntasks=1 -w $node2 ray start --block --address=$ip_head --redis-password=$redis_password & # Starting the workers
  sleep 5
done
python -u trainer.py $redis_password 15 # Pass the total number of allocated CPUs
```
---
Regards,
Yann