Aside from any Slurm configuration, I’d recommend setting up an environment modules [1 or 2] tree for CUDA and other third-party software. That takes care of LD_LIBRARY_PATH and similar variables, reduces the chance of library conflicts, and lets users pick their environment on a per-job basis. Ours includes a basic Miniconda installation, and users can build their own environments from there [3]. I very rarely install a system-wide Python module.
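
As a rough sketch of what that buys you, a cuda/10.2 modulefile only needs to do the equivalent of the exports below. The paths are illustrative for a stock /usr/local/cuda-10.2 install (the library directory may be lib64 or targets/x86_64-linux/lib depending on how NVIDIA's packages laid it out):

# Roughly what "module load cuda/10.2" ends up doing in the job's shell.
# Illustrative only; a real modulefile would use prepend-path rather than export.
export CUDA_HOME=/usr/local/cuda-10.2
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

Users then put "module load cuda/10.2" in their job scripts, so switching CUDA versions becomes a per-job decision rather than a system-wide LD_LIBRARY_PATH edit.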
[1] http://modules.sourceforge.net
[2] https://lmod.readthedocs.io/
[3] https://its.tntech.edu/display/MON/HPC+Sample+Job%3A+Jupyter+Notebook

> On May 5, 2020, at 9:37 AM, Lisa Kay Weihl <lwe...@bgsu.edu> wrote:
>
> Thanks Guy, I did find that there was a jupyterhub_slurmspawner log in my home directory. That enabled me to find out that it could not find the path for batchspawner-singleuser.
>
> So I added this to jupyter_config.py:
> export PATH=/opt/rh/rh-python36/root/bin:$PATH
>
> That now seems to allow the server to launch for the user that I use for all the configuration work. I get errors (see below) but the notebook loads. The problem is I'm not sure how to kill the job in the Slurm queue, or the notebook server, if I finish before the job times out and kills it. Logout doesn't seem to do it.
>
> It still doesn't work for a regular user (see below).
>
> I think my problems all have to do with Slurm/jupyterhub finding python, so I have some questions about the best way to set it up for multiple users and make it work for this.
>
> I use the CentOS distribution so that if the university admins ever have to take over, it will match the RedHat setups they use. I know on all Linux distros you need to leave the python 2 system install alone. It looks like as of CentOS 7.7 there is now a python3 in the repository. I didn't go that route because in the past I installed python from the RedHat Software Collections, which is what I did this time. I don't know if that's the best route for this use case. They also say don't sudo pip3 to install global packages, but does that mean sudo to root and then using pip3 is okay?
>
> When I test and faculty don't give me code, I go to the web and try to find examples. I also wanted to try to test the GPUs from within the notebook. I have 2 examples:
>
> Example 1 uses these modules:
> import numpy as np
> import xgboost as xgb
> from sklearn import datasets
> from sklearn.model_selection import train_test_split
> from sklearn.datasets import dump_svmlight_file
> from sklearn.externals import joblib
> from sklearn.metrics import precision_score
>
> It gives error: cannot load library '/home/csadmin/.local/lib/python3.6/site-packages/librmm.so': libcudart.so.9.2: cannot open shared object file: No such file or directory
>
> libcudart.so is in: /usr/local/cuda-10.2/targets/x86_64-linux/lib
>
> Does this mean I need LD_LIBRARY_PATH set also? CUDA was installed with typical NVIDIA instructions using their repo.
>
> Example 2 uses these modules:
> import numpy as np
> from numba import vectorize
>
> And gives error: NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`:
> library nvvm not found
>
> I don't have conda installed. Will that interfere with pip3?
>
> Part II - using jupyterhub with a regular user gives a different error
>
> I'm assuming this is a python path issue?
>
> File "/opt/rh/rh-python36/root/bin/batchspawner-singleuser", line 4, in <module>
>     __import__('pkg_resources').require('batchspawner==1.0.0rc0')
> and later
> pkg_resources.DistributionNotFound: The 'batchspawner==1.0.0rc0' distribution was not found and is required by the application
>
> Thanks again for any help, especially if you can help clear up the python configuration.
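
On the conda question above: a per-user environment built from a shared Miniconda install won't interfere with pip3 or the SCL python, because everything lives under the user's own prefix. A minimal sketch, assuming the shared Miniconda sits at /opt/miniconda3 and with package versions chosen purely as an example:

# Per-user environment from a shared Miniconda; the path and versions are illustrative.
source /opt/miniconda3/etc/profile.d/conda.sh
conda create -y -n gpu-test python=3.6 cudatoolkit=10.2 numba ipykernel
conda activate gpu-test
python -m ipykernel install --user --name gpu-test   # expose it as a notebook kernel

Putting cudatoolkit into the environment is also what the numba/libNVVM error above is asking for, since the conda package ships libNVVM.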
> ***************************************************************
> Lisa Weihl, Systems Administrator
> Computer Science, Bowling Green State University
> Tel: (419) 372-0116 | Fax: (419) 372-8061
> lwe...@bgsu.edu
> www.bgsu.edu
>
> From: Guy Coates <guy.coa...@gmail.com>
> Date: Tue, 5 May 2020 09:59:01 +0100
> Subject: Re: [slurm-users] Major newbie - Slurm/jupyterhub
>
> Hi Lisa,
>
> Below is my jupyterhub slurm config. It uses profiles, which allow you to spawn different-sized jobs. The most useful things for debugging are to make sure that the --output option is being honoured (any jupyter python errors will end up there) and to explicitly set the python environment at the start of the script. (The example below uses conda; replace it with whatever makes sense in your environment.)
>
> Hope that helps,
>
> Guy
>
> #Extend timeouts to deal with slow job launch
> c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
> c.Spawner.start_timeout=120
> c.Spawner.term_timeout=20
> c.Spawner.http_timeout = 120
>
> # Set up the various sizes of job
> c.ProfilesSpawner.profiles = [
>     ("Local server: (Run on local machine)", "local",
>      "jupyterhub.spawner.LocalProcessSpawner", {'ip':'0.0.0.0'}),
>     ("Single CPU: (1 CPU, 8GB, 48 hrs)", "cpu1", "batchspawner.SlurmSpawner",
>      dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G ")),
>     ("Single GPU: (1 CPU, 1 GPU, 8GB, 48 hrs)", "gpu1", "batchspawner.SlurmSpawner",
>      dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G --gres=gpu:k40:1")),
>     ("Whole Node: (32 CPUs, 128 GB, 48 hrs)", "node1", "batchspawner.SlurmSpawner",
>      dict(req_options=" -n 32 -N 1 -t 48:00:00 -p normal --mem=127000M")),
>     ("Whole GPU Node: (32 CPUs, 2 GPUs, 128GB, 48 hrs)", "gnode1", "batchspawner.SlurmSpawner",
>      dict(req_options=" -n 32 -N 1 -t 48:00:00 -p normal --mem=127000M --gres=gpu:k40:2")),
> ]
>
> #Configure the batch job. Make sure --output is set and explicitly set up
> #the jupyterhub python environment
> c.SlurmSpawner.batch_script = """#!/bin/bash
> #SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
> #SBATCH --job-name=spawner-jupyterhub
> #SBATCH --chdir={homedir}
> #SBATCH --export={keepvars}
> #SBATCH --get-user-env=L
> #SBATCH {options}
> trap 'echo SIGTERM received' TERM
> . /usr/local/jupyterhub/miniconda3/etc/profile.d/conda.sh
> conda activate /usr/local/jupyterhub/jupyterhub
> which jupyterhub-singleuser
> {cmd}
> echo "jupyterhub-singleuser ended gracefully"
> """
>
> On Tue, 5 May 2020 at 01:27, Lisa Kay Weihl <lwe...@bgsu.edu> wrote:
>
> > I have a single server with 2 CPUs, 384 GB memory and 4 GPUs (GeForce RTX 2080 Ti).
> >
> > It is to be used for GPU ML computing and Python-based data science.
> >
> > One faculty member wants Jupyter notebooks; another is used to using CUDA for GPU work but has only done it on a workstation in his lab with a GUI. A new faculty member coming in has used an nvidia-docker container for GPU (I think on a large cluster; we are just getting started).
> >
> > I'm charged with making all this work, and hopefully all at once. Right now I'll take one thing working.
> >
> > So I managed to get Slurm 20.02.1 installed with CUDA 10.2 on CentOS 7 (SELinux enabled). I posted once before about having trouble getting that combination correct, and I finally worked that out. Most of the tests in the test suite seem to run okay. I'm trying to start with a very basic Slurm configuration, so I haven't enabled accounting.
> >
> > For reference, here is my slurm.conf:
> >
> > # slurm.conf file generated by configurator easy.html.
> > # Put this file on all nodes of your cluster.
> > # See the slurm.conf man page for more information.
> > SlurmctldHost=cs-host
> >
> > #authentication
> > AuthType=auth/munge
> > CacheGroups = 0
> > CryptoType=crypto/munge
> >
> > #Add GPU support
> > GresTypes=gpu
> >
> > #MailProg=/bin/mail
> > MpiDefault=none
> > #MpiParams=ports=#-#
> >
> > #service
> > ProctrackType=proctrack/cgroup
> > ReturnToService=1
> > SlurmctldPidFile=/var/run/slurmctld.pid
> > #SlurmctldPort=6817
> > SlurmdPidFile=/var/run/slurmd.pid
> > #SlurmdPort=6818
> > SlurmdSpoolDir=/var/spool/slurmd
> > SlurmUser=slurm
> > #SlurmdUser=root
> > StateSaveLocation=/var/spool/slurmctld
> > SwitchType=switch/none
> > TaskPlugin=task/affinity
> >
> > # TIMERS
> > #KillWait=30
> > #MinJobAge=300
> > #SlurmctldTimeout=120
> > SlurmdTimeout=1800
> >
> > # SCHEDULING
> > SchedulerType=sched/backfill
> > SelectType=select/cons_tres
> > SelectTypeParameters=CR_Core_Memory
> > PriorityType=priority/multifactor
> > PriorityDecayHalfLife=3-0
> > PriorityMaxAge=7-0
> > PriorityFavorSmall=YES
> > PriorityWeightAge=1000
> > PriorityWeightFairshare=0
> > PriorityWeightJobSize=125
> > PriorityWeightPartition=1000
> > PriorityWeightQOS=0
> >
> > # LOGGING AND ACCOUNTING
> > AccountingStorageType=accounting_storage/none
> > ClusterName=cs-host
> > #JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/none
> > SlurmctldDebug=info
> > SlurmctldLogFile=/var/log/slurmctld.log
> > #SlurmdDebug=info
> > SlurmdLogFile=/var/log/slurmd.log
> >
> > # COMPUTE NODES
> > NodeName=cs-host CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
> >
> > #PARTITIONS
> > PartitionName=DEFAULT Nodes=cs-host Shared=FORCE:1 Default=YES MaxTime=INFINITE State=UP
> > PartitionName=faculty Priority=10 Default=YES
> >
> > I have jupyterhub running as part of the RedHat SCL. It works fine with no integration with Slurm. Now I'm trying to use batchspawner to start a server for the user. Right now I'm just trying one configuration from within jupyterhub_config.py and trying to keep it simple (see below).
> >
> > When I connect I get this error:
> >
> > 500: Internal Server Error
> > Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
> > In the jupyterhub.log:
> >
> > [I 2020-05-04 19:47:58.604 JupyterHub base:707] User logged in: csadmin
> > [I 2020-05-04 19:47:58.606 JupyterHub log:174] 302 POST /hub/login?next= -> /hub/spawn (csadmin@127.0.0.1) 227.13ms
> > [I 2020-05-04 19:47:58.748 JupyterHub batchspawner:248] Spawner submitting job using sudo -E -u csadmin sbatch --parsable
> > [I 2020-05-04 19:47:58.749 JupyterHub batchspawner:249] Spawner submitted script:
> >     #!/bin/bash
> >     #SBATCH --partition=faculty
> >     #SBATCH --time=8:00:00
> >     #SBATCH --output=/home/csadmin/jupyterhub_slurmspawner_%j.log
> >     #SBATCH --job-name=jupyterhub-spawner
> >     #SBATCH --cpus-per-task=1
> >     #SBATCH --chdir=/home/csadmin
> >     #SBATCH --uid=csadmin
> >
> >     env
> >     which jupyterhub-singleuser
> >     batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0
> >
> > [I 2020-05-04 19:47:58.831 JupyterHub batchspawner:252] Job submitted. cmd: sudo -E -u csadmin sbatch --parsable output: 7117
> > [W 2020-05-04 19:47:59.481 JupyterHub batchspawner:377] Job neither pending nor running.
> > [E 2020-05-04 19:47:59.482 JupyterHub user:640] Unhandled error starting csadmin's server: The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
> > [W 2020-05-04 19:47:59.518 JupyterHub web:1782] 500 GET /hub/spawn (127.0.0.1): Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
> > [E 2020-05-04 19:47:59.521 JupyterHub log:166] {
> >     "X-Forwarded-Host": "localhost:8000",
> >     "X-Forwarded-Proto": "http",
> >     "X-Forwarded-Port": "8000",
> >     "X-Forwarded-For": "127.0.0.1",
> >     "Cookie": "jupyterhub-hub-login=[secret]; _xsrf=[secret]; jupyterhub-session-id=[secret]",
> >     "Accept-Language": "en-US,en;q=0.9",
> >     "Accept-Encoding": "gzip, deflate, br",
> >     "Referer": "http://localhost:8000/hub/login",
> >     "Sec-Fetch-Dest": "document",
> >     "Sec-Fetch-User": "?1",
> >     "Sec-Fetch-Mode": "navigate",
> >     "Sec-Fetch-Site": "same-origin",
> >     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
> >     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",
> >     "Upgrade-Insecure-Requests": "1",
> >     "Cache-Control": "max-age=0",
> >     "Connection": "close",
> >     "Host": "localhost:8000"
> > }
> > [E 2020-05-04 19:47:59.522 JupyterHub log:174] 500 GET /hub/spawn (csadmin@127.0.0.1) 842.87ms
> > [I 2020-05-04 19:49:05.294 JupyterHub proxy:320] Checking routes
> > [I 2020-05-04 19:54:05.292 JupyterHub proxy:320] Checking routes
> >
> > In the slurmd.log (which I don't see as helpful):
> >
> > [2020-05-04T19:47:58.931] task_p_slurmd_batch_request: 7117
> > [2020-05-04T19:47:58.931] task/affinity: job 7117 CPU input mask for node: 0x000003
> > [2020-05-04T19:47:58.931] task/affinity: job 7117 CPU final HW mask for node: 0x001001
> > [2020-05-04T19:47:58.932] _run_prolog: run job script took usec=473
> > [2020-05-04T19:47:58.932] _run_prolog: prolog with lock for job 7117 ran for 0 seconds
> > [2020-05-04T19:47:58.932] Launching batch job 7117 for UID 1001
> > [2020-05-04T19:47:58.967] [7117.batch] task_p_pre_launch: Using sched_affinity for tasks
> > [2020-05-04T19:47:58.978] [7117.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:32512
> > [2020-05-04T19:47:58.982] [7117.batch] done with job
> >
> > In the jupyterhub_config.py (just the part for batchspawner):
> >
> > c = get_config()
> > c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
> > # Even though not used, needed to register batchspawner interface
> > import batchspawner
> > c.Spawner.http_timeout = 120
> > c.SlurmSpawner.req_nprocs = '1'
> > c.SlurmSpawner.req_runtime = '8:00:00'
> > c.SlurmSpawner.req_partition = 'faculty'
> > c.SlurmSpawner.req_memory = '128gb'
> > c.SlurmSpawner.start_timeout = 240
> > c.SlurmSpawner.batch_script = '''#!/bin/bash
> > #SBATCH --partition={partition}
> > #SBATCH --time={runtime}
> > #SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
> > #SBATCH --job-name=jupyterhub-spawner
> > #SBATCH --cpus-per-task={nprocs}
> > #SBATCH --chdir=/home/{username}
> > #SBATCH --uid={username}
> > env
> > which jupyterhub-singleuser
> > {cmd}
> > '''
> >
> > I will admit that I don't understand all of this completely, as I haven't written a lot of bash scripts. I gather that some of the things in {} are environment variables and others come from within this file, and it seems they must be specifically defined in the batchspawner software somewhere.
> >
> > Is the last piece trying to find the path of jupyterhub-singleuser and then launch it with {cmd}?
> >
> > Feel free to tell me to go read the docs, but be gentle. Because of the request to make ALL of this work ASAP I've been skimming, trying to pick up as much as I can, and then going off examples to try to make this work. I have a feeling that this command: sudo -E -u csadmin sbatch --parsable output: 7117 is what is incorrect and causing the problems. Clearly something isn't starting that should be.
> >
> > If you can shed any light on anything, or point me to any info online that might help, I'd much appreciate it. I'm really beating my head over this one and I know inexperience isn't helping.
> >
> > Once I figure out this simple config, I want to move to the profile setup where I can define several settings and have the user select one.
> >
> > One other basic question: I'm assuming that in Slurm terms my server is considered to have 24 CPUs, counting cores and threads, so any of the Slurm settings that refer to things like CPUs per task could be set up to 24 if a user wanted. Also, in this case the node count will always be 1 since we only have 1 server.
> >
> > Thanks!
> >
> > ***************************************************************
> > Lisa Weihl, Systems Administrator
> > Computer Science, Bowling Green State University
> > Tel: (419) 372-0116 | Fax: (419) 372-8061
> > lwe...@bgsu.edu
> > http://www.bgsu.edu/
>
> --
> Dr. Guy Coates
> +44(0)7801 710224