Aside from any Slurm configuration, I’d recommend setting up an environment modules [1 or 2] tree for CUDA and other third-party software. That takes care of LD_LIBRARY_PATH and similar variables, reduces the chance of library conflicts, and lets users pick their environment on a per-job basis. Ours includes a basic Miniconda installation, and users can build their own environments from there [3]. I very rarely install a system-wide Python module.
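
As a rough sketch of what that buys you, a cuda/10.2 modulefile only needs to do the equivalent of the exports below. The paths are illustrative for a stock /usr/local/cuda-10.2 install (the library directory may be lib64 or targets/x86_64-linux/lib depending on how NVIDIA's packages laid it out):

# Roughly what "module load cuda/10.2" ends up doing in the job's shell.
# Illustrative only; a real modulefile would use prepend-path rather than export.
export CUDA_HOME=/usr/local/cuda-10.2
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

Users then put "module load cuda/10.2" in their job scripts, so switching CUDA versions becomes a per-job decision rather than a system-wide LD_LIBRARY_PATH edit.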
[1] http://modules.sourceforge.net
[2] https://lmod.readthedocs.io/
[3] https://its.tntech.edu/display/MON/HPC+Sample+Job%3A+Jupyter+Notebook

> On May 5, 2020, at 9:37 AM, Lisa Kay Weihl <lwe...@bgsu.edu> wrote:
>
> Thanks Guy, I did find that there was a jupyterhub_slurmspawner log in my home directory. That enabled me to find out that it could not find the path for batchspawner-singleuser.
>
> So I added this to jupyter_config.py:
> export PATH=/opt/rh/rh-python36/root/bin:$PATH
>
> That now seems to allow the server to launch for the user that I use for all the configuration work. I get errors (see below) but the notebook loads. The problem is I'm not sure how to kill the job in the Slurm queue, or the notebook server, if I finish before the job times out and kills it. Logout doesn't seem to do it.
>
> It still doesn't work for a regular user (see below).
>
> I think my problems all have to do with Slurm/jupyterhub finding python, so I have some questions about the best way to set it up for multiple users and make it work for this.
>
> I use the CentOS distribution so that if the university admins ever have to take over, it will match the RedHat setups they use. I know on all Linux distros you need to leave the python 2 system install alone. It looks like as of CentOS 7.7 there is now a python3 in the repository. I didn't go that route because in the past I installed python from the RedHat Software Collections, which is what I did this time. I don't know if that's the best route for this use case. They also say don't sudo pip3 to install global packages, but does that mean sudo to root and then using pip3 is okay?
>
> When I test and faculty don't give me code, I go to the web and try to find examples. I also wanted to try to test the GPUs from within the notebook. I have 2 examples:
>
> Example 1 uses these modules:
> import numpy as np
> import xgboost as xgb
> from sklearn import datasets
> from sklearn.model_selection import train_test_split
> from sklearn.datasets import dump_svmlight_file
> from sklearn.externals import joblib
> from sklearn.metrics import precision_score
>
> It gives error: cannot load library '/home/csadmin/.local/lib/python3.6/site-packages/librmm.so': libcudart.so.9.2: cannot open shared object file: No such file or directory
>
> libcudart.so is in: /usr/local/cuda-10.2/targets/x86_64-linux/lib
>
> Does this mean I need LD_LIBRARY_PATH set also? CUDA was installed with typical NVIDIA instructions using their repo.
>
> Example 2 uses these modules:
> import numpy as np
> from numba import vectorize
>
> And gives error: NvvmSupportError: libNVVM cannot be found. Do `conda install cudatoolkit`:
> library nvvm not found
>
> I don't have conda installed. Will that interfere with pip3?
>
> Part II - using jupyterhub with a regular user gives a different error
>
> I'm assuming this is a python path issue?
>
> File "/opt/rh/rh-python36/root/bin/batchspawner-singleuser", line 4, in <module>
>     __import__('pkg_resources').require('batchspawner==1.0.0rc0')
> and later
> pkg_resources.DistributionNotFound: The 'batchspawner==1.0.0rc0' distribution was not found and is required by the application
>
> Thanks again for any help, especially if you can help clear up the python configuration.
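
On the conda question above: a per-user environment built from a shared Miniconda install won't interfere with pip3 or the SCL python, because everything lives under the user's own prefix. A minimal sketch, assuming the shared Miniconda sits at /opt/miniconda3 and with package versions chosen purely as an example:

# Per-user environment from a shared Miniconda; the path and versions are illustrative.
source /opt/miniconda3/etc/profile.d/conda.sh
conda create -y -n gpu-test python=3.6 cudatoolkit=10.2 numba ipykernel
conda activate gpu-test
python -m ipykernel install --user --name gpu-test   # expose it as a notebook kernel

Putting cudatoolkit into the environment is also what the numba/libNVVM error above is asking for, since the conda package ships libNVVM.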
> ***************************************************************
> Lisa Weihl, Systems Administrator
> Computer Science, Bowling Green State University
> Tel: (419) 372-0116 | Fax: (419) 372-8061
> lwe...@bgsu.edu
> www.bgsu.edu
>
> From: Guy Coates <guy.coa...@gmail.com>
> Date: Tue, 5 May 2020 09:59:01 +0100
> Subject: Re: [slurm-users] Major newbie - Slurm/jupyterhub
>
> Hi Lisa,
>
> Below is my jupyterhub slurm config. It uses profiles, which allow you to spawn different-sized jobs. The most useful things for debugging are to make sure that the --output option is being honoured (any jupyter python errors will end up there) and to explicitly set the python environment at the start of the script. (The example below uses conda; replace it with whatever makes sense in your environment.)
>
> Hope that helps,
>
> Guy
>
> #Extend timeouts to deal with slow job launch
> c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
> c.Spawner.start_timeout=120
> c.Spawner.term_timeout=20
> c.Spawner.http_timeout = 120
>
> # Set up the various sizes of job
> c.ProfilesSpawner.profiles = [
>     ("Local server: (Run on local machine)", "local",
>      "jupyterhub.spawner.LocalProcessSpawner", {'ip':'0.0.0.0'}),
>     ("Single CPU: (1 CPU, 8GB, 48 hrs)", "cpu1", "batchspawner.SlurmSpawner",
>      dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G ")),
>     ("Single GPU: (1 CPU, 1 GPU, 8GB, 48 hrs)", "gpu1", "batchspawner.SlurmSpawner",
>      dict(req_options=" -n 1 -t 48:00:00 -p normal --mem=8G --gres=gpu:k40:1")),
>     ("Whole Node: (32 CPUs, 128 GB, 48 hrs)", "node1", "batchspawner.SlurmSpawner",
>      dict(req_options=" -n 32 -N 1 -t 48:00:00 -p normal --mem=127000M")),
>     ("Whole GPU Node: (32 CPUs, 2 GPUs, 128GB, 48 hrs)", "gnode1", "batchspawner.SlurmSpawner",
>      dict(req_options=" -n 32 -N 1 -t 48:00:00 -p normal --mem=127000M --gres=gpu:k40:2")),
> ]
>
> #Configure the batch job. Make sure --output is set and explicitly set up
> #the jupyterhub python environment
> c.SlurmSpawner.batch_script = """#!/bin/bash
> #SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
> #SBATCH --job-name=spawner-jupyterhub
> #SBATCH --chdir={homedir}
> #SBATCH --export={keepvars}
> #SBATCH --get-user-env=L
> #SBATCH {options}
> trap 'echo SIGTERM received' TERM
> . /usr/local/jupyterhub/miniconda3/etc/profile.d/conda.sh
> conda activate /usr/local/jupyterhub/jupyterhub
> which jupyterhub-singleuser
> {cmd}
> echo "jupyterhub-singleuser ended gracefully"
> """
>
> On Tue, 5 May 2020 at 01:27, Lisa Kay Weihl <lwe...@bgsu.edu> wrote:
>
> > I have a single server with 2 CPUs, 384 GB memory and 4 GPUs (GeForce RTX 2080 Ti).
> >
> > It is to be used for GPU ML computing and Python-based data science.
> >
> > One faculty member wants Jupyter notebooks; another is used to using CUDA for GPU work but has only done it on a workstation in his lab with a GUI. A new faculty member coming in has used an nvidia-docker container for GPU (I think on a large cluster; we are just getting started).
> >
> > I'm charged with making all this work, and hopefully all at once. Right now I'll take one thing working.
> >
> > So I managed to get Slurm 20.02.1 installed with CUDA 10.2 on CentOS 7 (SELinux enabled). I posted once before about having trouble getting that combination correct, and I finally worked that out. Most of the tests in the test suite seem to run okay. I'm trying to start with a very basic Slurm configuration, so I haven't enabled accounting.
> >
> > For reference, here is my slurm.conf:
> >
> > # slurm.conf file generated by configurator easy.html.
> > # Put this file on all nodes of your cluster.
> > # See the slurm.conf man page for more information.
> > SlurmctldHost=cs-host
> >
> > #authentication
> > AuthType=auth/munge
> > CacheGroups = 0
> > CryptoType=crypto/munge
> >
> > #Add GPU support
> > GresTypes=gpu
> >
> > #MailProg=/bin/mail
> > MpiDefault=none
> > #MpiParams=ports=#-#
> >
> > #service
> > ProctrackType=proctrack/cgroup
> > ReturnToService=1
> > SlurmctldPidFile=/var/run/slurmctld.pid
> > #SlurmctldPort=6817
> > SlurmdPidFile=/var/run/slurmd.pid
> > #SlurmdPort=6818
> > SlurmdSpoolDir=/var/spool/slurmd
> > SlurmUser=slurm
> > #SlurmdUser=root
> > StateSaveLocation=/var/spool/slurmctld
> > SwitchType=switch/none
> > TaskPlugin=task/affinity
> >
> > # TIMERS
> > #KillWait=30
> > #MinJobAge=300
> > #SlurmctldTimeout=120
> > SlurmdTimeout=1800
> >
> > # SCHEDULING
> > SchedulerType=sched/backfill
> > SelectType=select/cons_tres
> > SelectTypeParameters=CR_Core_Memory
> > PriorityType=priority/multifactor
> > PriorityDecayHalfLife=3-0
> > PriorityMaxAge=7-0
> > PriorityFavorSmall=YES
> > PriorityWeightAge=1000
> > PriorityWeightFairshare=0
> > PriorityWeightJobSize=125
> > PriorityWeightPartition=1000
> > PriorityWeightQOS=0
> >
> > # LOGGING AND ACCOUNTING
> > AccountingStorageType=accounting_storage/none
> > ClusterName=cs-host
> > #JobAcctGatherFrequency=30
> > JobAcctGatherType=jobacct_gather/none
> > SlurmctldDebug=info
> > SlurmctldLogFile=/var/log/slurmctld.log
> > #SlurmdDebug=info
> > SlurmdLogFile=/var/log/slurmd.log
> >
> > # COMPUTE NODES
> > NodeName=cs-host CPUs=24 RealMemory=385405 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:4
> >
> > #PARTITIONS
> > PartitionName=DEFAULT Nodes=cs-host Shared=FORCE:1 Default=YES MaxTime=INFINITE State=UP
> > PartitionName=faculty Priority=10 Default=YES
> >
> > I have jupyterhub running as part of the RedHat SCL. It works fine with no integration with Slurm. Now I'm trying to use batchspawner to start a server for the user. Right now I'm just trying one configuration from within jupyterhub_config.py and trying to keep it simple (see below).
> >
> > When I connect I get this error:
> >
> > 500: Internal Server Error
> > Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
> > In the jupyterhub.log:
> >
> > [I 2020-05-04 19:47:58.604 JupyterHub base:707] User logged in: csadmin
> > [I 2020-05-04 19:47:58.606 JupyterHub log:174] 302 POST /hub/login?next= -> /hub/spawn (csadmin@127.0.0.1) 227.13ms
> > [I 2020-05-04 19:47:58.748 JupyterHub batchspawner:248] Spawner submitting job using sudo -E -u csadmin sbatch --parsable
> > [I 2020-05-04 19:47:58.749 JupyterHub batchspawner:249] Spawner submitted script:
> >     #!/bin/bash
> >     #SBATCH --partition=faculty
> >     #SBATCH --time=8:00:00
> >     #SBATCH --output=/home/csadmin/jupyterhub_slurmspawner_%j.log
> >     #SBATCH --job-name=jupyterhub-spawner
> >     #SBATCH --cpus-per-task=1
> >     #SBATCH --chdir=/home/csadmin
> >     #SBATCH --uid=csadmin
> >
> >     env
> >     which jupyterhub-singleuser
> >     batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0
> >
> > [I 2020-05-04 19:47:58.831 JupyterHub batchspawner:252] Job submitted. cmd: sudo -E -u csadmin sbatch --parsable output: 7117
> > [W 2020-05-04 19:47:59.481 JupyterHub batchspawner:377] Job neither pending nor running.
> > [E 2020-05-04 19:47:59.482 JupyterHub user:640] Unhandled error starting csadmin's server: The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
> > [W 2020-05-04 19:47:59.518 JupyterHub web:1782] 500 GET /hub/spawn (127.0.0.1): Error in Authenticator.pre_spawn_start: RuntimeError The Jupyter batch job has disappeared while pending in the queue or died immediately after starting.
> > [E 2020-05-04 19:47:59.521 JupyterHub log:166] {
> >     "X-Forwarded-Host": "localhost:8000",
> >     "X-Forwarded-Proto": "http",
> >     "X-Forwarded-Port": "8000",
> >     "X-Forwarded-For": "127.0.0.1",
> >     "Cookie": "jupyterhub-hub-login=[secret]; _xsrf=[secret]; jupyterhub-session-id=[secret]",
> >     "Accept-Language": "en-US,en;q=0.9",
> >     "Accept-Encoding": "gzip, deflate, br",
> >     "Referer": "http://localhost:8000/hub/login",
> >     "Sec-Fetch-Dest": "document",
> >     "Sec-Fetch-User": "?1",
> >     "Sec-Fetch-Mode": "navigate",
> >     "Sec-Fetch-Site": "same-origin",
> >     "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
> >     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",
> >     "Upgrade-Insecure-Requests": "1",
> >     "Cache-Control": "max-age=0",
> >     "Connection": "close",
> >     "Host": "localhost:8000"
> > }
> > [E 2020-05-04 19:47:59.522 JupyterHub log:174] 500 GET /hub/spawn (csadmin@127.0.0.1) 842.87ms
> > [I 2020-05-04 19:49:05.294 JupyterHub proxy:320] Checking routes
> > [I 2020-05-04 19:54:05.292 JupyterHub proxy:320] Checking routes
> >
> > In the slurmd.log (which I don't see as helpful):
> >
> > [2020-05-04T19:47:58.931] task_p_slurmd_batch_request: 7117
> > [2020-05-04T19:47:58.931] task/affinity: job 7117 CPU input mask for node: 0x000003
> > [2020-05-04T19:47:58.931] task/affinity: job 7117 CPU final HW mask for node: 0x001001
> > [2020-05-04T19:47:58.932] _run_prolog: run job script took usec=473
> > [2020-05-04T19:47:58.932] _run_prolog: prolog with lock for job 7117 ran for 0 seconds
> > [2020-05-04T19:47:58.932] Launching batch job 7117 for UID 1001
> > [2020-05-04T19:47:58.967] [7117.batch] task_p_pre_launch: Using sched_affinity for tasks
> > [2020-05-04T19:47:58.978] [7117.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:32512
> > [2020-05-04T19:47:58.982] [7117.batch] done with job
> >
> > In the jupyterhub_config.py (just the part for batchspawner):
> >
> > c = get_config()
> > c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'
> > # Even though not used, needed to register batchspawner interface
> > import batchspawner
> > c.Spawner.http_timeout = 120
> > c.SlurmSpawner.req_nprocs = '1'
> > c.SlurmSpawner.req_runtime = '8:00:00'
> > c.SlurmSpawner.req_partition = 'faculty'
> > c.SlurmSpawner.req_memory = '128gb'
> > c.SlurmSpawner.start_timeout = 240
> > c.SlurmSpawner.batch_script = '''#!/bin/bash
> > #SBATCH --partition={partition}
> > #SBATCH --time={runtime}
> > #SBATCH --output={homedir}/jupyterhub_slurmspawner_%j.log
> > #SBATCH --job-name=jupyterhub-spawner
> > #SBATCH --cpus-per-task={nprocs}
> > #SBATCH --chdir=/home/{username}
> > #SBATCH --uid={username}
> > env
> > which jupyterhub-singleuser
> > {cmd}
> > '''
> >
> > I will admit that I don't understand all of this completely, as I haven't written a lot of bash scripts. I gather that some of the things in {} are environment variables and others come from within this file, and it seems they must be specifically defined in the batchspawner software somewhere.
> >
> > Is the last piece trying to find the path of jupyterhub-singleuser and then launch it with {cmd}?
> >
> > Feel free to tell me to go read the docs, but be gentle. Because of the request to make ALL of this work ASAP I've been skimming, trying to pick up as much as I can, and then going off examples to try to make this work. I have a feeling that this command: sudo -E -u csadmin sbatch --parsable output: 7117 is what is incorrect and causing the problems. Clearly something isn't starting that should be.
> >
> > If you can shed any light on anything, or point me to any info online that might help, I'd much appreciate it. I'm really beating my head over this one and I know inexperience isn't helping.
> >
> > Once I figure out this simple config, I want to move to the profile setup where I can define several settings and have the user select one.
> >
> > One other basic question: I'm assuming that in Slurm terms my server is considered to have 24 CPUs, counting cores and threads, so any of the Slurm settings that refer to things like CPUs per task could be set up to 24 if a user wanted. Also, in this case the node count will always be 1 since we only have 1 server.
> >
> > Thanks!
> >
> > ***************************************************************
> > Lisa Weihl, Systems Administrator
> > Computer Science, Bowling Green State University
> > Tel: (419) 372-0116 | Fax: (419) 372-8061
> > lwe...@bgsu.edu
> > http://www.bgsu.edu/
>
> --
> Dr. Guy Coates
> +44(0)7801 710224