Re: [slurm-users] Upgrading slurm - can I do it while jobs running?

2021-05-26 Thread Will Dennis
On Wednesday, May 26, 2021 at 2:49 PM Ole Holm Nielsen said: > I strongly recommend reading the SchedMD presentations on the [snipped] page, especially the "Field Notes" documents. The latest one is "Field Notes 4: From The Frontlines of Slurm Support", Jason Booth, SchedMD. Yes, thanks fo

Re: [slurm-users] Upgrading slurm - can I do it while jobs running?

2021-05-26 Thread Will Dennis
Yup, in our case, it would be 20.11.5 -> 20.11.7. From: slurm-users on behalf of Paul Edmon Date: Wednesday, May 26, 2021 at 2:59 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Upgrading slurm - can I do it while jobs running? We generally pause scheduling during upgrades out

[slurm-users] Upgrading slurm - can I do it while jobs running?

2021-05-26 Thread Will Dennis
Hi all, About to embark on my first Slurm upgrade (building from source now, into a versioned path /opt/slurm/<version>/ which is then symlinked to /opt/slurm/current/ for the “in-use” one…) This is a new cluster, running 20.11.5 (which we now know has a CVE that was fixed in 20.11.7) but I have resear
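A minimal sketch of the versioned-install-plus-symlink layout described above (the version directory and make options are assumptions):

    # build into a versioned prefix, then flip the "current" symlink at upgrade time
    ./configure --prefix=/opt/slurm/20.11.7
    make -j4 && sudo make install
    sudo ln -sfn /opt/slurm/20.11.7 /opt/slurm/current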

Re: [slurm-users] Configless mode enabling issue

2021-05-07 Thread Will Dennis
Thank you for the reply, Will! The slurm.conf file only has one line in it: AutoDetect=nvml During my debug, I copied this file from the GPU node to the controller. But, that's when I noticed that the node w/o a GPU then crashed on startup. David On Fri, May 7, 2021 at 12:14 PM Will D

Re: [slurm-users] Configless mode enabling issue

2021-05-07 Thread Will Dennis
Hi David, What is in the gres.conf in the controller’s /etc/slurm? Is it autodetect via nvml? In configless mode, slurm.conf, gres.conf, etc. are maintained only on the controller, and the worker nodes get them from there automatically (you don’t want those files on the worker nodes.) If you need to s
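A minimal configless sketch, assuming a controller hostname of ctl-host; the config files live only on the controller and each slurmd fetches them at startup:

    # controller's /etc/slurm/slurm.conf (excerpt)
    SlurmctldHost=ctl-host
    SlurmctldParameters=enable_configless

    # controller's /etc/slurm/gres.conf
    AutoDetect=nvml

    # on each worker node, point slurmd at the controller instead of shipping config files
    slurmd --conf-server ctl-host

Note that a global AutoDetect=nvml applies to every node that reads this gres.conf, so nodes without GPUs may need their own NodeName-scoped gres.conf lines, which is the kind of mismatch this thread runs into.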

Re: [slurm-users] Staging data on the nodes one will be processing on via sbatch

2021-04-03 Thread Will Dennis
Sorry, obvs wasn’t ready to send that last message yet… Our issue is the shared storage is via NFS, and the “fast storage in limited supply” is only local on each node. Hence the need to copy it over from NFS (and then remove it when finished with it.) I also wanted the copy & remove to be diff

Re: [slurm-users] Staging data on the nodes one will be processing on via sbatch

2021-04-03 Thread Will Dennis
If you've got other fast storage in limited supply that can be used for data that can be staged, then by all means use it, but consider whether you want batch cpu cores tied up with the wall time of transferring the data. This could easily be done on a time-shared frontend login node from which

Re: [slurm-users] Staging data on the nodes one will be processing on via sbatch

2021-04-03 Thread Will Dennis
What I mean by “scratch” space is indeed local persistent storage in our case; sorry if my use of “scratch space” is already a generally-known Slurm concept I don’t understand, or something like /tmp… That’s why my desired workflow is to “copy data locally / use data from copy / remove local cop

[slurm-users] Staging data on the nodes one will be processing on via sbatch

2021-04-03 Thread Will Dennis
Hi all, We have various NFS servers that contain the data that our researchers want to process. These are mounted on our Slurm clusters on well-known paths. Also, the nodes have local fast scratch disk on another well-known path. We do not have any distributed file systems in use (Our Slurm clu
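A sketch of the stage-in / process / stage-out pattern described above, with hypothetical paths and a hypothetical processing command:

    #!/bin/bash
    #SBATCH --job-name=stage-demo
    #SBATCH --ntasks=1

    # hypothetical locations: NFS-mounted source and node-local fast scratch
    SRC=/nfs/projects/mydata
    SCRATCH=/mnt/local/$SLURM_JOB_ID

    mkdir -p "$SCRATCH"
    cp -a "$SRC" "$SCRATCH"/           # stage in: copy from NFS to local disk

    ./process_data "$SCRATCH/mydata"   # hypothetical processing step

    rm -rf "$SCRATCH"                  # clean up: remove the local copy when finished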

Re: [slurm-users] Upgrade from Ubuntu 18.04 to 20.04

2020-03-16 Thread Will Dennis
Hi Stefan, I have not been able to find any 18.08.x PPAs; I myself have backported the latest Debian HPC Team release of 19.05.5 into my PPA - https://launchpad.net/~wdennis/+archive/ubuntu/dhpc-backports I have also created local packages of 18.08.6.2, but only for Ubuntu 16.04, for my own us

[slurm-users] Upgrade paths

2020-03-11 Thread Will Dennis
Hi all, I have one cluster running v16.05.4 that I would like to upgrade if possible to 19.05.5; it was installed via a .deb package I created back in 2016. I have located a 17.11.7 Ubuntu PPA (https://launchpad.net/~jonathonf/+archive/ubuntu/slurm) and have myself recently put up one for 19.0

[slurm-users] Nodes going into drain because of "Kill task failed"

2019-10-22 Thread Will Dennis
Hi all, I have a number of nodes on one of my 17.11.7 clusters in drain mode on account of reason: "Kill task failed" I see the following in slurmd.log — [2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15 CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT *** [2019-10-17T2
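The slurm.conf knob usually associated with this symptom is the unkillable-step timeout; the excerpt below is a hedged illustration (the value is invented, and it is not necessarily the resolution reached in this thread), followed by the command to return a drained node to service:

    # slurm.conf (excerpt): give slurmd longer to reap job processes before declaring them unkillable
    UnkillableStepTimeout=180

    # once the underlying cause is addressed, clear the drain on the affected node
    scontrol update nodename=server15 state=resume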

[slurm-users] Statistics on node utilization?

2019-10-16 Thread Will Dennis
Hi all, We run a few Slurm clusters here, all using SlurmDBD to store job history info. I also utilize Open XDMoD (http://open.xdmod.org/) to run statistics on the jobs. However, it seems that XDMoD does not provide node utilization statistics, unless my XDMoD isn’t configured somehow to do tha
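For utilization numbers computed directly from the slurmdbd data, sreport may already cover part of this; a sketch (the date range is an assumption):

    # overall cluster utilization (allocated/idle/down) for one month, reported as percentages
    sreport cluster utilization start=2019-09-01 end=2019-10-01 -t percent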

[slurm-users] Gracefully shutting down cluster

2019-10-03 Thread Will Dennis
Hi all, I want to be able to gracefully shut down Slurm and then the node itself with a command that affects the entire cluster. It is my current understanding that I can set the “RebootProgram” param in slurm.conf to be a command, and then trigger the shutdown via “scontrol reboot_nodes” which
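A sketch of the RebootProgram approach mentioned above, using the scontrol reboot syntax of recent Slurm releases; the program path and timeout are assumptions:

    # slurm.conf (excerpt)
    RebootProgram=/sbin/reboot
    ResumeTimeout=600        # how long slurmctld waits for a rebooted node to come back

    # ask every node to reboot as soon as its running jobs have finished
    scontrol reboot ASAP ALL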

Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state

2019-07-17 Thread Will Dennis
slurm-users-boun...@lists.schedmd.com] On Behalf Of Will Dennis Sent: Wednesday, July 17, 2019 12:56 PM To: Slurm User Community List Subject: Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state Not thinking that the server (which runs both the Slurm controller daemon and the DB) is the issue

Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state

2019-07-17 Thread Will Dennis
staying in "RUNNING" state On 7/17/19 12:26 AM, Chris Samuel wrote: > On 16/7/19 11:43 am, Will Dennis wrote: > >> [2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full >> (20140), discarding DBD_STEP_START:1442 request > > So it looks like your slurmd

Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state

2019-07-16 Thread Will Dennis
"sacctmgr show runaway" was nil. A few minutes later however, "sacctmgr show runaway" had entries again. If someone knows what else I might try to isolate/resolve this issue, please kindly assist... From: Will Dennis Sent: Tuesday, July 16, 2019 2:43 PM To: slurm-users@lists.schedmd.

[slurm-users] sacct issue: jobs staying in "RUNNING" state

2019-07-16 Thread Will Dennis
Hi all, Was looking at the running jobs on one group's cluster, and saw there was an insane number of "running" jobs when I did a sacct -X -s R; then looked at output of squeue, and found a much more reasonable number... root@slurm-controller1:/ # sacct -X -p -s R | wc -l 8895 root@slurm-contro
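A quick way to compare the two views and surface the orphaned records; the first two commands mirror the ones in the post, and sacctmgr's runaway-job listing (used later in this thread) offers to fix the stale entries:

    # controller's view vs. the accounting database's view
    squeue --state=RUNNING | wc -l
    sacct -X -P -s R | wc -l

    # list jobs slurmdbd still believes are running; sacctmgr prompts to fix them
    sacctmgr show runawayjobs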

Re: [slurm-users] Slurm database error messages (redux)

2019-05-09 Thread Will Dennis
understand how to fix this? -----Original Message----- From: Will Dennis Sent: Tuesday, May 07, 2019 11:01 AM To: slurm-users@lists.schedmd.com Subject: Slurm database error messages (redux) Hi all, We had to restart the slurmdbd service on one of our clusters running Slurm 17.11.7 yesterd

[slurm-users] Slurm database error messages (redux)

2019-05-07 Thread Will Dennis
Hi all, We had to restart the slurmdbd service on one of our clusters running Slurm 17.11.7 yesterday, since folks were experiencing errors with job scheduling, and running 'sacct': - $ sacct -X -p -o jobid,jobname,user,partition%-30,nodelist,alloccpus,reqmem,cputime,qos,state,exitcode,All

Re: [slurm-users] Can one specify attributes on a GRES resource?

2019-03-22 Thread Will Dennis
"PRIu64")", context_ptr->gres_type, gres_data->gres_cnt_found, gres_data->gres_cnt_config); } rc = EINVAL; } Where the "gres_cnt_found" value is b

Re: [slurm-users] Can one specify attributes on a GRES resource?

2019-03-21 Thread Will Dennis
Re: [slurm-users] Can one specify attributes on a GRES resource? On 21/3/19 7:39 pm, Will Dennis wrote: > Why does it think that the "gres/gpu_mem_per_card" count is 0? How can I fix this? Did you remember to distribute gres.conf as well to the nodes? -- Chris Samuel : http:/

Re: [slurm-users] Can one specify attributes on a GRES resource?

2019-03-21 Thread Will Dennis
I tried doing this as follows: Node's gres.conf: ## # Slurm's Generic Resource (GRES) configuration file ## Name=gpu File=/dev/nvidia0 Type=1050TI Name=gpu_mem_per_card C

[slurm-users] Can one specify attributes on a GRES resource?

2019-03-15 Thread Will Dennis
Hi all, I currently have features specified on my GPU-equipped nodes as follows: GPUMODEL_1050TI,GPUCHIP_GP107,GPUARCH_PASCAL,GPUMEM_4GB,GPUCUDACORES_768 or GPUMODEL_TITANV,GPUCHIP_GV100,GPUARCH_VOLTA,GPUMEM_12GB,GPUCUDACORES_5120,GPUTENSORCORES_640 The "GPUMEM" and "GPU[CUDA|TENSOR]CORES" tags a
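Jobs can already select nodes by the features listed above via --constraint; a typed GPU GRES request is the other common route. A hedged sketch (the TITANV type assumes a matching Type= entry in gres.conf, and job.sh is a placeholder):

    # select nodes by advertised features (names taken from the post)
    sbatch --constraint="GPUARCH_VOLTA&GPUMEM_12GB" job.sh

    # or request a GPU of a specific type as a GRES
    sbatch --gres=gpu:TITANV:1 job.sh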

[slurm-users] Fairshare - root user

2019-02-27 Thread Will Dennis
Looking at the output of 'sshare', I see: root@myserver:~# sshare -l Account User RawShares NormShares RawUsage NormUsage EffectvUsage FairShare -- -- --- --- --- - -- root

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
Yes, we've thought about using FS-Cache, but it doesn't help on the first read-in, and the cache eviction may affect subsequent read attempts... (different people are using different data sets, and the cache will probably not hold all of them at the same time...) On Friday, February 22, 2019 2

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
(replies inline) On Friday, February 22, 2019 1:03 PM, Alex Chekholko said: >Hi Will, > >If your bottleneck is now your network, you may want to upgrade the network. >Then the disks will become your bottleneck :) > Via network bandwidth analysis, it's not really network that's the problem...

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
Thanks for the reply, Ray. For one of my groups, on the GPU servers in their cluster, I have provided a RAID-0 md array of multi-TB SSDs (for I/O speed) mounted on a given path ("/mnt/local" for historical reasons) that they can use for local scratch space. Their other servers in the cluster ha

[slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Will Dennis
Hi folks, Not directly Slurm-related, but... We have a couple of research groups that have large data sets they are processing via Slurm jobs (deep-learning applications) and are presently consuming the data via NFS mounts (both groups have 10G ethernet interconnects between the Slurm nodes and

Re: [slurm-users] SLURM docs: HTML title should be same as page title

2019-02-22 Thread Will Dennis
Yes! I always have E_WAYTOOMANY tabs open on my Chrome browser, and using "TooManyTabs" plugin and searching for "Slurm" I see a whole bunch of "Slurm Workload Manager" entries, then have to guess which one is what page... -Original Message- From: slurm-users [mailto:slurm-users-boun...@

Re: [slurm-users] Defining new Gres types on nodes

2018-09-24 Thread Will Dennis
On Mon, Sep 24, 2018 at 3:53 PM "Eli V" wrote: > I'm not using the :no_consume syntax, simply Gres=name:#,y:z,... Of course after changes copy gres & slurm.conf to all nodes and scontrol reconfigure works great for me. We are using ":no_consume" because we don't care how Slurm processes use/shar

[slurm-users] Defining new Gres types on nodes

2018-09-24 Thread Will Dennis
Hi all, We want to add in some Gres resource types pertaining to GPUs (amount of GPU memory and CUDA cores) on some of our nodes. So we added the following params into the 'gres.conf' on the nodes that have GPUs: Name=gpu_mem Count=<#>G Name=gpu_cores Count=<#> And in slurm.conf: GresTypes=g
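A hedged sketch of the two files involved, with counts invented for illustration; each GRES name has to appear both in GresTypes and in the node's Gres= list in slurm.conf:

    # gres.conf on a GPU node (counts are illustrative)
    Name=gpu_mem   Count=8G
    Name=gpu_cores Count=2560

    # slurm.conf (node name and counts are illustrative; keep the usual CPUs=/RealMemory= settings on the line)
    GresTypes=gpu,gpu_mem,gpu_cores
    NodeName=gpunode01 Gres=gpu:1,gpu_mem:no_consume:8G,gpu_cores:no_consume:2560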

[slurm-users] Spec-ing a Slurm DB server

2018-07-19 Thread Will Dennis
require?) 2) How to port over the existing Slurm DBD database to the newer server? Pointers to existing docs that answer these questions gratefully accepted (I looked, but didn't find any that addressed my concerns.) Thanks! Will Dennis NEC Laboratories America
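On question 2, a common approach (an assumption here, not something confirmed in this thread) is a dump-and-restore of the accounting database; slurm_acct_db is the default database name:

    # on the old database host, with slurmdbd stopped
    mysqldump --single-transaction slurm_acct_db > slurm_acct_db.sql

    # on the new host, after creating an empty slurm_acct_db and granting the slurm user access
    mysql slurm_acct_db < slurm_acct_db.sql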

Re: [slurm-users] Controller / backup controller q's

2018-05-25 Thread Will Dennis
On Friday, May 25, 2018 5:31 AM, Pär Lindfors wrote: > Time to start upgrading to Ubuntu 18.04 now then? :-) Not yet time for us... There are problems with U18.04 that render it unusable for our environment. > For a 10 node cluster it might make more sense to run slurmctld and slurmdbd > on the

Re: [slurm-users] Controller / backup controller q's

2018-05-25 Thread Will Dennis
This is a classic case in point. Forgive me if I have misunderstood your setup. On 25 May 2018 at 11:30, Pär Lindfors <pa...@nsc.liu.se> wrote: Hi Will, On 05/24/2018 05:43 PM, Will Dennis wrote: > (we were using CentOS 7.x originally, now the compute nodes ar

[slurm-users] Controller / backup controller q's

2018-05-24 Thread Will Dennis
Hi all, We are building out a new Slurm cluster for a research group here; unfortunately this has taken place over a long period of time, and there have been some architectural changes made in the middle, most importantly the host OS on the Slurm nodes (we were using CentOS 7.x originally, now the

Re: [slurm-users] After Each slurm Run, I Need to Reinstall slurm

2018-05-05 Thread Will Dennis
A few thoughts… 1) I am not sure Slurm can run “all-in-one” with controller/worker/acctg-db all on one host… If anyone else knows whether this is doable, please chime in (I actually have a request to do this for a single machine at work, where the researchers want to have many folks share a single GP

Re: [slurm-users] Finding / compiling "pam_slurm.so" for Ubuntu 16.04

2018-05-04 Thread Will Dennis
Yes! That was it. I needed to install ‘libpam0g-dev’ (pkg description: Development files for PAM) Then after running “./configure, make, make contrib” again – pkgbuilder@mlbuild02:~/test-build/slurm-16.05.4$ find . -name "pam_slurm.so" -print ./contribs/pam/.libs/pam_slurm.so pkgbuilder@mlbuil
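Collected from the post above, the working sequence on Ubuntu 16.04 is:

    sudo apt-get install libpam0g-dev   # PAM development headers, the missing prerequisite

    ./configure
    make
    make contrib                        # builds contribs/pam/.libs/pam_slurm.so
    find . -name pam_slurm.so -print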

Re: [slurm-users] Finding / compiling "pam_slurm.so" for Ubuntu 16.04

2018-05-04 Thread Will Dennis
That’s what I’m having a problem with – how to do this? (I don’t build software often, so not a pro at this...) My contrib/pam folder contains: pkgbuilder@mlbuild02:~/test-build/slurm-16.05.4/contribs/pam$ ls -la total 92 drwxr-xr-x 3 pkgbuilder pkgbuilder 4096 May 4 14:11 . drwxr-xr-x 19 pkg

Re: [slurm-users] Finding / compiling "pam_slurm.so" for Ubuntu 16.04

2018-05-04 Thread Will Dennis
I just tried unpacking the original archive, and running “./configure, make, make contrib” but no luck – still no ‘pam_slurm.so’ file created... What am I missing here? From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Will Dennis Sent: Friday, May 04, 2018 2:50 PM

Re: [slurm-users] Finding / compiling "pam_slurm.so" for Ubuntu 16.04

2018-05-04 Thread Will Dennis
make: Nothing to be done for 'all'.”) From: Will Dennis Sent: Thursday, May 03, 2018 11:07 PM To: slurm-users@lists.schedmd.com Subject: Finding / compiling "pam_slurm.so" for Ubuntu 16.04 Hello everyone, Back a year ago or so, I started a new SLURM cluster, and had produ

[slurm-users] Finding / compiling "pam_slurm.so" for Ubuntu 16.04

2018-05-03 Thread Will Dennis
, how to compile it? Thanks, Will Dennis Sr. Systems Administrator, NEC Laboratories America