On Wednesday, May 26, 2021 at 2:49 PM Ole Holm Nielsen said:
> I strongly recommend reading the SchedMD presentations on the
> [snipped] page, especially the "Field
> notes" documents. The latest one is "Field Notes 4: From The Frontlines
> of Slurm Support", Jason Booth, SchedMD.
Yes, thanks fo
Yup, in our case, it would be 20.11.5 -> 20.11.7.
From: slurm-users on behalf of Paul
Edmon
Date: Wednesday, May 26, 2021 at 2:59 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Upgrading slurm - can I do it while jobs running?
We generally pause scheduling during upgrades out
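A sketch of what pausing scheduling can look like in practice; the reservation
name and duration here are placeholders, not anything from Paul's actual setup:
scontrol create reservation reservationname=upgrade_maint starttime=now \
    duration=120 users=root flags=maint,ignore_jobs nodes=ALL
# ...upgrade slurmdbd, then slurmctld, then the slurmds...
scontrol delete reservationname=upgrade_maint
Running jobs continue; the maintenance reservation just keeps new jobs from
starting while the daemons are restarted.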
Hi all,
About to embark on my first Slurm upgrade (building from source now, into a
versioned path /opt/slurm/<version>/ which is then symlinked to
/opt/slurm/current/ for the “in-use” one…) This is a new cluster, running
20.11.5 (which we now know has a CVE that was fixed in 20.11.7) but I have
resear
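For anyone following along, the versioned-path layout I mean looks roughly
like this; the prefix and version numbers are just examples:
./configure --prefix=/opt/slurm/20.11.7 --sysconfdir=/etc/slurm
make -j8 && sudo make install
sudo ln -sfn /opt/slurm/20.11.7 /opt/slurm/current   # then restart slurmctld/slurmd
Everything references /opt/slurm/current, so rolling back is just re-pointing
the symlink at the previous version's directory.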
Thank you for the reply, Will!
The slurm.conf file only has one line in it:
AutoDetect=nvml
During my debug, I copied this file from the GPU node to the controller. But,
that's when I noticed that the node w/o a GPU then crashed on startup.
David
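Follow-up for the archives: AutoDetect is a gres.conf directive, so I'm
assuming that single line actually lives in gres.conf. One way to keep a
non-GPU node from choking on it is to scope the gres.conf entries by NodeName
and list the GPU devices explicitly, e.g. with made-up node names:
NodeName=gpunode[01-04] Name=gpu File=/dev/nvidia0
Nodes not matching any NodeName line then get no GRES config at all; newer
releases also allow AutoDetect to be set per NodeName line, per the gres.conf
man page for your version.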
On Fri, May 7, 2021 at 12:14 PM Will D
Hi David,
What is the gres.conf on the controller’s /etc/slurm ? Is it autodetect via
nvml?
In configless mode, slurm.conf, gres.conf, etc. are maintained only on the
controller, and the worker nodes get them from there automatically (you don’t
want those files on the worker nodes.) If you need to s
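A minimal configless sketch, in case it helps; the hostname is hypothetical
and the options file location depends on your packaging:
# on the controller, in slurm.conf:
SlurmctldParameters=enable_configless
# on each worker, no slurm.conf/gres.conf; instead point slurmd at the
# controller, e.g. in /etc/sysconfig/slurmd or /etc/default/slurmd:
SLURMD_OPTIONS="--conf-server slurmctl01:6817"
The nodes pull slurm.conf, gres.conf, etc. from slurmctld at startup and cache
them locally.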
Sorry, obvs wasn’t ready to send that last message yet…
Our issue is the shared storage is via NFS, and the “fast storage in limited
supply” is only local on each node. Hence the need to copy it over from NFS
(and then remove it when finished with it.)
I also wanted the copy & remove to be diff
If you've got other fast storage in limited supply that can be used for data
that can be staged, then by all means use it, but consider whether you want
batch cpu cores tied up with the wall time of transferring the data. This could
easily be done on a time-shared frontend login node from which
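One way to keep the transfer off the batch cores, sketched with hypothetical
partition and script names: put the copy in its own small job (on a transfer
partition, or run it outside Slurm on the login node) and chain the compute
job behind it with a dependency:
xfer_id=$(sbatch --parsable --partition=xfer --ntasks=1 stage_in.sh)
sbatch --dependency=afterok:${xfer_id} --partition=gpu train.sh
The data movement then ties up at most one task slot instead of the compute
job's whole allocation.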
What I mean by “scratch” space is indeed local persistent storage in our case;
sorry if my use of “scratch space” is already a generally-known Slurm concept I
don’t understand, or something like /tmp… That’s why my desired workflow is to
“copy data locally / use data from copy / remove local cop
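For concreteness, the shape of job script I have in mind; the partition name,
scratch mount point, dataset path, and script name are all made up:
#!/bin/bash
#SBATCH --partition=gpu --gres=gpu:1
SCRATCH=/mnt/local/$SLURM_JOB_ID
mkdir -p "$SCRATCH"
trap 'rm -rf "$SCRATCH"' EXIT                      # remove local copy when done
cp -a /nfs/projects/dataset1 "$SCRATCH"/           # copy data locally
srun python train.py --data "$SCRATCH"/dataset1    # use data from the copy
A Slurm epilog could do the same cleanup as a belt-and-suspenders measure in
case a job is killed hard.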
Hi all,
We have various NFS servers that contain the data that our researchers want to
process. These are mounted on our Slurm clusters on well-known paths. Also, the
nodes have local fast scratch disk on another well-known path. We do not have
any distributed file systems in use (Our Slurm clu
Hi Stefan,
I have not been able to find any 18.08.x PPAs; I myself have backported the
latest Debian HPC Team release of 19.05.5 into my PPA -
https://launchpad.net/~wdennis/+archive/ubuntu/dhpc-backports
I have also created local packages of 18.08.6.2, but only for Ubuntu 16.04, for
my own us
Hi all,
I have one cluster running v16.05.4 that I would like to upgrade if possible to
19.05.5; it was installed via a .deb package I created back in 2016. I have
located a 17.11.7 Ubuntu PPA
(https://launchpad.net/~jonathonf/+archive/ubuntu/slurm) and have myself
recently put up one for 19.0
Hi all,
I have a number of nodes on one of my 17.11.7 clusters in drain mode on account
of reason: "Kill task failed”
I see the following in slurmd.log —
[2019-10-17T20:06:43.027] [34443.0] error: *** STEP 34443.0 ON server15
CANCELLED AT 2019-10-17T20:06:43 DUE TO TIME LIMIT ***
[2019-10-17T2
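In case it helps others who land here, two things commonly suggested for this
are raising UnkillableStepTimeout in slurm.conf when the processes are merely
slow to die (e.g. stuck in NFS I/O), and then returning the drained nodes with:
scontrol update nodename=server15 state=resume
server15 being the node from the log excerpt above.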
Hi all,
We run a few Slurm clusters here, all using SlurmDBD to store job history info.
I also utilize Open XDMoD (http://open.xdmod.org/) to run statistics on the
jobs. However, it seems that XDMoD does not provide node utilization
statistics, unless my XDMoD isn’t configured somehow to do tha
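As a stopgap, Slurm's own reporting gives cluster-level utilization, e.g.
something like:
sreport -t percent cluster utilization start=2019-01-01 end=2019-02-01
though that's per-cluster allocated/idle/down percentages, not the per-node
breakdown I'm after. The date range above is just an example.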
Hi all,
I want to be able to gracefully shut down Slurm and then the node itself with a
command that affects the entire cluster. It is my current understanding that I
can set the “RebootProgram” param in slurm.conf to be a command, and then
trigger the shutdown via “scontrol reboot_nodes” which
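The pieces I believe are involved (hedging, since I haven't tried this yet and
the exact scontrol subcommand should be checked against the man page for your
release):
RebootProgram=/sbin/reboot    # in slurm.conf; or a wrapper script that powers the node off instead
scontrol reboot ASAP nextstate=down <nodelist>
ASAP drains each node so the program only runs once its jobs have finished.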
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Will Dennis
Sent: Wednesday, July 17, 2019 12:56 PM
To: Slurm User Community List
Subject: Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state
Not thinking that the server (which runs both the Slurm controller daemon and
the DB) is the issue
staying in "RUNNING" state
On 7/17/19 12:26 AM, Chris Samuel wrote:
> On 16/7/19 11:43 am, Will Dennis wrote:
>
>> [2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full
>> (20140), discarding DBD_STEP_START:1442 request
>
> So it looks like your slurmd
"sacctmgr show runaway" was nil. A few minutes later
however, "sacctmgr show runaway" had entries again.
If someone knows what else I might try to isolate/resolve this issue, please
kindly assist...
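For reference, "runaway" here is sacctmgr's runaway-jobs report, i.e.:
sacctmgr show runawayjobs
which lists jobs the accounting database still believes are running and offers
to fix them; the puzzle is why new entries keep appearing after each fix.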
From: Will Dennis
Sent: Tuesday, July 16, 2019 2:43 PM
To: slurm-users@lists.schedmd.
Hi all,
Was looking at the running jobs on one group's cluster, and saw there was an
insane amount of "running" jobs when I did a sacct -X -s R; then looked at
output of squeue, and found a much more reasonable number...
root@slurm-controller1:/ # sacct -X -p -s R | wc -l
8895
root@slurm-contro
understand how to fix this?
-----Original Message-----
From: Will Dennis
Sent: Tuesday, May 07, 2019 11:01 AM
To: slurm-users@lists.schedmd.com
Subject: Slurm database error messages (redux)
Hi all,
We had to restart the slurmdbd service on one of our clusters running Slurm
17.11.7 yesterd
Hi all,
We had to restart the slurmdbd service on one of our clusters running Slurm
17.11.7 yesterday, since folks were experiencing errors with job scheduling,
and running 'sacct':
-
$ sacct -X -p -o
jobid,jobname,user,partition%-30,nodelist,alloccpus,reqmem,cputime,qos,state,exitcode,All
%"PRIu64")",
context_ptr->gres_type,
gres_data->gres_cnt_found,
gres_data->gres_cnt_config);
}
rc = EINVAL;
}
Where is the "gres_cnt_found" value b
: [slurm-users] Can one specify attributes on a GRES resource?
On 21/3/19 7:39 pm, Will Dennis wrote:
> Why does it think that the "gres/gpu_mem_per_card" count is 0? How can I fix
> this?
Did you remember to distribute gres.conf as well to the nodes?
--
Chris Samuel : http:/
I tried doing this as follows:
Node's gres.conf:
##
# Slurm's Generic Resource (GRES) configuration file
##
Name=gpu File=/dev/nvidia0 Type=1050TI
Name=gpu_mem_per_card C
Hi all,
I currently have features specified on my GPU-equipped nodes as follows:
GPUMODEL_1050TI,GPUCHIP_GP107,GPUARCH_PASCAL,GPUMEM_4GB,GPUCUDACORES_768
or
GPUMODEL_TITANV,GPUCHIP_GV100,GPUARCH_VOLTA,GPUMEM_12GB,GPUCUDACORES_5120,GPUTENSORCORES_640
The "GPUMEM" and "GPU[CUDA|TENSOR]CORES" tags a
Looking at output of 'sshare", I see:
root@myserver:~# sshare -l
             Account       User  RawShares  NormShares    RawUsage   NormUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ----------- ------------- ----------
root
Yes, we've thought about using FS-Cache, but it doesn't help on the first
read-in, and the cache eviction may affect subsequent read attempts...
(different people are using different data sets, and the cache will probably
not hold all of them at the same time...)
On Friday, February 22, 2019 2
(replies inline)
On Friday, February 22, 2019 1:03 PM, Alex Chekholko said:
>Hi Will,
>
>If your bottleneck is now your network, you may want to upgrade the network.
>Then the disks will become your bottleneck :)
>
Per our network bandwidth analysis, it's not really the network that's the problem...
Thanks for the reply, Ray.
For one of my groups, on the GPU servers in their cluster, I have provided a
RAID-0 md array of multi-TB SSDs (for I/O speed) mounted on a given path
("/mnt/local" for historical reasons) that they can use for local scratch
space. Their other servers in the cluster ha
Hi folks,
Not directly Slurm-related, but... We have a couple of research groups that
have large data sets they are processing via Slurm jobs (deep-learning
applications) and are presently consuming the data via NFS mounts (both groups
have 10G ethernet interconnects between the Slurm nodes and
Yes! I always have E_WAYTOOMANY tabs open on my Chrome browser, and using
"TooManyTabs" plugin and searching for "Slurm" I see a whole bunch of "Slurm
Workload Manager" entries, then have to guess which one is what page...
-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@
On Mon, Sep 24, 2018 at 3:53 PM "Eli V" wrote:
>I'm not using the :no_consume syntax, simply Gres=name:#,y:z,...
>Of course after changes copy gres & slurm.conf to all nodes and scontrol
>reconfigure works great for me.
We are using ":no_consume" because we don't care how Slurm processes use/shar
Hi all,
We want to add in some Gres resource types pertaining to GPUs (amount of GPU
memory and CUDA cores) on some of our nodes. So we added the following params
into the 'gres.conf' on the nodes that have GPUs:
Name=gpu_mem Count=<#>G
Name=gpu_cores Count=<#>
And in slurm.conf:
GresTypes=g
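Concretely, the shape of configuration I'm aiming for (the counts and node
name are illustrative):
gres.conf on the GPU nodes:
Name=gpu File=/dev/nvidia0
Name=gpu_mem Count=4G
Name=gpu_cores Count=768
slurm.conf:
GresTypes=gpu,gpu_mem,gpu_cores
NodeName=gpunode01 Gres=gpu:1,gpu_mem:no_consume:4G,gpu_cores:no_consume:768
with the no_consume flag living in the node's Gres= spec in slurm.conf, if I'm
reading the docs right.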
require?)
2) How to port over the existing Slurm DBD database to the newer server?
Pointers to existing docs that answer these questions gratefully accepted (I
looked, but didn't find any that addressed my concerns.)
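For question 2, the approach I'm assuming (please correct me if it's wrong):
dump the accounting database on the old host and load it on the new one, e.g.
mysqldump --single-transaction slurm_acct_db > slurm_acct_db.sql
mysql slurm_acct_db < slurm_acct_db.sql    # on the new server, into an empty database
then point slurmdbd.conf at the new DB host and let slurmdbd do any schema
conversion on its first start. "slurm_acct_db" is just the default database name.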
Thanks!
Will Dennis
NEC Laboratories America
On Friday, May 25, 2018 5:31 AM, Pär Lindfors wrote:
> Time to start upgrading to Ubuntu 18.04 now then? :-)
Not yet time for us... There's problems with U18.04 that render it unusable for
our environment.
> For a 10 node cluster it might make more sense to run slurmctld and slurmdbd
> on the
This is a classic case in point.
Forgive me if I have misunderstood your setup.
On 25 May 2018 at 11:30, Pär Lindfors <pa...@nsc.liu.se> wrote:
Hi Will,
On 05/24/2018 05:43 PM, Will Dennis wrote:
> (we were using CentOS 7.x
> originally, now the compute nodes ar
Hi all,
We are building out a new Slurm cluster for a research group here;
unfortunately this has taken place over a long period of time, and there's been
some architectural changes made in the middle, most importantly the host OS on
the Slurm nodes (we were using CentOS 7.x originally, now the
A few thoughts…
1) I am not sure Slurm can run “all-in-one” with controller/worker/acctg-db all
on one host… If anyone else knows if this is doable, please chime in (I actually
have a request to do this for a single machine at work, where the researchers
want to have many folks share a single GP
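On point 1, to be fair, I don't know of anything that actually prevents
slurmctld, slurmdbd, and slurmd from all running on one machine; a single-host
slurm.conf would look roughly like the following, with the hostname and
resources made up:
ClusterName=onebox
SlurmctldHost=gpubox01
NodeName=gpubox01 CPUs=32 RealMemory=256000 Gres=gpu:4 State=UNKNOWN
PartitionName=main Nodes=gpubox01 Default=YES State=UP
Treat it as a sketch to verify, not a recommendation.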
Yes! That was it. I needed to install ‘libpam0g-dev’ (pkg description:
Development files for PAM)
Then after running “./configure, make, make contrib” again –
pkgbuilder@mlbuild02:~/test-build/slurm-16.05.4$ find . -name "pam_slurm.so"
-print
./contribs/pam/.libs/pam_slurm.so
pkgbuilder@mlbuil
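To summarize the working recipe for anyone searching later (the prefix is just
an example, and the package list assumes a stock Ubuntu 16.04 build host):
sudo apt-get install build-essential libpam0g-dev libmunge-dev
./configure --prefix=/opt/slurm/16.05.4
make
make contrib
find . -name pam_slurm.so    # -> ./contribs/pam/.libs/pam_slurm.so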
That’s what I’m having a problem with – how to do this? (I don’t build software
often, so not a pro at this...)
My contrib/pam folder contains:
pkgbuilder@mlbuild02:~/test-build/slurm-16.05.4/contribs/pam$ ls -la
total 92
drwxr-xr-x 3 pkgbuilder pkgbuilder 4096 May 4 14:11 .
drwxr-xr-x 19 pkg
I just tried unpacking the original archive, and running “./configure, make,
make contrib” but no luck – still no ‘pam_slurm.so’ file created... What am I
missing here?
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of
Will Dennis
Sent: Friday, May 04, 2018 2:50 PM
make: Nothing to be done for 'all'.”)
From: Will Dennis
Sent: Thursday, May 03, 2018 11:07 PM
To: slurm-users@lists.schedmd.com
Subject: Finding / compiling "pam_slurm.so" for Ubuntu 16.04
Hello everyone,
Back a year ago or so, I started a new SLURM cluster, and had produ
, how to compile it?
Thanks,
Will Dennis
Sr. Systems Administrator,
NEC Laboratories America