Re: [slurm-users] Do not upgrade mysql to 5.7.30!

2020-05-07 Thread Bill Broadley
On 5/6/20 11:30 AM, Dustin Lang wrote: Hi, Ubuntu has made mysql 5.7.30 the default version.  At least with Ubuntu 16.04, this causes severe problems with Slurm dbd (v 17.x, 18.x, and 19.x; not sure about 20). I can confirm that kills slurmdbd on ubuntu 18.04 as well. I had compiled slurm

Re: [slurm-users] 19.05 and GPUs vs GRES

2019-09-05 Thread Bill Broadley
Anyone know if the new GPU support allows having a different number of GPUs per node? I found: https://www.ch.cam.ac.uk/computing/slurm-usage Which mentions "SLURM does not support having varying numbers of GPUs per node in a job yet." I have a user with a particularly flexible code that would

Re: [slurm-users] Nodes not responding... how does slurm track it?

2019-05-15 Thread Bill Broadley
On 5/15/19 12:34 AM, Barbara Krašovec wrote: > It could be a problem with ARP cache. > If the number of devices approaches 512, there is a kernel limitation in dynamic ARP-cache size and it can result in the loss of connectivity between nodes. We have 162 compute nodes, a dozen or so file

[slurm-users] Nodes not responding... how does slurm track it?

2019-05-14 Thread Bill Broadley
My latest addition to a cluster results in a group of the same nodes periodically getting listed as "not-responding" and usually (but not always) recovering. I increased logging up to debug3 and see messages like: [2019-05-14T17:09:25.247] debug: Spawning ping agent for bigmem[1-9],bm[1,7,9-13

Re: [slurm-users] Slurmctld 18.08.1 and 18.08.3 segfault

2018-11-14 Thread Bill Broadley
On 11/13/18 9:39 PM, Kilian Cavalotti wrote: > Hi Bill, > There are a couple mentions of the same backtrace on the bugtracker, but that was a long time ago (namely https://bugs.schedmd.com/show_bug.cgi?id=1557 and https://bugs.schedmd.com/show_bug.cgi?id=1660, for Slurm 14.11). Weird to see

[slurm-users] Slurmctld 18.08.1 and 18.08.3 segfault

2018-11-13 Thread Bill Broadley
After being up since the second week in Oct or so, yesterday our slurm controller started segfaulting. It was compiled/run on ubuntu 16.04.1. Nov 12 14:31:48 nas-11-1 kernel: [2838306.311552] srvcn[9111]: segfault at 58 ip 004b51fa sp 7fbe270efb70 error 4 in slurmctld[40+eb000

Re: [slurm-users] Cgroups and swap with 18.08.1?

2018-10-19 Thread Bill Broadley
On 10/16/18 3:38 AM, Bjørn-Helge Mevik wrote: > Just a tip: Make sure that the kernel has support for constraining swap space. I believe we once had to reinstall one of our clusters because we had forgotten to check that. I tried starting slurmd with -D -v -v -v and got: slurmd: debug:

[slurm-users] Cgroups and swap with 18.08.1?

2018-10-15 Thread Bill Broadley
Greetings, I'm using ubuntu-18.04 and slurm-18.08.1 compiled from source. I followed the directions on: https://slurm.schedmd.com/cgroups.html And: https://slurm.schedmd.com/cgroup.conf.html That resulted in: $ cat slurm.conf | egrep -i "cgroup|CR_" ProctrackType=proctrack/cgroup TaskPlugin=t

[slurm-users] PMIX and slurm failure (and fix).

2018-05-17 Thread Bill Broadley
Greetings all, Just wanted to mention I've been building the newest slurm on Ubuntu 18.04. Gcc-7.3 is the default compiler, which means that the various dependencies (munge, libevent, hwloc, netloc, pmix, etc) are already available and built with gcc-7.3. I carefully built slurm-17.11.6 + openmpi

Re: [slurm-users] Slurm-17.11.5 + Pmix-2.1.1/Debugging

2018-05-08 Thread Bill Broadley
On 05/08/2018 05:33 PM, Christopher Samuel wrote: > On 09/05/18 10:23, Bill Broadley wrote: >> It's possible of course that it's entirely an openmpi problem, I'll be investigating and posting there if I can't find a solution. > One of the cha

[slurm-users] Slurm-17.11.5 + Pmix-2.1.1/Debugging

2018-05-08 Thread Bill Broadley
Greetings all, I have slurm-17.11.5, pmix-1.2.4, and openmpi-3.0.1 working on several clusters. I find srun handy for things like: bill@headnode:~/src/relay$ srun -N 2 -n 2 -t 1 ./relay 1 c7-18 c7-19 size= 1, 16384 hops, 2 nodes in 0.03 sec ( 2.00 us/hop) 1953 KB/sec Building was st