Have you seen this? https://bugs.schedmd.com/show_bug.cgi?id=7919#c7 (fixed in 20.06.1)
On Fri, Feb 19, 2021 at 11:34 AM Paul Brunk <pbr...@uga.edu> wrote:
> Hi all:
>
> (I hope plague and weather are being visibly less than maximally cruel
> to you all.)
>
> In short, I was trying to exempt a node from NVML AutoDetect, and
> apparently introduced a syntax error in gres.conf. This is not an
> urgent matter for us now, but I'm curious what went wrong. Thanks for
> lending any eyes to this!
>
> More info:
>
> Slurm 20.02.6, CentOS 7.
>
> We've historically had only this in our gres.conf:
> AutoDetect=nvml
>
> Each of our GPU nodes has e.g. 'Gres=gpu:V100:1' as part of its
> NodeName entry (GPU models vary across them).
>
> I wanted to exempt one GPU node from the autodetect (I was curious about
> the presence or absence of the GPU model subtype designation,
> e.g. 'V100' vs. 'v100s'), so I changed gres.conf to this (modelled
> after the 'gres.conf' man page):
>
> AutoDetect=nvml
> NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0
>
> I restarted slurmctld, then ran "scontrol reconfigure". Each node hit a
> fatal error parsing gres.conf, causing RPC failures between slurmctld
> and the nodes, which led slurmctld to consider the nodes failed.
>
> Here's how it looked to slurmctld:
>
> [2021-02-04T13:36:30.482] backfill: Started JobId=1469772_3(1473148) in batch on ra3-6
> [2021-02-04T15:14:48.642] error: Node ra3-6 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
> [2021-02-04T15:25:40.258] agent/is_node_resp: node:ra3-6 RPC:REQUEST_PING : Communication connection failure
> [2021-02-04T15:39:49.046] requeue job JobId=1443912 due to failure of node ra3-6
>
> And to the slurmds:
>
> [2021-02-04T15:14:50.730] Message aggregation disabled
> [2021-02-04T15:14:50.742] error: Parsing error at unrecognized key: AutoDetect
> [2021-02-04T15:14:50.742] error: Parse error in file /var/lib/slurmd/conf-cache/gres.conf line 2: " AutoDetect=off Name=gpu File=/dev/nvidia0"
> [2021-02-04T15:14:50.742] fatal: error opening/reading /var/lib/slurmd/conf-cache/gres.conf
>
> Reverting to the original, one-line gres.conf returned the cluster to
> production state.
>
> --
> Paul Brunk, system administrator
> Georgia Advanced Computing Resource Center
> Enterprise IT Svcs, the University of Georgia
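
A follow-up note for anyone who hits the same parse error: before pushing a gres.conf change cluster-wide, it can help to test it on just the node you're changing. This is only a sketch of the kind of check I'd run, assuming you can log into the node, that the node reads the gres.conf you're testing (the logs above show a configless-style conf-cache path, so adjust for how your configs are distributed), and that your slurmd build supports the -G option (which prints the GRES configuration slurmd detects and then exits):

    # On the GPU node, with the candidate gres.conf in place:
    slurmd -G

    # From anywhere, see what the controller currently believes the node offers:
    scontrol show node a1-10 | grep -i gres

If the node's slurmd doesn't understand the per-node 'AutoDetect=off' syntax, 'slurmd -G' should hit the same "unrecognized key" parse error shown above, without taking a production node down in the process.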