I don't see how that bug is related. That bug is about an RPM built with
NVML autodetect enabled requiring the libnvidia-ml.so library. His
problem is the opposite: he's already using NVML autodetect but wants to
disable that feature on a single node, and it looks like that node isn't
running RPMs built with NVML support.
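If it helps, one quick way to check whether a given node's Slurm build
has NVML support (paths here assume a typical RPM layout; adjust for
your install):

  # The NVML GPU plugin only exists if Slurm was built against NVML:
  ls /usr/lib64/slurm/gpu_nvml.so
  # And libnvidia-ml itself has to be resolvable at runtime:
  ldconfig -p | grep libnvidia-ml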
Prentice
On 2/19/21 3:43 PM, Robert Kudyba wrote:
have you seen this? https://bugs.schedmd.com/show_bug.cgi?id=7919#c7, fixed in 20.06.1
On Fri, Feb 19, 2021 at 11:34 AM Paul Brunk <pbr...@uga.edu> wrote:
Hi all:
(I hope plague and weather are being visibly less than maximally cruel
to you all.)
In short, I was trying to exempt a node from NVML Autodetect, and
apparently introduced a syntax error in gres.conf. This is not an
urgent matter for us now, but I'm curious what went wrong. Thanks for
lending any eyes to this!
More info:
Slurm 20.02.6, CentOS 7.
We've historically had only this in our gres.conf:
AutoDetect=nvml
Each of our GPU nodes has e.g. 'Gres=gpu:V100:1' as part of its
NodeName entry (GPU models vary across them).
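For completeness, one of those entries looks roughly like this (the CPU
and memory figures here are placeholders, not our real hardware):

  NodeName=a1-10 CPUs=32 RealMemory=192000 Gres=gpu:V100:1 State=UNKNOWN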
I wanted to exempt one GPU node from the autodetect (I was curious about
the presence or absence of the GPU model subtype designation,
e.g. 'V100' vs. 'v100s'), so I changed gres.conf to this (modelled
after the 'gres.conf' man page):
AutoDetect=nvml
NodeName=a1-10 AutoDetect=off Name=gpu File=/dev/nvidia0
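In hindsight, a lower-risk way to preview what autodetect reports on a
node, without touching the shared config, is slurmd's GRES dump (if
your version has it; check 'slurmd --help'):

  # On the GPU node: print the GRES configuration slurmd detects, then exit.
  slurmd -G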
I restarted slurmctld, then ran "scontrol reconfigure". Each node then
hit a fatal error parsing gres.conf, which broke RPC communication
between slurmctld and the nodes, so slurmctld considered the nodes failed.
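Concretely, that was (assuming the usual systemd units on CentOS 7):

  systemctl restart slurmctld
  scontrol reconfigure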
Here's how it looked to slurmctld:
[2021-02-04T13:36:30.482] backfill: Started
JobId=1469772_3(1473148) in batch on ra3-6
[2021-02-04T15:14:48.642] error: Node ra3-6 appears to have a
different slurm.conf than the slurmctld. This could cause issues
with communication and functionality. Please review both files
and make sure they are the same. If this is expected ignore, and
set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2021-02-04T15:25:40.258] agent/is_node_resp: node:ra3-6
RPC:REQUEST_PING : Communication connection failure
[2021-02-04T15:39:49.046] requeue job JobId=1443912 due to failure
of node ra3-6
And how it looked to the slurmds:
[2021-02-04T15:14:50.730] Message aggregation disabled
[2021-02-04T15:14:50.742] error: Parsing error at unrecognized
key: AutoDetect
[2021-02-04T15:14:50.742] error: Parse error in file
/var/lib/slurmd/conf-cache/gres.conf line 2: " AutoDetect=off
Name=gpu File=/dev/nvidia0"
[2021-02-04T15:14:50.742] fatal: error opening/reading
/var/lib/slurmd/conf-cache/gres.conf
Reverting to the original one-line gres.conf returned the cluster to
production state. (We run Slurm configless, hence the conf-cache path
above; that's how the bad gres.conf reached every node after the
reconfigure.)
--
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center
Enterprise IT Svcs, the University of Georgia
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov