The only thing I can suggest, once again, is increasing the logging of both slurmctld and slurmd. As for downgrading, I wouldn't suggest running a 17.x slurmdbd against a database built with 18.x; I imagine there are enough schema changes there to cause trouble. I also don't expect downgrading will fix your issue. If you are running 18.08.0, the most recent release is 18.08.3; the NEWS file packed in the tarballs lists the fixes in each version, and I don't see any that would fit your case.
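To make the log-level bump concrete, in slurm.conf it would look something like the fragment below. The levels and log-file paths here are only an illustration, not taken from this cluster; adjust to taste.

```
# slurm.conf fragment -- illustrative values
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurm/slurmd.log
```

After editing, `scontrol reconfigure` (or a daemon restart) picks up the change; `scontrol setdebug debug2` raises the controller's level at runtime without touching the file.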

On 12/04/2018 02:11 PM, Lou Nicotra wrote:
Brian, I used a single gres.conf file and distributed it to all nodes... Restarted all daemons; unfortunately, scontrol still does not show any Gres resources for the GPU nodes...

Will try to roll back to a 17.x release. Is it basically a matter of removing the 18.x rpms and installing the 17.x ones? Does the DB need to be downgraded also?

Thanks...
Lou

On Tue, Dec 4, 2018 at 10:25 AM Brian W. Johanson <bjoha...@psc.edu <mailto:bjoha...@psc.edu>> wrote:


    Do one more pass through, making sure
    s/1080GTX/1080gtx/ and s/K20/k20/ everywhere.
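As a concrete (hypothetical) version of that pass, run against a scratch copy; the sample path and file contents below are made up for illustration:

```shell
# Work on a scratch copy of gres.conf; these lines are a made-up sample.
cat > /tmp/gres.conf.sample <<'EOF'
Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
EOF
# Normalize the GPU type names to lowercase so they match slurm.conf.
sed -i -e 's/1080GTX/1080gtx/g' -e 's/K20/k20/g' /tmp/gres.conf.sample
cat /tmp/gres.conf.sample
```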

    Shut down all slurmd daemons and slurmctld; then start slurmctld, then start slurmd.


    I find it less confusing to have a global gres.conf file.  I haven't used
    a device list (nvidia[0-1]), mainly because I want to specify the cores to
    use for each gpu.

    gres.conf would look something like...

    NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia0 Cores=0
    NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu Type=k20 File=/dev/nvidia1 Cores=1
    NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia0 Cores=0
    NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080gtx File=/dev/nvidia1 Cores=1

    which can be distributed to all nodes.
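Distribution could be as simple as a loop like the following; the destination path and the use of plain scp (rather than pdsh/clush or config management) are assumptions, not something from this thread:

```shell
# Expand the node names used in this thread (tiger01..tiger22) and show
# the copy command for each; drop the 'echo' to actually push the file.
for n in $(printf 'tiger%02d ' $(seq 1 22)); do
    echo scp /etc/slurm/gres.conf "root@$n:/etc/slurm/gres.conf"
done
```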

    -b


    On 12/04/2018 09:55 AM, Lou Nicotra wrote:
    Brian, the specific node does not show any gres...
    root@panther02 slurm# scontrol show partition=tiger_1
    PartitionName=tiger_1
       AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
       AllocNodes=ALL Default=YES QoS=N/A
       DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
       MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
       Nodes=tiger[01-22]
       PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
       OverTimeLimit=NONE PreemptMode=OFF
       State=UP TotalCPUs=1056 TotalNodes=22 SelectTypeParameters=NONE
       JobDefaults=(null)
       DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

    root@panther02 slurm#  scontrol show node=tiger11
    NodeName=tiger11 Arch=x86_64 CoresPerSocket=12
       CPUAlloc=0 CPUTot=48 CPULoad=11.50
       AvailableFeatures=HyperThread
       ActiveFeatures=HyperThread
       Gres=(null)
       NodeAddr=X.X.X.X NodeHostName=tiger11 Version=18.08
       OS=Linux 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015
       RealMemory=1 AllocMem=0 FreeMem=269695 Sockets=2 Boards=1
       State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
       Partitions=tiger_1,compute_1
       BootTime=2018-04-02T13:30:12 SlurmdStartTime=2018-12-03T16:13:22
       CfgTRES=cpu=48,mem=1M,billing=48
       AllocTRES=
       CapWatts=n/a
       CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
       ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

    So, something is not set up correctly... Could it be an 18.x bug?

    Thanks.


    On Tue, Dec 4, 2018 at 9:31 AM Lou Nicotra <lnico...@interactions.com
    <mailto:lnico...@interactions.com>> wrote:

        Thanks Michael. I will try 17.x as I also could not see anything
        wrong with my settings... Will report back afterwards...

        Lou

        On Tue, Dec 4, 2018 at 9:11 AM Michael Di Domenico
        <mdidomeni...@gmail.com <mailto:mdidomeni...@gmail.com>> wrote:

            unfortunately, someone smarter than me will have to help further.
            I'm not sure I see anything specifically wrong.  The one thing I
            might try is backing the software down to a 17.x release series.
            I recently tried 18.x and had some issues.  I can't say whether
            it'll be any different, but you might be exposing an undiagnosed
            bug in the 18.x branch.
            On Mon, Dec 3, 2018 at 4:17 PM Lou Nicotra
            <lnico...@interactions.com <mailto:lnico...@interactions.com>> wrote:
            >
            > Made the change in gres.conf on the local server and restarted
            slurmd and slurmctld on the master... Unfortunately, same error...
            >
            > Distributed the corrected gres.conf to all k20 servers and
            restarted slurmd and slurmctld...  Still the same error...
            >
            > On Mon, Dec 3, 2018 at 4:04 PM Brian W. Johanson
            <bjoha...@psc.edu <mailto:bjoha...@psc.edu>> wrote:
            >>
            >> Is that a lowercase k in k20 specified in the batch script and
            nodename, and an uppercase K specified in gres.conf?
            >>
            >> On 12/03/2018 09:13 AM, Lou Nicotra wrote:
            >>
            >> Hi All, I have recently set up a slurm cluster with my servers
            and I'm running into an issue while submitting GPU jobs. It has
            something to do with gres configurations, but I just can't seem
            to figure out what is wrong. Non-GPU jobs run fine.
            >>
            >> The error, after submitting a batch job, is as follows:
            >> sbatch: error: Batch job submission failed: Invalid Trackable
            RESource (TRES) specification
            >>
            >> My batch job is as follows:
            >> #!/bin/bash
            >> #SBATCH --partition=tiger_1   # partition name
            >> #SBATCH --gres=gpu:k20:1
            >> #SBATCH --gres-flags=enforce-binding
            >> #SBATCH --time=0:20:00  # wall clock limit
            >> #SBATCH --output=gpu-%J.txt
            >> #SBATCH --account=lnicotra
            >> module load cuda
            >> python gpu1
            >>
            >> Where gpu1 is a GPU test script that runs correctly when
            invoked directly via python. The tiger_1 partition has servers
            with GPUs, a mix of 1080GTX and K20, as specified in slurm.conf.
            >>
            >> I have defined GRES resources in the slurm.conf file:
            >> # GPU GRES
            >> GresTypes=gpu
            >> NodeName=tiger[01,05,10,15,20] Gres=gpu:1080gtx:2
            >> NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Gres=gpu:k20:2
            >>
            >> And have a local gres.conf on the servers containing GPUs...
            >> lnicotra@tiger11 ~# cat /etc/slurm/gres.conf
            >> # GPU Definitions
            >> # NodeName=tiger[02-04,06-09,11-14,16-19,21-22] Name=gpu
            Type=K20 File=/dev/nvidia[0-1]
            >> Name=gpu Type=K20 File=/dev/nvidia[0-1] Cores=0,1
            >>
            >> and a similar one for the 1080GTX
            >> # GPU Definitions
            >> # NodeName=tiger[01,05,10,15,20] Name=gpu Type=1080GTX
            File=/dev/nvidia[0-1]
            >> Name=gpu Type=1080GTX File=/dev/nvidia[0-1] Cores=0,1
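[Per the case-matching point raised earlier in the thread: with the Type values lowercased so they match slurm.conf's gpu:k20 / gpu:1080gtx names, the two per-node files would read:]

```
# k20 nodes
Name=gpu Type=k20 File=/dev/nvidia[0-1] Cores=0,1
# 1080gtx nodes
Name=gpu Type=1080gtx File=/dev/nvidia[0-1] Cores=0,1
```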
            >>
            >> The account manager seems to know about the GPUs...
            >> lnicotra@tiger11 ~# sacctmgr show tres
            >>     Type            Name     ID
            >> -------- --------------- ------
            >>      cpu                      1
            >>      mem                      2
            >>   energy                      3
            >>     node                      4
            >>  billing                      5
            >>       fs            disk      6
            >>     vmem                      7
            >>    pages                      8
            >>     gres             gpu   1001
            >>     gres         gpu:k20   1002
            >>     gres     gpu:1080gtx   1003
            >>
            >> Can anyone point out what am I missing?
            >>
            >> Thanks!
            >> Lou
            >>
            >>
            >> --
            >>
            >> Lou Nicotra
            >>
            >> IT Systems Engineer - SLT
            >>
            >> Interactions LLC
            >>
            >> o:  908-673-1833
            >>
            >> m: 908-451-6983
            >>
            >> lnico...@interactions.com <mailto:lnico...@interactions.com>
            >>
            >> www.interactions.com <http://www.interactions.com>
            >>
            >>
            
*******************************************************************************
            >>
            >> This e-mail and any of its attachments may contain
            Interactions LLC proprietary information, which is privileged,
            confidential, or subject to copyright belonging to the
            Interactions LLC. This e-mail is intended solely for the use of
            the individual or entity to which it is addressed. If you are not
            the intended recipient of this e-mail, you are hereby notified
            that any dissemination, distribution, copying, or action taken in
            relation to the contents of and attachments to this e-mail is
            strictly prohibited and may be unlawful. If you have received
            this e-mail in error, please notify the sender immediately and
            permanently delete the original and any copy of this e-mail and
            any printout. Thank You.
            >>
            >>
            
*******************************************************************************
            >>
            >>
            >
            >








