Damn,

I almost always forget that most of the submission part is done on the master :/
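In other words: slurmctld itself generates the job credential, so the submitting uid has to resolve on the master, not only on the compute nodes. When it does not, the ctld log shows it directly, e.g. (log path as in the posted slurm.conf):

$ grep getpwuid /var/log/slurm/slurmctld.log
[2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid failed for uid=1000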

Best
Marcus

On 10/8/19 11:45 AM, Eddy Swan wrote:
Hi Sean,

Thank you so much for your additional information.
The issue was indeed due to the missing user on the head node.
After I configured the LDAP client on slurm-master, the srun command now works with an LDAP account.
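For anyone hitting the same problem: a quick way to confirm the fix is to check that the controller itself can now resolve the account (the user name below is only a placeholder) and then resubmit:

$ getent passwd <ldap_user>   # on slurm-master; should now print a passwd entry
$ srun hostname               # should no longer fail with "Error generating job credential"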

Best regards,
Eddy Swan

On Tue, Oct 8, 2019 at 4:15 PM Sean Crosby <scro...@unimelb.edu.au> wrote:

    Looking at the SLURM code, it looks like it is failing in a call to
    getpwuid_r on the ctld.

    What is the output of the following (on slurm-master):

    getent passwd turing
    getent passwd 1000
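    If the user is known to the name service on slurm-master, both commands
    should return the same passwd-style entry, roughly like this (the home
    directory and shell here are just examples):

    turing:x:1000:1000:turing:/home/turing:/bin/bash

    If either command returns nothing, the ctld cannot resolve the uid and
    credential generation will fail.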

    Sean


    --
    Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
    Research Platform Services | Business Services
    CoEPP Research Computing | School of Physics
    The University of Melbourne, Victoria 3010 Australia


    On Mon, 7 Oct 2019 at 18:36, Eddy Swan <ed...@prestolabs.io> wrote:

        Hi Marcus,

        piglet-17 as submit host:
        $ id 1000
        uid=1000(turing) gid=1000(turing)
        groups=1000(turing),10(wheel),991(vboxusers)

        piglet-18:
        $ id 1000
        uid=1000(turing) gid=1000(turing)
        groups=1000(turing),10(wheel),992(vboxusers)

        uid 1000 is a local user on each node (piglet-17~19).
        I also tried to submit as an LDAP user, but still got the same error.

        Best regards,
        Eddy Swan

        On Mon, Oct 7, 2019 at 2:41 PM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:

            Hi Eddy,

            what is the result of "id 1000" on the submit host and on
            piglet-18?

            Best
            Marcus

            On 10/7/19 8:07 AM, Eddy Swan wrote:
            Hi All,

            I am currently testing Slurm version 19.05.3-2 on CentOS 7
            with one master and three compute nodes.
            I am using the same configuration that works on version
            17.02.7, but for some reason it does not work on 19.05.3-2.

            $ srun hostname
            srun: error: Unable to create step for job 19: Error generating job credential
            srun: Force Terminated job 19

            If I run it as root, it works fine.

            $ sudo srun hostname
            piglet-18

            Configuration:
            $ cat /etc/slurm/slurm.conf
            # Common
            ControlMachine=slurm-master
            ControlAddr=10.15.131.32
            ClusterName=slurm-cluster
            RebootProgram="/usr/sbin/reboot"

            MailProg=/bin/mail
            ProctrackType=proctrack/cgroup
            ReturnToService=2
            StateSaveLocation=/var/spool/slurmctld
            TaskPlugin=task/cgroup

            # LOGGING AND ACCOUNTING
            AccountingStorageType=accounting_storage/filetxt
            AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log
            JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log
            JobAcctGatherType=jobacct_gather/cgroup

            # RESOURCES
            MemLimitEnforce=no

            ## Rack 1
            NodeName=piglet-19 NodeAddr=10.15.2.19 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
            NodeName=piglet-18 NodeAddr=10.15.2.18 RealMemory=128000 TmpDisk=512000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
            NodeName=piglet-17 NodeAddr=10.15.2.17 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=3

            # Preempt
            PreemptMode=REQUEUE
            PreemptType=preempt/qos

            PartitionName=batch Nodes=ALL MaxTime=2880 OverSubscribe=YES State=UP PreemptMode=REQUEUE PriorityTier=10 Default=YES

            # TIMERS
            KillWait=30
            MinJobAge=300
            MessageTimeout=3

            # SCHEDULING
            FastSchedule=1
            SchedulerType=sched/backfill
            SelectType=select/cons_res
            #SelectTypeParameters=CR_Core_Memory
            SelectTypeParameters=CR_CPU_Memory
            DefMemPerCPU=128

            # Limit
            MaxArraySize=201

            # slurmctld
            SlurmctldDebug=5
            SlurmctldLogFile=/var/log/slurm/slurmctld.log
            SlurmctldPidFile=/var/slurm/slurmctld.pid
            SlurmctldPort=6817
            SlurmctldTimeout=60
            SlurmUser=slurm

            # slurmd
            SlurmdDebug=5
            SlurmdLogFile=/var/log/slurmd.log
            SlurmdPort=6818
            SlurmdSpoolDir=/var/spool/slurmd
            SlurmdTimeout=300

            # REQUEUE
            #RequeueExitHold=1-199,201-255
            #RequeueExit=200
            RequeueExitHold=201-255
            RequeueExit=200

            slurmctld.log:
            [2019-10-07T13:38:47.724] debug:  sched: Running job scheduler
            [2019-10-07T13:38:49.254] error: slurm_auth_get_host: Lookup failed: Unknown host
            [2019-10-07T13:38:49.255] sched: _slurm_rpc_allocate_resources JobId=19 NodeList=piglet-18 usec=959
            [2019-10-07T13:38:49.259] debug:  laying out the 1 tasks on 1 hosts piglet-18 dist 2
            [2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid failed for uid=1000
            [2019-10-07T13:38:49.260] error: slurm_cred_create error
            [2019-10-07T13:38:49.262] _job_complete: JobId=19 WTERMSIG 1
            [2019-10-07T13:38:49.265] _job_complete: JobId=19 done
            [2019-10-07T13:38:49.270] debug:  sched: Running job scheduler
            [2019-10-07T13:38:56.823] debug:  sched: Running job scheduler
            [2019-10-07T13:39:13.504] debug:  backfill: beginning
            [2019-10-07T13:39:13.504] debug:  backfill: no jobs to backfill
            [2019-10-07T13:39:40.871] debug:  Spawning ping agent for piglet-19
            [2019-10-07T13:39:43.504] debug:  backfill: beginning
            [2019-10-07T13:39:43.504] debug:  backfill: no jobs to backfill
            [2019-10-07T13:39:46.999] error: slurm_auth_get_host: Lookup failed: Unknown host
            [2019-10-07T13:39:47.001] sched: _slurm_rpc_allocate_resources JobId=20 NodeList=piglet-18 usec=979
            [2019-10-07T13:39:47.005] debug:  laying out the 1 tasks on 1 hosts piglet-18 dist 2
            [2019-10-07T13:39:47.144] _job_complete: JobId=20 WEXITSTATUS 0
            [2019-10-07T13:39:47.147] _job_complete: JobId=20 done
            [2019-10-07T13:39:47.158] debug:  sched: Running job scheduler
            [2019-10-07T13:39:48.428] error: slurm_auth_get_host: Lookup failed: Unknown host
            [2019-10-07T13:39:48.429] sched: _slurm_rpc_allocate_resources JobId=21 NodeList=piglet-18 usec=1114
            [2019-10-07T13:39:48.434] debug:  laying out the 1 tasks on 1 hosts piglet-18 dist 2
            [2019-10-07T13:39:48.559] _job_complete: JobId=21 WEXITSTATUS 0
            [2019-10-07T13:39:48.560] _job_complete: JobId=21 done

            slurmd.log on piglet-18:
            [2019-10-07T13:38:42.746] debug:  _rpc_terminate_job, uid = 3001
            [2019-10-07T13:38:42.747] debug:  credential for job 17 revoked
            [2019-10-07T13:38:47.721] debug:  _rpc_terminate_job, uid = 3001
            [2019-10-07T13:38:47.722] debug:  credential for job 18 revoked
            [2019-10-07T13:38:49.267] debug:  _rpc_terminate_job, uid = 3001
            [2019-10-07T13:38:49.268] debug:  credential for job 19 revoked
            [2019-10-07T13:39:47.014] launch task 20.0 request from UID:0 GID:0 HOST:10.15.2.19 PORT:62137
            [2019-10-07T13:39:47.014] debug:  Checking credential with 404 bytes of sig data
            [2019-10-07T13:39:47.016] _run_prolog: run job script took usec=7
            [2019-10-07T13:39:47.016] _run_prolog: prolog with lock for job 20 ran for 0 seconds
            [2019-10-07T13:39:47.026] debug:  AcctGatherEnergy NONE plugin loaded
            [2019-10-07T13:39:47.026] debug:  AcctGatherProfile NONE plugin loaded
            [2019-10-07T13:39:47.026] debug:  AcctGatherInterconnect NONE plugin loaded
            [2019-10-07T13:39:47.026] debug:  AcctGatherFilesystem NONE plugin loaded
            [2019-10-07T13:39:47.026] debug:  switch NONE plugin loaded
            [2019-10-07T13:39:47.028] [20.0] debug:  CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
            [2019-10-07T13:39:47.028] [20.0] debug:  Job accounting gather cgroup plugin loaded
            [2019-10-07T13:39:47.028] [20.0] debug:  cont_id hasn't been set yet not running poll
            [2019-10-07T13:39:47.029] [20.0] debug:  Message thread started pid = 30331
            [2019-10-07T13:39:47.030] [20.0] debug:  task/cgroup: now constraining jobs allocated cores
            [2019-10-07T13:39:47.030] [20.0] debug:  task/cgroup: loaded
            [2019-10-07T13:39:47.030] [20.0] debug:  Checkpoint plugin loaded: checkpoint/none
            [2019-10-07T13:39:47.030] [20.0] Munge credential signature plugin loaded
            [2019-10-07T13:39:47.031] [20.0] debug:  job_container none plugin loaded
            [2019-10-07T13:39:47.031] [20.0] debug:  mpi type = none
            [2019-10-07T13:39:47.031] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
            [2019-10-07T13:39:47.031] [20.0] debug:  spank: opening plugin stack /etc/slurm/plugstack.conf
            [2019-10-07T13:39:47.031] [20.0] debug:  mpi type = (null)
            [2019-10-07T13:39:47.031] [20.0] debug:  mpi/none: slurmstepd prefork
            [2019-10-07T13:39:47.031] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
            [2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: job abstract cores are '2'
            [2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: step abstract cores are '2'
            [2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: job physical cores are '4'
            [2019-10-07T13:39:47.032] [20.0] debug:  task/cgroup: step physical cores are '4'
            [2019-10-07T13:39:47.065] [20.0] debug level = 2
            [2019-10-07T13:39:47.065] [20.0] starting 1 tasks
            [2019-10-07T13:39:47.066] [20.0] task 0 (30336) started 2019-10-07T13:39:47
            [2019-10-07T13:39:47.066] [20.0] debug:  jobacct_gather_cgroup_cpuacct_attach_task: jobid 20 stepid 0 taskid 0 max_task_id 0
            [2019-10-07T13:39:47.066] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
            [2019-10-07T13:39:47.067] [20.0] debug:  jobacct_gather_cgroup_memory_attach_task: jobid 20 stepid 0 taskid 0 max_task_id 0
            [2019-10-07T13:39:47.067] [20.0] debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
            [2019-10-07T13:39:47.068] [20.0] debug:  IO handler started pid=30331
            [2019-10-07T13:39:47.099] [20.0] debug:  jag_common_poll_data: Task 0 pid 30336 ave_freq = 1597534 mem size/max 0/0 vmem size/max 210853888/210853888, disk read size/max (0/0), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
            [2019-10-07T13:39:47.101] [20.0] debug:  mpi type = (null)
            [2019-10-07T13:39:47.101] [20.0] debug:  Using mpi/none
            [2019-10-07T13:39:47.102] [20.0] debug:  CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
            [2019-10-07T13:39:47.104] [20.0] debug:  Sending launch resp rc=0
            [2019-10-07T13:39:47.105] [20.0] task 0 (30336) exited with exit code 0.
            [2019-10-07T13:39:47.139] [20.0] debug:  step_terminate_monitor_stop signaling condition
            [2019-10-07T13:39:47.139] [20.0] debug:  Waiting for IO
            [2019-10-07T13:39:47.140] [20.0] debug:  Closing debug channel
            [2019-10-07T13:39:47.140] [20.0] debug:  IO handler exited, rc=0
            [2019-10-07T13:39:47.148] [20.0] debug:  Message thread exited
            [2019-10-07T13:39:47.149] [20.0] done with job

            I am not sure what I am missing. I hope someone can point
            out what I am doing wrong here.
            Thank you.

            Best regards,
            Eddy Swan




--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de
