Hi Sean,

Thank you so much for the additional information. The issue was indeed due to a missing user on the head node. After I configured the LDAP client on slurm-master, the srun command now works using an LDAP account.
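
For anyone who finds this thread later: the fix is simply making the head node able to resolve the submitting user. As a sketch of what we did on CentOS 7 (the LDAP URI and base DN below are placeholders, not our real values; adjust accordingly if you use nslcd instead of sssd):

$ sudo authconfig --enableldap --enableldapauth \
      --ldapserver=ldap://ldap.example.com \
      --ldapbasedn="dc=example,dc=com" \
      --enablemkhomedir --update

# verify on slurm-master; illustrative output, assuming the LDAP entry
# mirrors the local accounts on the compute nodes
$ getent passwd turing
turing:x:1000:1000::/home/turing:/bin/bash
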
Best regards,
Eddy Swan

On Tue, Oct 8, 2019 at 4:15 PM Sean Crosby <scro...@unimelb.edu.au> wrote:

> Looking at the SLURM code, it looks like it is failing with a call to
> getpwuid_r on the ctld.
>
> What is (on slurm-master):
>
> getent passwd turing
> getent passwd 1000
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Platform Services | Business Services
> CoEPP Research Computing | School of Physics
> The University of Melbourne, Victoria 3010 Australia
>
>
> On Mon, 7 Oct 2019 at 18:36, Eddy Swan <ed...@prestolabs.io> wrote:
>
>> Hi Marcus,
>>
>> piglet-17 as submit host:
>> $ id 1000
>> uid=1000(turing) gid=1000(turing)
>> groups=1000(turing),10(wheel),991(vboxusers)
>>
>> piglet-18:
>> $ id 1000
>> uid=1000(turing) gid=1000(turing)
>> groups=1000(turing),10(wheel),992(vboxusers)
>>
>> uid 1000 is a local user on each node (piglet-17~19).
>> I also tried to submit as an LDAP user, but still got the same error.
>>
>> Best regards,
>> Eddy Swan
>>
>> On Mon, Oct 7, 2019 at 2:41 PM Marcus Wagner <wag...@itc.rwth-aachen.de>
>> wrote:
>>
>>> Hi Eddy,
>>>
>>> what is the result of "id 1000" on the submit host and on piglet-18?
>>>
>>> Best
>>> Marcus
>>>
>>> On 10/7/19 8:07 AM, Eddy Swan wrote:
>>>
>>> Hi All,
>>>
>>> I am currently testing Slurm version 19.05.3-2 on CentOS 7 with a
>>> one-master, three-node configuration.
>>> I used the same configuration that works on version 17.02.7, but for
>>> some reason it does not work on 19.05.3-2.
>>>
>>> $ srun hostname
>>> srun: error: Unable to create step for job 19: Error generating job
>>> credential
>>> srun: Force Terminated job 19
>>>
>>> If I run it as root, it works fine.
>>>
>>> $ sudo srun hostname
>>> piglet-18
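>>>
>>> For completeness, here is a quick way to check whether a given host
>>> can resolve the uid at all (a sketch; "slurm-master" is our head node,
>>> and getent exits with status 2 when the key is not found by any NSS
>>> source — the output shown is illustrative, consistent with the logs
>>> below):
>>>
>>> $ ssh slurm-master 'getent passwd 1000 || echo "uid 1000 not found (rc=$?)"'
>>> uid 1000 not found (rc=2)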
>>>
>>> Configuration:
>>> $ cat /etc/slurm/slurm.conf
>>> # Common
>>> ControlMachine=slurm-master
>>> ControlAddr=10.15.131.32
>>> ClusterName=slurm-cluster
>>> RebootProgram="/usr/sbin/reboot"
>>>
>>> MailProg=/bin/mail
>>> ProctrackType=proctrack/cgroup
>>> ReturnToService=2
>>> StateSaveLocation=/var/spool/slurmctld
>>> TaskPlugin=task/cgroup
>>>
>>> # LOGGING AND ACCOUNTING
>>> AccountingStorageType=accounting_storage/filetxt
>>> AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log
>>> JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log
>>> JobAcctGatherType=jobacct_gather/cgroup
>>>
>>> # RESOURCES
>>> MemLimitEnforce=no
>>>
>>> ## Rack 1
>>> NodeName=piglet-19 NodeAddr=10.15.2.19 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
>>> NodeName=piglet-18 NodeAddr=10.15.2.18 RealMemory=128000 TmpDisk=512000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
>>> NodeName=piglet-17 NodeAddr=10.15.2.17 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=3
>>>
>>> # Preempt
>>> PreemptMode=REQUEUE
>>> PreemptType=preempt/qos
>>>
>>> PartitionName=batch Nodes=ALL MaxTime=2880 OverSubscribe=YES State=UP PreemptMode=REQUEUE PriorityTier=10 Default=YES
>>>
>>> # TIMERS
>>> KillWait=30
>>> MinJobAge=300
>>> MessageTimeout=3
>>>
>>> # SCHEDULING
>>> FastSchedule=1
>>> SchedulerType=sched/backfill
>>> SelectType=select/cons_res
>>> #SelectTypeParameters=CR_Core_Memory
>>> SelectTypeParameters=CR_CPU_Memory
>>> DefMemPerCPU=128
>>>
>>> # Limit
>>> MaxArraySize=201
>>>
>>> # slurmctld
>>> SlurmctldDebug=5
>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>> SlurmctldPidFile=/var/slurm/slurmctld.pid
>>> SlurmctldPort=6817
>>> SlurmctldTimeout=60
>>> SlurmUser=slurm
>>>
>>> # slurmd
>>> SlurmdDebug=5
>>> SlurmdLogFile=/var/log/slurmd.log
>>> SlurmdPort=6818
>>> SlurmdSpoolDir=/var/spool/slurmd
>>> SlurmdTimeout=300
>>>
>>> # REQUEUE
>>> #RequeueExitHold=1-199,201-255
>>> #RequeueExit=200
>>> RequeueExitHold=201-255
>>> RequeueExit=200
>>>
>>> slurmctld.log:
>>> [2019-10-07T13:38:47.724] debug: sched: Running job scheduler
>>> [2019-10-07T13:38:49.254] error: slurm_auth_get_host: Lookup failed: Unknown host
>>> [2019-10-07T13:38:49.255] sched: _slurm_rpc_allocate_resources JobId=19 NodeList=piglet-18 usec=959
>>> [2019-10-07T13:38:49.259] debug: laying out the 1 tasks on 1 hosts piglet-18 dist 2
>>> [2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid failed for uid=1000
>>> [2019-10-07T13:38:49.260] error: slurm_cred_create error
>>> [2019-10-07T13:38:49.262] _job_complete: JobId=19 WTERMSIG 1
>>> [2019-10-07T13:38:49.265] _job_complete: JobId=19 done
>>> [2019-10-07T13:38:49.270] debug: sched: Running job scheduler
>>> [2019-10-07T13:38:56.823] debug: sched: Running job scheduler
>>> [2019-10-07T13:39:13.504] debug: backfill: beginning
>>> [2019-10-07T13:39:13.504] debug: backfill: no jobs to backfill
>>> [2019-10-07T13:39:40.871] debug: Spawning ping agent for piglet-19
>>> [2019-10-07T13:39:43.504] debug: backfill: beginning
>>> [2019-10-07T13:39:43.504] debug: backfill: no jobs to backfill
>>> [2019-10-07T13:39:46.999] error: slurm_auth_get_host: Lookup failed: Unknown host
>>> [2019-10-07T13:39:47.001] sched: _slurm_rpc_allocate_resources JobId=20 NodeList=piglet-18 usec=979
>>> [2019-10-07T13:39:47.005] debug: laying out the 1 tasks on 1 hosts piglet-18 dist 2
>>> [2019-10-07T13:39:47.144] _job_complete: JobId=20 WEXITSTATUS 0
>>> [2019-10-07T13:39:47.147] _job_complete: JobId=20 done
>>> [2019-10-07T13:39:47.158] debug: sched: Running job scheduler
>>> [2019-10-07T13:39:48.428] error: slurm_auth_get_host: Lookup failed: Unknown host
>>> [2019-10-07T13:39:48.429] sched: _slurm_rpc_allocate_resources JobId=21 NodeList=piglet-18 usec=1114
>>> [2019-10-07T13:39:48.434] debug: laying out the 1 tasks on 1 hosts piglet-18 dist 2
>>> [2019-10-07T13:39:48.559] _job_complete: JobId=21 WEXITSTATUS 0
>>> [2019-10-07T13:39:48.560] _job_complete: JobId=21 done
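>>>
>>> (One more observation: the "slurm_auth_get_host: Lookup failed:
>>> Unknown host" lines suggest, if I understand them correctly, that the
>>> controller also cannot reverse-resolve the submitting host's address.
>>> That can be checked separately, e.g. with a node address from our
>>> slurm.conf; the output is illustrative:
>>>
>>> $ getent hosts 10.15.2.17
>>> 10.15.2.17      piglet-17
>>>
>>> If that returns nothing on slurm-master, adding /etc/hosts entries for
>>> the nodes is one way to fix it.)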
>>>
>>> slurmd.log on piglet-18:
>>> [2019-10-07T13:38:42.746] debug: _rpc_terminate_job, uid = 3001
>>> [2019-10-07T13:38:42.747] debug: credential for job 17 revoked
>>> [2019-10-07T13:38:47.721] debug: _rpc_terminate_job, uid = 3001
>>> [2019-10-07T13:38:47.722] debug: credential for job 18 revoked
>>> [2019-10-07T13:38:49.267] debug: _rpc_terminate_job, uid = 3001
>>> [2019-10-07T13:38:49.268] debug: credential for job 19 revoked
>>> [2019-10-07T13:39:47.014] launch task 20.0 request from UID:0 GID:0 HOST:10.15.2.19 PORT:62137
>>> [2019-10-07T13:39:47.014] debug: Checking credential with 404 bytes of sig data
>>> [2019-10-07T13:39:47.016] _run_prolog: run job script took usec=7
>>> [2019-10-07T13:39:47.016] _run_prolog: prolog with lock for job 20 ran for 0 seconds
>>> [2019-10-07T13:39:47.026] debug: AcctGatherEnergy NONE plugin loaded
>>> [2019-10-07T13:39:47.026] debug: AcctGatherProfile NONE plugin loaded
>>> [2019-10-07T13:39:47.026] debug: AcctGatherInterconnect NONE plugin loaded
>>> [2019-10-07T13:39:47.026] debug: AcctGatherFilesystem NONE plugin loaded
>>> [2019-10-07T13:39:47.026] debug: switch NONE plugin loaded
>>> [2019-10-07T13:39:47.028] [20.0] debug: CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
>>> [2019-10-07T13:39:47.028] [20.0] debug: Job accounting gather cgroup plugin loaded
>>> [2019-10-07T13:39:47.028] [20.0] debug: cont_id hasn't been set yet not running poll
>>> [2019-10-07T13:39:47.029] [20.0] debug: Message thread started pid = 30331
>>> [2019-10-07T13:39:47.030] [20.0] debug: task/cgroup: now constraining jobs allocated cores
>>> [2019-10-07T13:39:47.030] [20.0] debug: task/cgroup: loaded
>>> [2019-10-07T13:39:47.030] [20.0] debug: Checkpoint plugin loaded: checkpoint/none
>>> [2019-10-07T13:39:47.030] [20.0] Munge credential signature plugin loaded
>>> [2019-10-07T13:39:47.031] [20.0] debug: job_container none plugin loaded
>>> [2019-10-07T13:39:47.031] [20.0] debug: mpi type = none
>>> [2019-10-07T13:39:47.031] [20.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
>>> [2019-10-07T13:39:47.031] [20.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
>>> [2019-10-07T13:39:47.031] [20.0] debug: mpi type = (null)
>>> [2019-10-07T13:39:47.031] [20.0] debug: mpi/none: slurmstepd prefork
>>> [2019-10-07T13:39:47.031] [20.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
>>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: job abstract cores are '2'
>>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: step abstract cores are '2'
>>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: job physical cores are '4'
>>> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: step physical cores are '4'
>>> [2019-10-07T13:39:47.065] [20.0] debug level = 2
>>> [2019-10-07T13:39:47.065] [20.0] starting 1 tasks
>>> [2019-10-07T13:39:47.066] [20.0] task 0 (30336) started 2019-10-07T13:39:47
>>> [2019-10-07T13:39:47.066] [20.0] debug: jobacct_gather_cgroup_cpuacct_attach_task: jobid 20 stepid 0 taskid 0 max_task_id 0
>>> [2019-10-07T13:39:47.066] [20.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
>>> [2019-10-07T13:39:47.067] [20.0] debug: jobacct_gather_cgroup_memory_attach_task: jobid 20 stepid 0 taskid 0 max_task_id 0
>>> [2019-10-07T13:39:47.067] [20.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
>>> [2019-10-07T13:39:47.068] [20.0] debug: IO handler started pid=30331
>>> [2019-10-07T13:39:47.099] [20.0] debug: jag_common_poll_data: Task 0 pid 30336 ave_freq = 1597534 mem size/max 0/0 vmem size/max 210853888/210853888, disk read size/max (0/0), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
>>> [2019-10-07T13:39:47.101] [20.0] debug: mpi type = (null)
>>> [2019-10-07T13:39:47.101] [20.0] debug: Using mpi/none
>>> [2019-10-07T13:39:47.102] [20.0] debug: CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
>>> [2019-10-07T13:39:47.104] [20.0] debug: Sending launch resp rc=0
>>> [2019-10-07T13:39:47.105] [20.0] task 0 (30336) exited with exit code 0.
>>> [2019-10-07T13:39:47.139] [20.0] debug: step_terminate_monitor_stop signaling condition
>>> [2019-10-07T13:39:47.139] [20.0] debug: Waiting for IO
>>> [2019-10-07T13:39:47.140] [20.0] debug: Closing debug channel
>>> [2019-10-07T13:39:47.140] [20.0] debug: IO handler exited, rc=0
>>> [2019-10-07T13:39:47.148] [20.0] debug: Message thread exited
>>> [2019-10-07T13:39:47.149] [20.0] done with job
>>>
>>> I am not sure what I am missing. I hope someone can point out what I
>>> am doing wrong here.
>>> Thank you.
>>>
>>> Best regards,
>>> Eddy Swan
>>>
>>>
>>> --
>>> Marcus Wagner, Dipl.-Inf.
>>>
>>> IT Center
>>> Abteilung: Systeme und Betrieb
>>> RWTH Aachen University
>>> Seffenter Weg 23
>>> 52074 Aachen
>>> Tel: +49 241 80-24383
>>> Fax: +49 241 80-624383
>>> wag...@itc.rwth-aachen.de
>>> www.itc.rwth-aachen.de