Hi Marcus,

piglet-17 as submit host:
$ id 1000
uid=1000(turing) gid=1000(turing) groups=1000(turing),10(wheel),991(vboxusers)
piglet-18:
$ id 1000
uid=1000(turing) gid=1000(turing) groups=1000(turing),10(wheel),992(vboxusers)

id 1000 is a local user on each node (piglet-17~19). I also tried to submit as an LDAP user, but still got the same error.

Best regards,
Eddy Swan

On Mon, Oct 7, 2019 at 2:41 PM Marcus Wagner <wag...@itc.rwth-aachen.de> wrote:

> Hi Eddy,
>
> what is the result of "id 1000" on the submit host and on piglet-18?
>
> Best
> Marcus
>
> On 10/7/19 8:07 AM, Eddy Swan wrote:
>
> Hi All,
>
> I am currently testing Slurm version 19.05.3-2 on CentOS 7 with one master and three nodes.
> I used the same configuration that works on version 17.02.7, but for some reason it does not work on 19.05.3-2.
>
> $ srun hostname
> srun: error: Unable to create step for job 19: Error generating job credential
> srun: Force Terminated job 19
>
> If I run it as root, it works fine.
>
> $ sudo srun hostname
> piglet-18
>
> Configuration:
> $ cat /etc/slurm/slurm.conf
> # Common
> ControlMachine=slurm-master
> ControlAddr=10.15.131.32
> ClusterName=slurm-cluster
> RebootProgram="/usr/sbin/reboot"
>
> MailProg=/bin/mail
> ProctrackType=proctrack/cgroup
> ReturnToService=2
> StateSaveLocation=/var/spool/slurmctld
> TaskPlugin=task/cgroup
>
> # LOGGING AND ACCOUNTING
> AccountingStorageType=accounting_storage/filetxt
> AccountingStorageLoc=/var/log/slurm_acct/slurm_jobacct.log
> JobCompLoc=/var/log/slurm_acct/slurm_jobcomp.log
> JobAcctGatherType=jobacct_gather/cgroup
>
> # RESOURCES
> MemLimitEnforce=no
>
> ## Rack 1
> NodeName=piglet-19 NodeAddr=10.15.2.19 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
> NodeName=piglet-18 NodeAddr=10.15.2.18 RealMemory=128000 TmpDisk=512000 Sockets=2 CoresPerSocket=14 ThreadsPerCore=1 CPUSpecList=0,1 Weight=2
> NodeName=piglet-17 NodeAddr=10.15.2.17 RealMemory=64000 TmpDisk=512000 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 CPUSpecList=0,1 Weight=3
>
> # Preempt
> PreemptMode=REQUEUE
> PreemptType=preempt/qos
>
> PartitionName=batch Nodes=ALL MaxTime=2880 OverSubscribe=YES State=UP PreemptMode=REQUEUE PriorityTier=10 Default=YES
>
> # TIMERS
> KillWait=30
> MinJobAge=300
> MessageTimeout=3
>
> # SCHEDULING
> FastSchedule=1
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> #SelectTypeParameters=CR_Core_Memory
> SelectTypeParameters=CR_CPU_Memory
> DefMemPerCPU=128
>
> # Limit
> MaxArraySize=201
>
> # slurmctld
> SlurmctldDebug=5
> SlurmctldLogFile=/var/log/slurm/slurmctld.log
> SlurmctldPidFile=/var/slurm/slurmctld.pid
> SlurmctldPort=6817
> SlurmctldTimeout=60
> SlurmUser=slurm
>
> # slurmd
> SlurmdDebug=5
> SlurmdLogFile=/var/log/slurmd.log
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmdTimeout=300
>
> # REQUEUE
> #RequeueExitHold=1-199,201-255
> #RequeueExit=200
> RequeueExitHold=201-255
> RequeueExit=200
>
> Slurmctld.log
> [2019-10-07T13:38:47.724] debug: sched: Running job scheduler
> [2019-10-07T13:38:49.254] error: slurm_auth_get_host: Lookup failed: Unknown host
> [2019-10-07T13:38:49.255] sched: _slurm_rpc_allocate_resources JobId=19 NodeList=piglet-18 usec=959
> [2019-10-07T13:38:49.259] debug: laying out the 1 tasks on 1 hosts piglet-18 dist 2
> [2019-10-07T13:38:49.260] error: slurm_cred_create: getpwuid failed for uid=1000
> [2019-10-07T13:38:49.260] error: slurm_cred_create error
> [2019-10-07T13:38:49.262] _job_complete: JobId=19 WTERMSIG 1
> [2019-10-07T13:38:49.265] _job_complete: JobId=19 done
> [2019-10-07T13:38:49.270] debug: sched: Running job scheduler
> [2019-10-07T13:38:56.823] debug: sched: Running job scheduler
> [2019-10-07T13:39:13.504] debug: backfill: beginning
> [2019-10-07T13:39:13.504] debug: backfill: no jobs to backfill
> [2019-10-07T13:39:40.871] debug: Spawning ping agent for piglet-19
> [2019-10-07T13:39:43.504] debug: backfill: beginning
> [2019-10-07T13:39:43.504] debug: backfill: no jobs to backfill
> [2019-10-07T13:39:46.999] error: slurm_auth_get_host: Lookup failed: Unknown host
> [2019-10-07T13:39:47.001] sched: _slurm_rpc_allocate_resources JobId=20 NodeList=piglet-18 usec=979
> [2019-10-07T13:39:47.005] debug: laying out the 1 tasks on 1 hosts piglet-18 dist 2
> [2019-10-07T13:39:47.144] _job_complete: JobId=20 WEXITSTATUS 0
> [2019-10-07T13:39:47.147] _job_complete: JobId=20 done
> [2019-10-07T13:39:47.158] debug: sched: Running job scheduler
> [2019-10-07T13:39:48.428] error: slurm_auth_get_host: Lookup failed: Unknown host
> [2019-10-07T13:39:48.429] sched: _slurm_rpc_allocate_resources JobId=21 NodeList=piglet-18 usec=1114
> [2019-10-07T13:39:48.434] debug: laying out the 1 tasks on 1 hosts piglet-18 dist 2
> [2019-10-07T13:39:48.559] _job_complete: JobId=21 WEXITSTATUS 0
> [2019-10-07T13:39:48.560] _job_complete: JobId=21 done
>
> slurmd.log on piglet-18
> [2019-10-07T13:38:42.746] debug: _rpc_terminate_job, uid = 3001
> [2019-10-07T13:38:42.747] debug: credential for job 17 revoked
> [2019-10-07T13:38:47.721] debug: _rpc_terminate_job, uid = 3001
> [2019-10-07T13:38:47.722] debug: credential for job 18 revoked
> [2019-10-07T13:38:49.267] debug: _rpc_terminate_job, uid = 3001
> [2019-10-07T13:38:49.268] debug: credential for job 19 revoked
> [2019-10-07T13:39:47.014] launch task 20.0 request from UID:0 GID:0 HOST:10.15.2.19 PORT:62137
> [2019-10-07T13:39:47.014] debug: Checking credential with 404 bytes of sig data
> [2019-10-07T13:39:47.016] _run_prolog: run job script took usec=7
> [2019-10-07T13:39:47.016] _run_prolog: prolog with lock for job 20 ran for 0 seconds
> [2019-10-07T13:39:47.026] debug: AcctGatherEnergy NONE plugin loaded
> [2019-10-07T13:39:47.026] debug: AcctGatherProfile NONE plugin loaded
> [2019-10-07T13:39:47.026] debug: AcctGatherInterconnect NONE plugin loaded
> [2019-10-07T13:39:47.026] debug: AcctGatherFilesystem NONE plugin loaded
> [2019-10-07T13:39:47.026] debug: switch NONE plugin loaded
> [2019-10-07T13:39:47.028] [20.0] debug: CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
> [2019-10-07T13:39:47.028] [20.0] debug: Job accounting gather cgroup plugin loaded
> [2019-10-07T13:39:47.028] [20.0] debug: cont_id hasn't been set yet not running poll
> [2019-10-07T13:39:47.029] [20.0] debug: Message thread started pid = 30331
> [2019-10-07T13:39:47.030] [20.0] debug: task/cgroup: now constraining jobs allocated cores
> [2019-10-07T13:39:47.030] [20.0] debug: task/cgroup: loaded
> [2019-10-07T13:39:47.030] [20.0] debug: Checkpoint plugin loaded: checkpoint/none
> [2019-10-07T13:39:47.030] [20.0] Munge credential signature plugin loaded
> [2019-10-07T13:39:47.031] [20.0] debug: job_container none plugin loaded
> [2019-10-07T13:39:47.031] [20.0] debug: mpi type = none
> [2019-10-07T13:39:47.031] [20.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/freezer/slurm' already exists
> [2019-10-07T13:39:47.031] [20.0] debug: spank: opening plugin stack /etc/slurm/plugstack.conf
> [2019-10-07T13:39:47.031] [20.0] debug: mpi type = (null)
> [2019-10-07T13:39:47.031] [20.0] debug: mpi/none: slurmstepd prefork
> [2019-10-07T13:39:47.031] [20.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: job abstract cores are '2'
> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: step abstract cores are '2'
> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: job physical cores are '4'
> [2019-10-07T13:39:47.032] [20.0] debug: task/cgroup: step physical cores are '4'
> [2019-10-07T13:39:47.065] [20.0] debug level = 2
> [2019-10-07T13:39:47.065] [20.0] starting 1 tasks
> [2019-10-07T13:39:47.066] [20.0] task 0 (30336) started 2019-10-07T13:39:47
> [2019-10-07T13:39:47.066] [20.0] debug: jobacct_gather_cgroup_cpuacct_attach_task: jobid 20 stepid 0 taskid 0 max_task_id 0
> [2019-10-07T13:39:47.066] [20.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuacct/slurm' already exists
> [2019-10-07T13:39:47.067] [20.0] debug: jobacct_gather_cgroup_memory_attach_task: jobid 20 stepid 0 taskid 0 max_task_id 0
> [2019-10-07T13:39:47.067] [20.0] debug: xcgroup_instantiate: cgroup '/sys/fs/cgroup/memory/slurm' already exists
> [2019-10-07T13:39:47.068] [20.0] debug: IO handler started pid=30331
> [2019-10-07T13:39:47.099] [20.0] debug: jag_common_poll_data: Task 0 pid 30336 ave_freq = 1597534 mem size/max 0/0 vmem size/max 210853888/210853888, disk read size/max (0/0), disk write size/max (0/0), time 0.000000(0+0) Energy tot/max 0/0 TotPower 0 MaxPower 0 MinPower 0
> [2019-10-07T13:39:47.101] [20.0] debug: mpi type = (null)
> [2019-10-07T13:39:47.101] [20.0] debug: Using mpi/none
> [2019-10-07T13:39:47.102] [20.0] debug: CPUs:28 Boards:1 Sockets:2 CoresPerSocket:14 ThreadsPerCore:1
> [2019-10-07T13:39:47.104] [20.0] debug: Sending launch resp rc=0
> [2019-10-07T13:39:47.105] [20.0] task 0 (30336) exited with exit code 0.
> [2019-10-07T13:39:47.139] [20.0] debug: step_terminate_monitor_stop signaling condition
> [2019-10-07T13:39:47.139] [20.0] debug: Waiting for IO
> [2019-10-07T13:39:47.140] [20.0] debug: Closing debug channel
> [2019-10-07T13:39:47.140] [20.0] debug: IO handler exited, rc=0
> [2019-10-07T13:39:47.148] [20.0] debug: Message thread exited
> [2019-10-07T13:39:47.149] [20.0] done with job
>
> I am not sure what I am missing. Hope someone can point out what I am doing wrong here.
> Thank you.
>
> Best regards,
> Eddy Swan
>
>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wag...@itc.rwth-aachen.de
> www.itc.rwth-aachen.de
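The two slurmctld errors quoted above ("slurm_cred_create: getpwuid failed for uid=1000" and "slurm_auth_get_host: Lookup failed: Unknown host") both point at name-service lookups on the controller: the job credential is created by slurmctld, so uid 1000 has to resolve on slurm-master itself, not only on the compute nodes. The commands below are only a sketch of that check, run on slurm-master; nothing in the thread confirms what its passwd or DNS sources look like, and the hostname/IP in the last two lines are taken from the thread as an example.

# On slurm-master, where slurmctld logs the getpwuid failure:
$ getent passwd 1000     # no output here means getpwuid(1000) fails on the controller
$ id 1000                # should match the output shown for piglet-17/18 above

# Possibly related to the "Unknown host" error: check that the submit host resolves on the controller
$ getent hosts piglet-17
$ getent hosts 10.15.2.17

If `getent passwd 1000` returns nothing on slurm-master while it works on the nodes, that would be consistent with the local-user-per-node setup described at the top of this message.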