Hi,

On Tue, Apr 27, 2021 at 03:14:04PM +0000, O'Grady, Paul Christopher wrote:
> Sometimes when a slurm job fails I want to see what a user did, getting the
> command/workdir/stdout/stderr information. I can see that with "scontrol
> show job <jobid>". However, after the job is done that command doesn't seem
> to work anymore, saying "invalid job id". I try to use sacct, which seems to
> save history, but I can only find the "workdir" parameter there, not
> stdout/stderr/cmd. I tried using the "jobname" field of sacct, but when I
> use the "wrap" option of sbatch, then jobname only shows the string "wrap"
> which isn't useful.
>
> My question: is there an easy way for me to get
> command/workdir/stdout/stderr information after a job has completed? Thanks!

Not sure if this is what you need. We do the following:

In slurm.conf set:

EpilogSlurmctld=/etc/slurm/slurm.epilogslurmctld

The epilog script does a number of things, including the following:

root@pople01:/etc/slurm # tail -6 slurm.epilogslurmctld
# 20150210 - Sean
# Save the details of a job by doing an scontrol show job=job
# So it can be referenced for troubleshooting in future if needed
# should be run by the slurm epilog
/usr/bin/scontrol show job="$SLURM_JOB_ID" > "$recordsdir/$SLURM_JOBID.record"

So it writes a record like the following to the file system:

root@pople01:/etc/slurm # cat /home/support/root/slurm_job_records/pople/2021/6.record
JobId=6 JobName=sbatch.sh
   UserId=smcgrat(5446) GroupId=smcgrat(9249) MCS_label=N/A
   Priority=1104631 Nice=0 Account=tchpc QOS=normal
   JobState=COMPLETING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-04-27T15:56:12 EligibleTime=2021-04-27T15:56:12
   AccrueTime=2021-04-27T15:56:12
   StartTime=2021-04-27T15:56:13 EndTime=2021-04-27T15:56:13 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-04-27T15:56:13
   Partition=compute AllocNode:Sid=pople01:14314
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   BatchHost=pople-n001
   NumNodes=2 NumCPUs=32 NumTasks=0 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=32,mem=126000M,node=2,billing=32
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=63000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/users/smcgrat/sbatch.sh
   WorkDir=/home/users/smcgrat
   StdErr=/home/users/smcgrat/slurm-6.out
   StdIn=/dev/null
   StdOut=/home/users/smcgrat/slurm-6.out
   Power=

Hope that helps.

Sean

>
> chris
>

-- 
Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin

sean.mcgr...@tchpc.tcd.ie

https://www.tcd.ie/
https://www.tchpc.tcd.ie/

+353 (0) 1 896 3725
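For anyone wanting to try the same idea, here is a minimal standalone sketch. It is not the script quoted above. It relies on EpilogSlurmctld running while the job is still known to slurmctld (note JobState=COMPLETING in the record above). Once a finished job is purged from slurmctld's memory, which is governed by MinJobAge in slurm.conf, scontrol reports "invalid job id". It assumes slurmctld exports SLURM_JOB_ID and SLURM_CLUSTER_NAME to the epilog. The base directory, the cluster/year layout (guessed from the path above, assuming "pople" is the cluster name) and the helper names save_job_record and show_job_io are all hypothetical.

```shell
#!/bin/bash
# Hypothetical EpilogSlurmctld sketch: archive "scontrol show job" output so
# Command/WorkDir/StdOut/StdErr remain available after slurmctld purges the job.
# RECORDS_BASE and the <cluster>/<year>/<jobid>.record layout are assumptions.

RECORDS_BASE="${RECORDS_BASE:-/home/support/root/slurm_job_records}"

# save_job_record JOBID [CLUSTER] [BASE_DIR]
# Writes the scontrol output for JOBID to BASE_DIR/CLUSTER/YEAR/JOBID.record.
save_job_record() {
    local jobid="$1"
    local cluster="${2:-default}"
    local base="${3:-$RECORDS_BASE}"
    local dir
    dir="$base/$cluster/$(date +%Y)"
    mkdir -p "$dir" || return 1
    # EpilogSlurmctld runs while the job is still in slurmctld's memory,
    # so scontrol can still resolve the job id at this point.
    scontrol show job "$jobid" > "$dir/$jobid.record"
}

# show_job_io RECORD_FILE
# Prints the Command=, WorkDir=, StdErr= and StdOut= fields of a saved record.
# Values are cut at the first space, so paths containing spaces are truncated.
show_job_io() {
    grep -oE '(Command|WorkDir|StdErr|StdOut)=[^ ]*' "$1"
}

# Only act when run by slurmctld, which exports SLURM_JOB_ID to the epilog.
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
    save_job_record "$SLURM_JOB_ID" "${SLURM_CLUSTER_NAME:-default}"
fi
```

To look up a finished job later, pass its record file to show_job_io, for example "show_job_io /home/support/root/slurm_job_records/pople/2021/6.record". For the record above, that should print its Command=, WorkDir=, StdErr= and StdOut= lines. Guarding on SLURM_JOB_ID keeps the file safe to source for the helper functions alone.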