Hi,

On one of our clusters the "Epilog" setup in slurm.conf (Epilog=/etc/slurm/slurm.epilog.local) calls the following to run the tmpwatch utility with a tiny access time on /tmp. I think tmpwatch can be run on specified paths, not just /tmp.
#####################################
# First, clean out /tmp, ruthlessly #
#####################################
# 2022-04-20, smcg...@tcd.ie, RT#25239, RT#25262
# this will delete anything in /tmp older than a minute
outfile=/tmp/epilog.$CLUSTER.$(date +%Y%m%d%H%M%S).txt
{
  echo "------------------------------------------------------------------"
  echo -n "* /tmp maintenance - "
  date
  echo -n "* Current usage"
  echo ""
  df -h /tmp
  echo -n "* Running tmpwatch to delete anything in /tmp older than a minute"
  /usr/sbin/tmpwatch --atime 1m /tmp
  echo ""
  echo -n "* /tmp usage now"
  echo ""
  df -h /tmp
  echo ""
} >> "${outfile}"

The last thing our epilog setup does is run the standard /etc/slurm/slurm.epilog.clean.dist. Looking at that, it uses the SLURM_UID and SLURM_JOB_ID variables, so I would guess it also has access to the other variables available in that context.

Best

Sean

---
Sean McGrath
Senior Systems Administrator, IT Services

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Jason Simms <jsim...@swarthmore.edu>
Sent: Tuesday 10 October 2023 16:59
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Clean Up Scratch After Failed Job

Hello all,

Our template scripts for Slurm include a workflow that copies files to a scratch space before the job runs, copies any output files, etc. back to the original submit directory when the job completes, and finally cleans up (deletes) the scratch space before exiting. This works great until a job fails or is requeued, in which case the scratch space isn't cleaned up. In the past, I've run a cron job that deletes any material in scratch that hasn't been modified for some number of days beyond the maximum job length, but that can still allow "zombie" material to remain in scratch for quite a while.

I'm intrigued by using an epilog script that is triggered after each job completes (whether normally or due to failure, requeuing, etc.) to accomplish the same task more efficiently and consistently. The first question is in which context I would run the epilog. I presume I'd want it to run after a job completes entirely, so looking at the table, I think I'd want an Epilog script that runs on the compute node. Reading the documentation, however, it is unclear to me that all the variables I would need will be available in such a script. We use the variables $USER, $SLURM_JOB_NAME, and $SLURM_JOB_ID to create a path within scratch unique to each job. Specifically, however, the documentation for $SLURM_JOB_NAME says:

"SLURM_JOB_NAME
Name of the job. Available in PrologSlurmctld, SrunProlog, TaskProlog, EpilogSlurmctld, SrunEpilog and TaskEpilog."

So it doesn't seem to be available in the appropriate context. Thinking about it, however, I presume that if I only use $SLURM_JOB_ID and $USER (and then $SLURM_JOB_USER in the epilog script), the path would still be unique; meaning, I could just not use the job name.

Anyway, if anyone has any thoughts or examples of setting up something like this, I'd appreciate it!

Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms
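
For reference, a minimal, untested sketch of the kind of compute-node Epilog discussed in this thread might look something like the bash below. It is an illustration only: the /scratch root and the ${SCRATCH_BASE}/${SLURM_JOB_USER}/${SLURM_JOB_ID} layout are assumptions made for the example rather than anything defined in the thread, and it relies only on SLURM_JOB_ID and SLURM_JOB_USER since, as noted above, SLURM_JOB_NAME is not available in the Epilog context.

#!/bin/bash
# Sketch of a per-job scratch cleanup Epilog (illustration only).
# Assumes job scripts stage into ${SCRATCH_BASE}/${SLURM_JOB_USER}/${SLURM_JOB_ID};
# both the base path and that layout are assumptions, not from the thread.

SCRATCH_BASE=/scratch   # hypothetical scratch root; adjust to the real filesystem

# The compute-node Epilog has SLURM_JOB_ID and SLURM_JOB_USER set (SLURM_JOB_NAME is not).
[ -n "$SLURM_JOB_ID" ] && [ -n "$SLURM_JOB_USER" ] || exit 0

job_scratch="${SCRATCH_BASE}/${SLURM_JOB_USER}/${SLURM_JOB_ID}"

# Only remove a path that is strictly inside the scratch root (at least two
# components below it), so a mis-set variable cannot widen the delete.
case "$job_scratch" in
  "$SCRATCH_BASE"/*/*) ;;
  *) exit 0 ;;
esac

if [ -d "$job_scratch" ]; then
  rm -rf -- "$job_scratch"
fi

# Exit 0 regardless: a non-zero Epilog exit code drains the node.
exit 0

Such a script would be pointed to by Epilog= in slurm.conf, or folded into an existing epilog like the slurm.epilog.local above. The Epilog runs as the SlurmdUser (usually root) on each node allocated to the job, which suits node-local scratch; if scratch is a shared filesystem, the layout and guard checks would need adjusting accordingly.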