Hi,

On one of our clusters the Epilog setup in slurm.conf
(Epilog=/etc/slurm/slurm.epilog.local) calls the following to run the tmpwatch
utility with a very short access-time threshold on /tmp. I believe tmpwatch can
be pointed at any specified paths, not just /tmp, so the same approach should
work for a scratch area.

#####################################
# First, clean out /tmp, ruthlessly #
#####################################
# 2022-04-20, smcg...@tcd.ie, RT#25239, RT#25262
# this will delete anything in /tmp older than a minute

outfile="/tmp/epilog.${CLUSTER}.$(date +%Y%m%d%H%M%S).txt"
{
echo "------------------------------------------------------------------"
echo -n "* /tmp maintenance - "
date
echo "* Current usage"
df -h /tmp

echo "* Running tmpwatch to delete anything in /tmp older than a minute"
/usr/sbin/tmpwatch --atime 1m /tmp

echo "* /tmp usage now"
df -h /tmp

echo ""
} >> "${outfile}"

The last thing our epilog setup does is run the standard
/etc/slurm/slurm.epilog.clean.dist. Looking at that script, it uses the
SLURM_UID and SLURM_JOB_ID variables, so I would expect the other variables
available in the Epilog context to be accessible as well.
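
So a per-job cleanup along these lines ought to work from an Epilog script.
This is an untested sketch; the /scratch prefix and the path layout are
assumptions you would need to adapt to your setup:

#!/bin/bash
# The Epilog runs on the compute node after each job finishes, whatever
# its exit status. SLURM_JOB_USER and SLURM_JOB_ID are documented as
# available in this context.
scratch="/scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}"
# Guard against unset variables so we can never remove /scratch itself.
if [ -n "${SLURM_JOB_USER}" ] && [ -n "${SLURM_JOB_ID}" ] && [ -d "${scratch}" ]; then
    rm -rf "${scratch}"
fi
# A non-zero exit status from an Epilog drains the node, so exit cleanly.
exit 0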

Best

Sean

---
Sean McGrath
Senior Systems Administrator, IT Services

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Jason 
Simms <jsim...@swarthmore.edu>
Sent: Tuesday 10 October 2023 16:59
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Clean Up Scratch After Failed Job

Hello all,

Our template scripts for Slurm include a workflow that copies files to a
scratch space before running a job, copies any output files back to the
original submit directory on job completion, and finally cleans up (deletes)
the scratch space before exiting. This works great until a job fails or is
requeued, in which case the scratch space isn't cleaned up.
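
A stripped-down illustration of that workflow (the paths and file names here
are placeholders, not our actual template):

SCRATCH="/scratch/${USER}/${SLURM_JOB_NAME}.${SLURM_JOB_ID}"
mkdir -p "${SCRATCH}"
cp -r "${SLURM_SUBMIT_DIR}"/input* "${SCRATCH}"   # stage in
cd "${SCRATCH}" || exit 1
# ... run the actual computation ...
cp -r output* "${SLURM_SUBMIT_DIR}"               # stage out
rm -rf "${SCRATCH}"                               # cleanup; never reached if the job fails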

In the past, I've run a cron job that deletes any material in scratch that
hasn't been modified in some number of days beyond the maximum length of a
job, but that can still allow "zombie" material to linger in scratch for quite
a while. I'm intrigued by using an epilog script, triggered after each job
completes (whether normally or due to failure, requeuing, etc.), to accomplish
the same task more efficiently and consistently.
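
For reference, that cron sweep boiled down to a find one-liner along these
lines (the path and the 14-day age are placeholders):

find /scratch -mindepth 1 -mtime +14 -delete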

The first question is in which context I would run the epilog. I presume I'd
want it to run after a job completes entirely, so looking at the table in the
documentation, I think I'd want an Epilog script that runs on the compute
node. Reading the documentation, however, it is unclear to me whether all the
variables I need will be available in such a script. We use the variables
$USER, $SLURM_JOB_NAME, and $SLURM_JOB_ID to create a path within scratch
unique to each job.

Specifically, however, the documentation for $SLURM_JOB_NAME says:

"SLURM_JOB_NAME Name of the job. Available in PrologSlurmctld, SrunProlog, 
TaskProlog, EpilogSlurmctld, SrunEpilog and TaskEpilog."

So it doesn't seem to be available in the appropriate context. Thinking about
it, though, if I use only $SLURM_JOB_ID and $USER (and then $SLURM_JOB_USER in
the epilog script), the path would still be unique, so I could simply drop the
job name.

Anyway, if anyone has any thoughts or examples of setting up something like 
this, I'd appreciate it!

Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms
