Further investigation found that I had set up logrotate to run a MySQL
dump,
mysqldump -R --single-transaction -B slurm_db | bzip2
which is what is taking 5 minutes. I think the dump is locking tables for
that whole time, which most likely hangs calls to slurmdbd and causes the
issue. I will need to rework it.
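One possible shape for that rework, offered only as a sketch: move the dump out of the logrotate hook into its own script run from a separate cron entry, so that log rotation no longer waits on it. The function name dump_slurm_db and the BACKUP_DIR path are made up for illustration, and the mysqldump options are simply the ones from the command above. This does not by itself address whether the dump stalls slurmdbd; it only decouples the two jobs.

```shell
#!/bin/sh
# Hypothetical standalone backup script, meant to run from its own cron
# entry rather than from a logrotate hook. BACKUP_DIR is an assumed path.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/slurm}"

dump_slurm_db() {
    mkdir -p "$BACKUP_DIR" || return 1
    out="$BACKUP_DIR/slurm_db-$(date +%Y%m%d).sql.bz2"
    tmp=$(mktemp "$BACKUP_DIR/slurm_db.XXXXXX") || return 1
    # Dump first and compress second, so that a mysqldump failure is not
    # masked by the exit status of bzip2 at the end of a pipe.
    if mysqldump -R --single-transaction -B slurm_db > "$tmp" \
        && bzip2 < "$tmp" > "$out"; then
        rm -f "$tmp"
    else
        # Remove partial output so a broken dump is never left in place.
        rm -f "$tmp" "$out"
        return 1
    fi
}
```

Calling dump_slurm_db at the end of the script and scheduling it at a different hour than cron.daily would keep the logrotate run itself short.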
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Mon, 19 Sep 2022 9:29am, Reed Dier wrote:
I’m not sure if this will be helpful, but my logrotate.d for slurm looks a bit
different: instead of a systemctl reload, I send a SIGUSR2 signal, which is
supposedly meant specifically for log rotation in slurm.
postrotate
pkill -x --signal SIGUSR2 slurmctld
pkill -x --signal SIGUSR2 slurmd
pkill -x --signal SIGUSR2 slurmdbd
exit 0
endscript
I would take a look here: https://slurm.schedmd.com/slurm.conf.html#lbAQ
Reed
On Sep 19, 2022, at 7:46 AM, Paul Raines <rai...@nmr.mgh.harvard.edu> wrote:
I have had two nights where right at 3:35am a bunch of jobs were
killed early with TIMEOUT, well before their normal TimeLimit.
The slurmctld log has lots of lines like this one at 3:35am:
[2022-09-12T03:35:02.303] job_time_limit: inactivity time limit reached for JobId=1636922
for jobs running on several different nodes.
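To see how tightly these messages cluster around the logrotate window, something like the following could be used. It is only a sketch: count_inactivity_kills is a made-up helper name, and the log path in the usage note is an assumption (the real one is whatever SlurmctldLogFile is set to in slurm.conf).

```shell
# Count "inactivity time limit reached" messages per minute in a
# slurmctld log. The log file path is passed as the first argument.
count_inactivity_kills() {
    # Timestamps look like [2022-09-12T03:35:02.303]; characters 2-17
    # of each line are therefore YYYY-MM-DDTHH:MM.
    grep 'inactivity time limit reached' "$1" | cut -c2-17 | sort | uniq -c
}
```

For example, count_inactivity_kills /var/log/slurm/slurmctld.log (path assumed) prints one line per minute with the number of matching messages in front.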
The one curious thing is that log rotation is happening in cron on the
slurmctld master node right about this time:
Sep 12 03:30:02 mlsc-head run-parts[1719028]: (/etc/cron.daily) starting logrotate
Sep 12 03:34:59 mlsc-head run-parts[1719028]: (/etc/cron.daily) finished logrotate
The 5 minute runtime here is a big anomaly. On other machines, like
nodes just running slurmd or my web servers, this only takes a couple of
seconds.
In /etc/logrotate.d/slurmctl I have
postrotate
systemctl reload slurmdbd >/dev/null 2>/dev/null || true
/bin/sleep 1
systemctl reload slurmctld >/dev/null 2>/dev/null || true
endscript
Does it make sense that this could be causing the issue?
In slurm.conf I had InactiveLimit=60, which I guess is what is being
triggered, but my reading of the docs on this setting was that it only
affects the starting of a job with srun/salloc and not a job that has been
running for days. Is it InactiveLimit that leads to the "inactivity time
limit reached" message?
Anyway, I have changed it to InactiveLimit=600 to see if that helps.
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA
The information in this e-mail is intended only for the person to whom it is
addressed. If you believe this e-mail was sent to you in error and the e-mail
contains patient information, please contact the Mass General Brigham Compliance
HelpLine at https://www.massgeneralbrigham.org/complianceline
<https://www.massgeneralbrigham.org/complianceline> .
Please note that this e-mail is not secure (encrypted). If you do not wish to
continue communication over unencrypted e-mail, please notify the sender of
this message immediately. Continuing to send or respond to e-mail after
receiving this message means you understand and accept this risk and wish to
continue to communicate over unencrypted e-mail.