Re: [slurm-users] slurmdbd purge not working

2019-04-05 Thread Ole Holm Nielsen
On 4/5/19 4:28 PM, Julien Rey wrote: The failure occurs after a few minutes (~10). And we are running out of space on the slurm controller. The mysql daemon is at 100% CPU usage all the time. This issue is becoming critical. ... Our slurm accounting database is growing bigger and bigger (more

Re: [slurm-users] slurmdbd purge not working

2019-04-05 Thread Ole Holm Nielsen
Hi Julien, Did you optimize the MySQL database, in particular InnoDB? I have collected some documentation in my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_database#mysql-configuration and I also discuss database purging. Please note that we run Slurm 17.11 (and recently 18.08) on Cent

Re: [slurm-users] slurmdbd purge not working

2019-04-05 Thread Julien Rey
The failure occurs after a few minutes (~10). And we are running out of space on the slurm controller. The mysql daemon is at 100% CPU usage all the time. This issue is becoming critical. On 05/04/2019 16:10, Paul Edmon wrote: Did it just time out, or did that failure happen immediately? I

Re: [slurm-users] slurmdbd purge not working

2019-04-05 Thread Paul Edmon
Did it just time out, or did that failure happen immediately? If immediate you may be in a situation where you are hitting a bug. It "should" be safe to upgrade to a later version of 15.08.*. There may be fixes in there related to that. I would look at the changelog though just to see if ther

Re: [slurm-users] disable-bindings disables counting of gres resources

2019-04-05 Thread Quirin Lohr
Same problem here: a job submitted with gres-flags=disable-bindings is assigned a node, but then the job step fails because all GPUs on that node are already in use. Log messages: [2019-04-05T15:29:05.216] error: gres/gpu: job 92453 node node5 overallocated resources by 1, (9 > 8) [2019-04-05
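
For anyone wanting to find the affected jobs in a controller log, here is a minimal Python sketch that extracts the fields from an overallocation error line. The pattern is modeled only on the single log line quoted above, so other Slurm versions may format the message differently. The function name and the field names `count` and `limit` (for the two numbers in parentheses) are my own labels, not Slurm terminology.

```python
import re

# Pattern modeled on the one log line quoted in the message above.
# This is an assumption about the format, not a documented Slurm log schema.
OVERALLOC_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] error: gres/(?P<gres>\w+): "
    r"job (?P<job_id>\d+) node (?P<node>\S+) "
    r"overallocated resources by (?P<excess>\d+), "
    r"\((?P<count>\d+) > (?P<limit>\d+)\)"
)


def parse_overallocation(line):
    """Return a dict of fields from an overallocation error line, or None.

    'count' and 'limit' are hypothetical names for the two numbers
    printed in parentheses at the end of the message.
    """
    match = OVERALLOC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("job_id", "excess", "count", "limit"):
        fields[key] = int(fields[key])
    return fields


sample = (
    "[2019-04-05T15:29:05.216] error: gres/gpu: job 92453 node node5 "
    "overallocated resources by 1, (9 > 8)"
)
print(parse_overallocation(sample))
```

Lines that do not match the pattern return None, so the function can be applied to every line of a log file and the None results filtered out.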

Re: [slurm-users] slurmdbd purge not working

2019-04-05 Thread Julien Rey
Hi Paul, thanks for your advice. Actually I already tried what you suggested. No matter what value I put after PurgeJobAfter, I always end up with the same error: sacctmgr archive dump Directory=/home/joule/archives/ PurgeJobAfter=1days sacctmgr: error: slurmdbd: Getting response to message t

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2019-04-05 Thread Ole Holm Nielsen
Hi Lech, Thanks! I added the 18.08 Release Notes reference to https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#database-upgrade-from-slurm-17-02-and-older I've already upgraded from 17.11 to 18.08 without your patch, and this went smoothly as expected. We upgraded from 17.02 to 17.11 l

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2019-04-05 Thread Lech Nieroda
Hi Ole, your summary is correct as far as I can tell and will hopefully help some users. One thing I’d add is the remark from the 18.08 Release Notes ( https://github.com/SchedMD/slurm/blob/slurm-18.08/RELEASE_NOTES ), which adds mysql 5.5 to the list. They’ve mentioned that mysql 5.5 is the def

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2019-04-05 Thread Ole Holm Nielsen
Hi Lech, I've tried to summarize your work on the Slurm database upgrade patch in my Slurm Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#database-upgrade-from-slurm-17-02-and-older Could you kindly check if my notes are correct and complete? Hopefully this Wiki will also h