On Tue, 2019-06-11 at 13:56:34 +0000, Marcelo Garcia wrote:
> Hi
>
> Since mid-March 2019 we are having a strange problem with slurm. Sometimes,
> the command "sbatch" fails:
>
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
> sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

I've seen such an error message from the underlying file system. Is there
anything special (e.g. non-NFS) in your setup that may have changed in the
past few months?

Just a shot in the dark, of course...

> Ecflow runs preprocessing on the script which generates a second script that
> is submitted to slurm. In our case, the submission script is called
> "42.job1".
>
> The problem we have is that sometimes, the "sbatch" command fails with the
> message above. We couldn't find any hint in the logs. Hardware and software
> logs are clean. I increased the debug level of slurm, to
> # scontrol show config
> (..._)
> SlurmctldDebug = info
>
> But still no clue about what is happening. Maybe the next thing to try is to
> use "sdiag" to inspect the server. Another complication is that the problem
> is random, so we put "sdiag" in a cronjob? Is there a better way to run
> "sdiag" periodically?
>
> Thanks for your attention.
>
> Best Regards
>
> mg.
>

- S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
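
On the question of running "sdiag" periodically: a cron entry would probably
do, but a small polling loop is another option if snapshots are wanted more
often than once a minute. The following is only a minimal sketch, assuming
"sdiag" is on the PATH of the user running it. The function names "snapshot"
and "poll", the interval, and the log file location are hypothetical choices
made for illustration, not anything taken from the thread or from Slurm.

```python
import datetime
import subprocess
import time


def snapshot(cmd=("sdiag",), timeout=30):
    """Run cmd once and return its output under a UTC timestamp header."""
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
        body = result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A hanging or missing command is itself worth recording,
        # since the problem being chased is a timeout.
        body = f"command failed: {exc}\n"
    return f"=== {stamp} ===\n{body}"


def poll(logfile, interval=60, count=None, cmd=("sdiag",)):
    """Append a snapshot to logfile every `interval` seconds.

    count=None means run until interrupted; a number limits the
    iterations (hypothetical parameter, handy for a trial run).
    """
    done = 0
    while count is None or done < count:
        with open(logfile, "a") as fh:
            fh.write(snapshot(cmd))
        done += 1
        if count is None or done < count:
            time.sleep(interval)
```

Calling, for example, poll("/tmp/sdiag.log", interval=30) would then leave a
timestamped history that can be lined up against the time of a failed
"sbatch". The timestamp header is the important part, since the failures are
random and need to be matched after the fact.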