Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

Steffen Grunewald Tue, 11 Jun 2019 07:30:19 -0700

On Tue, 2019-06-11 at 13:56:34 +0000, Marcelo Garcia wrote:
> Hi 
> 
> Since mid-March 2019 we have been having a strange problem with Slurm. Sometimes, 
> the command "sbatch" fails:
> 
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
> sbatch: error: Batch job submission failed: Socket timed out on send/recv operation


I've seen this kind of error message caused by problems in the underlying file system.
Is there anything special (e.g. non-NFS) in your setup that may have changed
in the past few months?

Just a shot in the dark, of course...

> Ecflow runs preprocessing on the script which generates a second script that 
> is submitted to slurm. In our case, the submission script is called 
> "42.job1". 
> 
> The problem we have is that sometimes, the "sbatch" command fails with the 
> message above. We couldn't find any hint in the logs. Hardware and software 
> logs are clean. I increased the debug level of slurm, to 
> # scontrol show config
> (..._)
> SlurmctldDebug          = info
> 
> But still no clue about what is happening. Maybe the next thing to try is to 
> use "sdiag" to inspect the server. Another complication is that the problem 
> is random, so should we put "sdiag" in a cron job? Is there a better way to run 
> "sdiag" periodically?
> 
> Thanks for your attention.
> 
> Best Regards
> 
> mg.
> 
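
On running "sdiag" periodically: a cron job seems a reasonable way to do it, as long as every snapshot carries a timestamp so it can be matched against the time of a failed "sbatch". Below is a minimal sketch, assuming Python 3.7+ on a host where "sdiag" is in the PATH. The function name, the log location and the 30-second timeout are made up for illustration and not taken from any Slurm documentation.

```python
import datetime
import os
import subprocess
import tempfile

# Hypothetical log location; pick whatever suits your site.
LOG_PATH = os.path.join(tempfile.gettempdir(), "sdiag-snapshots.log")


def snapshot(cmd=("sdiag",), log_path=LOG_PATH, timeout=30):
    """Run cmd once and append its timestamped output to log_path.

    Returns True if the command ran and exited with status 0.
    """
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
        body = result.stdout + result.stderr
        ok = result.returncode == 0
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing binary or a hung command is itself worth recording.
        body = f"command failed: {exc}\n"
        ok = False
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"=== {stamp} ok={ok} ===\n{body}\n")
    return ok


if __name__ == "__main__":
    snapshot()
```

Saved as a script and called from cron every minute or so, this appends one block per run. A snapshot that itself times out is logged as a failure, which may be a hint that the controller was not responding at that moment.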

- S

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
