Hi 

Since mid-March 2019 we have been having a strange problem with Slurm.
Sometimes the command "sbatch" fails:

+ sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation

ecFlow preprocesses the task script and generates a second script, which is
the one submitted to Slurm. In our case, the submission script is called "42.job1".

The failure is intermittent, and we couldn't find any hint in the logs:
hardware and software logs are clean. I increased the debug level of Slurm to:
# scontrol show config
(..._)
SlurmctldDebug          = info

But we still have no clue about what is happening. Maybe the next thing to try
is to use "sdiag" to inspect the server. Another complication is that the
problem is random, so should we put "sdiag" in a cron job? Is there a better
way to run "sdiag" periodically?
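
For reference, the cron approach I have in mind is roughly the sketch below. It
appends a timestamped "sdiag" snapshot to a log file, so the statistics around
a failed "sbatch" can be looked up afterwards. The variable names SDIAG_CMD and
SDIAG_LOG and the log path are my own placeholders, not Slurm settings, and I
have not tested this on our production system.

```shell
#!/bin/sh
# Hypothetical helper: append one timestamped sdiag snapshot to a log file.
# SDIAG_CMD and SDIAG_LOG are illustrative names (assumptions), not Slurm options.
SDIAG_CMD="${SDIAG_CMD:-sdiag}"
SDIAG_LOG="${SDIAG_LOG:-/var/tmp/sdiag-snapshots.log}"

snapshot_sdiag() {
    {
        # Header line so each snapshot can be matched against a failure time.
        printf '=== %s ===\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')"
        # Capture stderr as well, so a timeout of sdiag itself gets recorded.
        $SDIAG_CMD 2>&1
        # Record whether the command itself succeeded.
        printf 'exit status: %s\n' "$?"
    } >> "$SDIAG_LOG"
}

snapshot_sdiag
```

Saved under a name of our choosing (say /usr/local/sbin/sdiag-snapshot.sh, also
a placeholder), it could be run every minute with a crontab line such as
"* * * * * /usr/local/sbin/sdiag-snapshot.sh". The log file grows without
bound, so it would need rotating.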

Thanks for your attention.

Best Regards

mg.
