Hello, I'm new to Slurm (I used SGE before) and new to this list. I am having some difficulty with Slurm array jobs; maybe you can help me?

I am running Slurm version 17.11.7 on Debian testing, with slurmdbd and fairshare. My current user has the following limits:

Fairshare = 99
MaxJobs = 50
MaxSubmitJobs = 100

I wrote a small C++ hello_world program for testing, and a 100-task hello_world array job works properly. If I take the same program and add a usleep of 10 seconds (to watch the behavior with squeue and to simulate a slightly longer-running program), some of the tasks fail (state FAILED) with exit code 126:0 (output of sacct -j) and WEXITSTATUS 254 (in the Slurm log). The proportion of failed tasks varies from one run to the next. Here is the error output of one of these tasks:

$ cat ERR/11617-9
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
/var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable

Note that I have enough resources to run more than 50 jobs at the same time ...

If I rerun my submission script and force Slurm to run only 10 tasks at a time (--array=1-100%10), all tasks succeed. But if I limit it to 30 tasks at a time (--array=1-100%30), some of them fail again.

Has anyone ever faced this type of problem? If so, please kindly enlighten me.

Regards
Jean-Mathieu Chantrein
In charge of the LERIA computing center
University of Angers

__________________
hello_array.slurm

#!/bin/bash
# hello.slurm
#SBATCH --job-name=hello
#SBATCH --output=OUT/%A-%a
#SBATCH --error=ERR/%A-%a
#SBATCH --partition=std
#SBATCH --array=1-100%10

./hello $SLURM_ARRAY_TASK_ID

________________
main.cpp

#include <iostream>
#include <unistd.h>

int main(int argc, char** argv)
{
    // Sleep for 10 seconds to simulate a longer-running program
    usleep(10000000);
    std::cout << "Hello world! job array number " << argv[1] << std::endl;
    return 0;
}