What does ulimit tell you on the compute node(s) where the jobs are running? The error message you cited arises when a user has reached the per-user process count limit (e.g. "ulimit -u").
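A quick way to check is to compare the limit in your submission shell with what a job actually sees on a compute node. Something along these lines should show the difference (just a sketch; I'm borrowing the "std" partition from your script below, and you may need additional srun options for your site):

   ulimit -u                                                        # limit in your submission shell on the login node
   srun --partition=std /bin/bash -c 'ulimit -u'                    # limit a job sees with the default propagation
   srun --partition=std --propagate=NONE /bin/bash -c 'ulimit -u'   # the node's own limit, nothing propagated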
If your Slurm config doesn't limit how many jobs a node can execute concurrently (e.g. via OverSubscribe), then:

- no matter what, you have a race condition here (when/if the process limit is reached)
- the behavior is skewed toward happening more quickly/easily when your job actually lasts a non-trivial amount of time (e.g. by adding the usleep())

It's likely you have stringent limits on your head/login node that are getting propagated to the compute environment (see PropagateResourceLimits in the slurm.conf documentation). By default Slurm propagates all of the ulimits set in your submission shell. E.g.:

[frey@login00 ~]$ srun ... --propagate=NONE /bin/bash
[frey@login00 ~]$ hostname
r00n56.localdomain.hpc.udel.edu
[frey@login00 ~]$ ulimit -u
4096
[frey@login00 ~]$ exit
   :
[frey@login00 ~]$ ulimit -u
24
[frey@login00 ~]$ srun ... --propagate=ALL /bin/bash
[frey@login00 ~]$ hostname
r00n49.localdomain.hpc.udel.edu
[frey@login00 ~]$ ulimit -u
24
[frey@login00 ~]$ exit


> On Jan 11, 2019, at 4:51 AM, Jean-mathieu CHANTREIN <jean-mathieu.chantr...@univ-angers.fr> wrote:
>
> Hello,
>
> I'm new to Slurm (I used SGE before) and I'm new to this list. I have some difficulties with the use of Slurm's array jobs; maybe you can help me?
>
> I am working with Slurm version 17.11.7 on Debian testing. I use slurmdbd and fairshare.
>
> For my current user, I have the following limits:
> Fairshare = 99
> MaxJobs = 50
> MaxSubmitJobs = 100
>
> I wrote a little C++ hello_world program to do some tests, and a 100-job hello_world array job works properly.
> If I take the same program but add a usleep of 10 seconds (to watch the behavior with squeue and simulate a slightly longer program), part of my jobs fail (FAILED) with error 126:0 (output of sacct -j) and WEXITSTATUS 254 (in the slurm log). The proportion of jobs that fail varies between executions. Here is the error output of one of these jobs:
>
> $ cat ERR/11617-9
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable
>
> Note that I have enough resources to run more than 50 jobs at the same time ...
>
> If I restart my submission script forcing Slurm to execute only 10 jobs at a time (--array=1-100%10), all jobs succeed. But if I force Slurm to execute only 30 jobs at a time (--array=1-100%30), part of them fail again.
>
> Has anyone ever faced this type of problem? If so, please kindly enlighten me.
>
> Regards
>
> Jean-Mathieu Chantrein
> In charge of the LERIA computing center
> University of Angers
>
> __________________
> hello_array.slurm
>
> #!/bin/bash
> # hello.slurm
> #SBATCH --job-name=hello
> #SBATCH --output=OUT/%A-%a
> #SBATCH --error=ERR/%A-%a
> #SBATCH --partition=std
> #SBATCH --array=1-100%10
> ./hello $SLURM_ARRAY_TASK_ID
>
> ________________
> main.cpp
>
> #include <iostream>
> #include <unistd.h>
>
> int main(int argc, char** argv) {
>     usleep(10000000);
>     std::cout << "Hello world! job array number " << argv[1] << std::endl;
>     return 0;
> }
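P.S. If the propagated limit does turn out to be the culprit, a quick experiment on your side (just a sketch; it assumes your Slurm version honors --propagate in sbatch, which overrides the PropagateResourceLimits setting) would be to resubmit the array with propagation disabled and the observed limit logged:

   #!/bin/bash
   # hello_array.slurm, modified for the test
   #SBATCH --job-name=hello
   #SBATCH --output=OUT/%A-%a
   #SBATCH --error=ERR/%A-%a
   #SBATCH --partition=std
   #SBATCH --array=1-100%30        # a concurrency that failed before
   #SBATCH --propagate=NONE        # don't copy the login node's ulimits into the job
   ulimit -u                       # record the limit the job actually runs under
   ./hello $SLURM_ARRAY_TASK_ID

The longer-term fix would be for your admins to set PropagateResourceLimits=NONE (or PropagateResourceLimitsExcept=NPROC) in slurm.conf, or simply to raise the nproc limit on the submission host.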
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::