Don't you put any limitation on your master nodes? I'll answer my own question: I only have to set the PropagateResourceLimits variable in slurm.conf to NONE. This is not a problem, since I activate cgroups directly on each of the compute nodes.
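For the record, here is a minimal sketch of the change (assuming a standard slurm.conf layout; the active value can be checked with scontrol afterwards):

    # slurm.conf on the controller: do not copy the submit host's ulimits to jobs
    PropagateResourceLimits=NONE

    # after an "scontrol reconfigure" (or a daemon restart), verify the setting
    scontrol show config | grep PropagateResourceLimits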
Regards,
Jean-Mathieu

> From: "Jean-Mathieu Chantrein" <jean-mathieu.chantr...@univ-angers.fr>
> To: "Slurm User Community List" <slurm-users@lists.schedmd.com>
> Sent: Friday, January 11, 2019 15:55:35
> Subject: Re: [slurm-users] Array job execution trouble: some jobs in the array fail
>
> Hello Jeffrey,
>
> That's exactly it. Thank you very much; I would not have thought of that. I had actually put a limit of 20 on nproc in /etc/security/limits.conf to avoid potential misuse by some users. I had not imagined for one second that it could propagate to the compute nodes!
>
> Don't you put any limitation on your master nodes?
>
> In any case, your help is particularly useful to me. Thanks a lot again.
>
> Best regards,
> Jean-Mathieu
>
>> From: "Jeffrey Frey" <f...@udel.edu>
>> To: "Slurm User Community List" <slurm-users@lists.schedmd.com>
>> Sent: Friday, January 11, 2019 15:27:13
>> Subject: Re: [slurm-users] Array job execution trouble: some jobs in the array fail
>>
>> What does ulimit tell you on the compute node(s) where the jobs are running? The error message you cited arises when a user has reached the per-user process count limit (e.g. "ulimit -u"). If your Slurm config doesn't limit how many jobs a node can execute concurrently (e.g. oversubscribe), then:
>>
>>> - no matter what, you have a race condition here (when/if the process limit is reached)
>>> - the behavior is skewed toward happening more quickly/easily when your job actually lasts a non-trivial amount of time (e.g. by adding the usleep()).
>>
>> It's likely you have stringent limits on your head/login node that are getting propagated to the compute environment (see PropagateResourceLimits in the slurm.conf documentation). By default Slurm propagates all ulimits that are set in your submission shell.
>>
>> E.g.
>>
>>> [frey@login00 ~]$ srun ... --propagate=NONE /bin/bash
>>> [frey@login00 ~]$ hostname
>>> r00n56.localdomain.hpc.udel.edu
>>> [frey@login00 ~]$ ulimit -u
>>> 4096
>>> [frey@login00 ~]$ exit
>>> :
>>> [frey@login00 ~]$ ulimit -u 24
>>> [frey@login00 ~]$ srun ... --propagate=ALL /bin/bash
>>> [frey@login00 ~]$ hostname
>>> r00n49.localdomain.hpc.udel.edu
>>> [frey@login00 ~]$ ulimit -u
>>> 24
>>> [frey@login00 ~]$ exit
>>
>>> On Jan 11, 2019, at 4:51 AM, Jean-mathieu CHANTREIN <jean-mathieu.chantr...@univ-angers.fr> wrote:
>>>
>>> Hello,
>>>
>>> I'm new to Slurm (I used SGE before) and I'm new to this list. I have some difficulties with the use of Slurm's job arrays; maybe you can help me?
>>>
>>> I am working with Slurm version 17.11.7 on a Debian testing system. I use slurmdbd and fairshare.
>>>
>>> For my current user, I have the following limits:
>>> Fairshare = 99
>>> MaxJobs = 50
>>> MaxSubmitJobs = 100
>>>
>>> I wrote a little C++ hello_world program to do some tests, and a 100-job hello_world array job works properly.
>>>
>>> If I take the same program but add a usleep of 10 seconds (to watch the behavior with squeue and simulate a slightly longer program), part of the array fails (FAILED) with error 126:0 (output of sacct -j) and WEXITSTATUS 254 (in the slurm log). The proportion of jobs that fail varies between executions.
>>> Here is the error output of one of these jobs:
>>>
>>> $ cat ERR/11617-9
>>> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
>>> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
>>> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
>>> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
>>> /var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable
>>>
>>> Note that I have enough resources to run more than 50 jobs at the same time...
>>>
>>> If I resubmit while forcing Slurm to execute only 10 jobs at a time (--array=1-100%10), all jobs succeed. But if I force Slurm to execute 30 jobs at a time (--array=1-100%30), part of the array fails again.
>>>
>>> Has anyone ever faced this type of problem? If so, please kindly enlighten me.
>>>
>>> Regards,
>>> Jean-Mathieu Chantrein
>>> In charge of the LERIA computing center
>>> University of Angers
>>>
>>> __________________
>>> hello_array.slurm
>>>
>>> #!/bin/bash
>>> # hello.slurm
>>> #SBATCH --job-name=hello
>>> #SBATCH --output=OUT/%A-%a
>>> #SBATCH --error=ERR/%A-%a
>>> #SBATCH --partition=std
>>> #SBATCH --array=1-100%10
>>>
>>> ./hello $SLURM_ARRAY_TASK_ID
>>>
>>> ________________
>>> main.cpp
>>>
>>> #include <iostream>
>>> #include <unistd.h>
>>>
>>> int main(int argc, char** argv) {
>>>     usleep(10000000);
>>>     std::cout << "Hello world! job array number " << argv[1] << std::endl;
>>>     return 0;
>>> }
>>
>> ::::::::::::::::::::::::::::::::::::::::::::::::::::::
>> Jeffrey T. Frey, Ph.D.
>> Systems Programmer V / HPC Management
>> Network & Systems Services / College of Engineering
>> University of Delaware, Newark DE 19716
>> Office: (302) 831-6034  Mobile: (302) 419-4976
>> ::::::::::::::::::::::::::::::::::::::::::::::::::::::