What does ulimit tell you on the compute node(s) where the jobs are running? The error message you cited arises when a user has reached the per-user process count limit (e.g. "ulimit -u").
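A quick way to check is to compare the limit in your submission shell with what a job actually sees on a compute node. Something along these lines should show the difference (just a sketch; I'm borrowing the "std" partition from your script below, and you may need additional srun options for your site):

   ulimit -u                                                        # limit in your submission shell on the login node
   srun --partition=std /bin/bash -c 'ulimit -u'                    # limit a job sees with the default propagation
   srun --partition=std --propagate=NONE /bin/bash -c 'ulimit -u'   # the node's own limit, nothing propagated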
If your Slurm config doesn't limit how many jobs a node can execute concurrently (e.g. via OverSubscribe), then:

- no matter what, you have a race condition here (when/if the process limit is reached)
- the behavior is skewed toward happening more quickly/easily when your job actually lasts a non-trivial amount of time (e.g. by adding the usleep())

It's likely you have stringent limits on your head/login node that are getting propagated to the compute environment (see PropagateResourceLimits in the slurm.conf documentation). By default Slurm propagates all of the ulimits set in your submission shell. E.g.:

[frey@login00 ~]$ srun ... --propagate=NONE /bin/bash
[frey@login00 ~]$ hostname
r00n56.localdomain.hpc.udel.edu
[frey@login00 ~]$ ulimit -u
4096
[frey@login00 ~]$ exit
   :
[frey@login00 ~]$ ulimit -u
24
[frey@login00 ~]$ srun ... --propagate=ALL /bin/bash
[frey@login00 ~]$ hostname
r00n49.localdomain.hpc.udel.edu
[frey@login00 ~]$ ulimit -u
24
[frey@login00 ~]$ exit


> On Jan 11, 2019, at 4:51 AM, Jean-mathieu CHANTREIN <jean-mathieu.chantr...@univ-angers.fr> wrote:
>
> Hello,
>
> I'm new to Slurm (I used SGE before) and I'm new to this list. I have some difficulties with the use of Slurm's array jobs; maybe you can help me?
>
> I am working with Slurm version 17.11.7 on Debian testing. I use slurmdbd and fairshare.
>
> For my current user, I have the following limits:
> Fairshare = 99
> MaxJobs = 50
> MaxSubmitJobs = 100
>
> I wrote a little C++ hello_world program to do some tests, and a 100-job hello_world array job works properly.
> If I take the same program but add a usleep of 10 seconds (to watch the behavior with squeue and simulate a slightly longer program), part of my jobs fail (FAILED) with error 126:0 (output of sacct -j) and WEXITSTATUS 254 (in the slurm log). The proportion of jobs that fail varies between executions. Here is the error output of one of these jobs:
>
> $ cat ERR/11617-9
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable
>
> Note that I have enough resources to run more than 50 jobs at the same time ...
>
> If I restart my submission script forcing Slurm to execute only 10 jobs at a time (--array=1-100%10), all jobs succeed. But if I force Slurm to execute only 30 jobs at a time (--array=1-100%30), part of them fail again.
>
> Has anyone ever faced this type of problem? If so, please kindly enlighten me.
>
> Regards
>
> Jean-Mathieu Chantrein
> In charge of the LERIA computing center
> University of Angers
>
> __________________
> hello_array.slurm
>
> #!/bin/bash
> # hello.slurm
> #SBATCH --job-name=hello
> #SBATCH --output=OUT/%A-%a
> #SBATCH --error=ERR/%A-%a
> #SBATCH --partition=std
> #SBATCH --array=1-100%10
> ./hello $SLURM_ARRAY_TASK_ID
>
> ________________
> main.cpp
>
> #include <iostream>
> #include <unistd.h>
>
> int main(int argc, char** argv) {
>     usleep(10000000);
>     std::cout << "Hello world! job array number " << argv[1] << std::endl;
>     return 0;
> }
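P.S. If the propagated limit does turn out to be the culprit, a quick experiment on your side (just a sketch; it assumes your Slurm version honors --propagate in sbatch, which overrides the PropagateResourceLimits setting) would be to resubmit the array with propagation disabled and the observed limit logged:

   #!/bin/bash
   # hello_array.slurm, modified for the test
   #SBATCH --job-name=hello
   #SBATCH --output=OUT/%A-%a
   #SBATCH --error=ERR/%A-%a
   #SBATCH --partition=std
   #SBATCH --array=1-100%30        # a concurrency that failed before
   #SBATCH --propagate=NONE        # don't copy the login node's ulimits into the job
   ulimit -u                       # record the limit the job actually runs under
   ./hello $SLURM_ARRAY_TASK_ID

The longer-term fix would be for your admins to set PropagateResourceLimits=NONE (or PropagateResourceLimitsExcept=NPROC) in slurm.conf, or simply to raise the nproc limit on the submission host.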
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::