[slurm-users] Unable to run sequential jobs simultaneously on the same node

2024-08-17 Thread Arko Roy via slurm-users
I want to run 50 sequential jobs (essentially 50 copies of the same code
with different input parameters) on a particular node. However, as soon as
one of the jobs starts executing, the other 49 jobs are killed immediately
with exit code 9. The jobs do not interact with each other and are strictly
independent. If the 50 jobs run on 50 different nodes, they all complete
successfully. Can anyone please help with possible fixes?
I found a discussion along similar lines at
https://groups.google.com/g/slurm-users/c/I1T6GWcLjt4
but it does not seem to reach a final solution.
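
For reference, one common way to pack many independent single-core runs onto
one node is a Slurm job array. A minimal sketch follows; it is not the script
used here, and it assumes, only for illustration, that the 50 copies live in
directories input1 ... input50 and that the target node is called node01:

#!/bin/bash
#SBATCH --partition=standard
#SBATCH --array=1-50            # one array task per copy of the code
#SBATCH --ntasks=1              # each task is a single serial process
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=10:00:00
#SBATCH --nodelist=node01       # hypothetical node name: pin every task to this node
#SBATCH --job-name=gstate

module load fftw-3.3.10-intel-2021.6.0-ppbepka   # as in the original script

# SLURM_ARRAY_TASK_ID runs from 1 to 50; the directory layout is an assumption
cd "$HOME/ferro-detun/input${SLURM_ARRAY_TASK_ID}" || exit 1
./a_1.out                       # run in the foreground, with no trailing '&'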

-- 
Arko Roy
Assistant Professor
School of Physical Sciences
Indian Institute of Technology Mandi
Kamand, Mandi
Himachal Pradesh - 175 005, India
Email: a...@iitmandi.ac.in
Web: https://faculty.iitmandi.ac.in/~arko/

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Arko Roy via slurm-users
Thanks Loris and Gareth. Here is the job submission script; if you find any
errors, please let me know.
Since I am not the admin but just a user, I don't think I have access to the
prolog and epilog files.

> If the jobs are independent, why do you want to run them all on the same
> node?

I am running sequential codes, essentially 50 copies of the same code with a
variation in one parameter. Since I am using the Slurm scheduler, the nodes
and cores are allocated depending upon the available resources. So there are
instances when 20 of the jobs go to 20 free cores located on a particular
node and the remaining 30 go to 30 free cores on another node. It turns out
that only 1 job out of the 20 and 1 job out of the 30 completes successfully
with exit code 0, and the rest are terminated with exit code 9.
For information, I run sjobexitmod -l jobid to check the exit codes.
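
For cross-checking, and assuming the cluster has job accounting enabled, sacct
reports the same information (the job id below is only a placeholder):

sacct -j 12345 --format=JobID,JobName,State,ExitCode

In sacct output the ExitCode column has the form return:signal, so a job that
was killed by a signal shows that signal after the colon.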

--
The submission script is as follows:



#!/bin/bash

# Setting slurm options



# lines starting with "#SBATCH" define your job's parameters
# requesting the type of node on which to run job
##SBATCH --partition 
#SBATCH --partition=standard

# telling slurm how many instances of this job to spawn (typically 1)

##SBATCH --ntasks 
##SBATCH --ntasks=1
#SBATCH --nodes=1
##SBATCH -N 1
##SBATCH --ntasks-per-node=1



# setting number of CPUs per task (1 for serial jobs)

##SBATCH --cpus-per-task 

##SBATCH --cpus-per-task=1

# setting memory requirements

##SBATCH --mem-per-cpu 
#SBATCH --mem-per-cpu=1G

# propagating max time for job to run

##SBATCH --time 
##SBATCH --time 
##SBATCH --time 
#SBATCH --time 10:0:0
#SBATCH --job-name gstate

#module load compiler/intel/2018_4
module load fftw-3.3.10-intel-2021.6.0-ppbepka
echo "Running on $(hostname)"
echo "We are in $(pwd)"



# run the program

/home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out &

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com


[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Arko Roy via slurm-users
Dear Loris,

I just tried removing the '&', but it didn't work.

On Mon, Aug 19, 2024 at 1:43 PM Loris Bennett wrote:

> Dear Arko,
>
> Arko Roy  writes:
>
> > [... text and submission script quoted from the previous message; the
> > script ends with the following line ...]
> > /home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out &
>
> You should not write
>
>   &
>
> at the end of the above command.  This will run your program in the
> background, which will cause the submit script to terminate, which in
> turn will terminate your job.
>
> Regards
>
> Loris
>
> --
> Dr. Loris Bennett (Herr/Mr)
> FUB-IT, Freie Universität Berlin
>
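
For reference, here is a minimal sketch of the two usual ways to handle the
trailing '&' that Loris points out (whether or not it turns out to be the
cause of the exit code 9 seen here); the path is the one from the original
script:

# Option 1: run the program in the foreground; the batch script then blocks
# until it finishes
/home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out

# Option 2: keep the background launch, but hold the batch script open until
# the program is done
/home/arkoroy.sps.iitmandi/ferro-detun/input1/a_1.out &
wait    # bash builtin: wait for all background child processes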

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com