On 11/12/19 8:05 am, Chris Woelkers - NOAA Federal wrote:
Partial progress. The scientist who developed the model took a look at the
output and found that, instead of one model run being run in parallel, srun
had launched multiple instances of the model, one per thread, which for this
test was 110 threads.
I have a feeling this just verified the same thing that
I tried a simple thing of swapping out mpirun in the sbatch script for
srun. Nothing more, nothing less.
The model is now working on at least two nodes. I will have to test again
on more nodes, but this is progress.
Thanks,
Chris Woelkers
IT Specialist
National Oceanic and Atmospheric Administration
Great Lakes
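The one-instance-per-thread symptom described above is consistent with a bare srun inheriting the task count from the allocation. A minimal sketch of the difference, as a hypothetical sbatch script (the node/task counts, the executable name, and a PMI2-enabled MVAPICH2 build are assumptions, not details from the thread):

```shell
#!/bin/bash
# Hypothetical batch script; model_exe and the counts are placeholders.
#SBATCH --nodes=2
#SBATCH --ntasks=110        # 110 tasks across the allocation

# With mpirun, the MPI library launches the ranks itself and they
# cooperate as one parallel job:
#   mpirun ./model_exe

# A bare srun inherits --ntasks from the allocation, so it starts
# 110 *independent* copies of the executable, one per task, which
# matches the "one instance per thread" symptom:
#   srun ./model_exe

# If MVAPICH2 was built with Slurm PMI2 support, telling srun which
# PMI to use lets the launched tasks form a single MPI job:
srun --mpi=pmi2 ./model_exe
```

This is a cluster-dependent fragment, so treat it as a sketch rather than a drop-in script.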
Thanks all for the ideas and possibilities. I will answer all in turn.
Paul: Neither of the switches in use, Ethernet and InfiniBand, has any
form of broadcast storm protection enabled.
Chris: I have passed on your question to the scientist that created
the sbatch script. I will also look into o
I had a similar issue; please check whether the home drive, or wherever
the data should be stored, is mounted on the nodes.
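One quick way to make that check across the allocation; the 16-node count, the node names, and /home as the shared path are placeholders, not details from the thread:

```shell
# Run one task per node and report whether the path is mounted there.
srun --nodes=16 --ntasks-per-node=1 findmnt /home

# Or, if pdsh is available outside of Slurm (node range is a placeholder):
pdsh -w node[01-16] 'findmnt /home || echo MISSING'
```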
On Tue, 2019-12-10 at 14:49 -0500, Chris Woelkers - NOAA Federal wrote:
> I have a 16 node HPC that is in the process of being upgraded from
> CentOS 6 to 7. All nodes are diskless
December 11, 2019 01:11
To: Slurm User Community List
Subject: Re: [slurm-users] Multi-node job failure
Hi Chris,
On Tuesday, 10 December 2019 11:49:44 AM PST Chris Woelkers - NOAA Federal
wrote:
> Test jobs, submitted via sbatch, are able to run on one node with no problem
> but will not run on multiple nodes. The jobs are using mpirun and mvapich2
> is installed.
Is there a reason why you aren'
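Whether srun can launch the MPI ranks directly depends on which PMI plugins the local Slurm build supports; a quick check (a sketch, not a step taken in the thread):

```shell
# Show the MPI plugin types this Slurm installation supports.
# MVAPICH2 built against PMI2 is typically launched with --mpi=pmi2.
srun --mpi=list
```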
Hi Chris,
Your issue sounds similar to a case I ran into once, where I could run jobs
on a few nodes, but once it spanned more than a handful it would fail. In
that particular case, we figured out that it was due to broadcast storm
protection being enabled on the cluster switch. When the first n
Thanks for the reply and the things to try. Here are the answers to your
questions/tests in order:
- I tried mpiexec and the same issue occurred.
- While the job is listed as running I checked all the nodes. None of them
have processes spawned. I have no idea on the hydra process.
- I have version
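The "no processes spawned" check above can be scripted; a sketch assuming pdsh is installed, with JOBID, the node range, and the process-name pattern as placeholders:

```shell
# Find which nodes the (apparently running) job was allocated.
scontrol show job JOBID | grep -io 'nodelist=[^ ]*'

# Then look on those nodes for the model and the mpiexec "hydra"
# process launcher:
pdsh -w node[01-16] 'pgrep -af "hydra|model" || echo none'
```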
We're running multiple clusters using Bright 8.x with Scientific Linux 7 (and
have run Scientific Linux releases 5 and 6 with Bright 5.0 and higher in the
past without issues on many different pieces of hardware) and never experienced
this. But some things to test:
- some implementations pref