Hello,
"Groner, Rob" writes:
> A quick test to see if it's a configuration error is to set
> config_overrides in your slurm.conf and see if the node then responds
> to scontrol update.
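For reference, a minimal sketch of that setting; in recent Slurm releases it
is a SlurmdParameters flag (replacing the old FastSchedule=2):

SlurmdParameters=config_overrides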
Thanks to all who helped. It turned out that memory was the issue. I
have now reseated the RAM in the offending node.
Ole Holm Nielsen writes:
> 1. Is slurmd running on the node?
Yes.
> 2. What's the output of "slurmd -C" on the node?
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097
> 3. Define State=UP in slurm.conf instead of UNKNOWN
Will do.
> 4. Why h
Hello,
Davide DelVento writes:
> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?
This is what top shows:
last pid: 45688; load averages: 0.00, 0.00, 0.00
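On FreeBSD the memory situation can also be checked remotely from the head
node, e.g. (hostname illustrative):

ssh node012 'sysctl -n hw.physmem ; swapinfo -h'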
Hello,
Doug Meyer writes:
> Could also review the node log in /var/log/slurm/ . Often sinfo -lR will tell
> you the cause, for example memory not matching the config.
>
REASON USER TIMESTAMP STATE NODELIST
Low RealMemory slurm(468) 2023-05-25T09:26:59 drai
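A drain for "Low RealMemory" usually means the RealMemory value in slurm.conf
is larger than what slurmd actually detects. A sketch of the usual fix, using
the value "slurmd -C" reported above:

# in slurm.conf, match what slurmd -C prints on node012
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=6097
# push the change and clear the drain (node definition changes may need a
# slurmctld/slurmd restart rather than just a reconfigure)
scontrol reconfigure
scontrol update nodename=node012 state=resume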
Ole Holm Nielsen writes:
> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
Yes. It is what is available in ports.
> What's the output of "scontrol show node node012"?
NodeName=node012 CoresPerSocket
Hello,
slurm 20.02.7 on FreeBSD.
I have a couple of nodes stuck in the drain state. I have tried
scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume
without success.
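When a resume does not stick, checking the recorded drain reason usually
points at the cause, e.g.:

sinfo -R
scontrol show node node012 | grep -i reason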
I then tried
/usr/local/sbin/slurmctld -c
scontrol update
Gerhard Strangar writes:
> Run getent hosts node012 on all hosts to see which one can't resolve
> it.
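A quick loop makes that check easier to run across the cluster (hostnames
illustrative):

for h in node002 node012 ; do ssh $h 'hostname ; getent hosts node012' ; done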
Thank you, that located a problem with the hosts file on some nodes.
Fixed.
Best wishes,
Roger
Roger Mason writes:
> I have a small cluster of 4 nodes. I'm seeing jobs fail on two nodes
I forgot some information:
slurm 20.02.7 on FreeBSD 12.2.
New information:
Running this from the controller succeeds on both machines:
srun -w node[002,012] hostname
Hello,
I have a small cluster of 4 nodes. I'm seeing jobs fail on two nodes
with this written to slurm-*.out:
less 1x1x1_220524_121358/slurm-1368_1.out
srun: error: Unable to resolve "node012": Unknown server error
srun: error: fwd_tree_thread: can't find address for host node012, check slurm.conf
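(The root cause turned out to be the hosts file, as noted elsewhere in the
thread; each node needs an entry along these lines in /etc/hosts, address
illustrative:)

192.168.0.112   node012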
Hello,
"Mccall, Kurt E. (MSFC-EV41)" writes:
> MPICH uses the PMI 1 interface by default, but for our 20.02.3 Slurm
> installation, "srun --mpi=list" yields
>
> $ srun --mpi=list
> srun: MPI types are...
> srun: cray_shasta
> srun: pmi2
> srun: none
>
> PMI 2 is there, but no
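Since pmi2 is listed, requesting it explicitly is worth a try, provided the
MPI library was built with PMI2 support (program name is a placeholder):

srun --mpi=pmi2 -n 4 ./hello_mpi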
Hello,
Marcus Boden writes:
> the default time window starts at 00:00:00 of the current day:
> -S, --starttime
> Select jobs in any state after the specified time. Default
> is 00:00:00 of the current day, unless the '-s' or '-j'
> options are used. If the '
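So to pick up jobs from before today, pass -S (and optionally -E)
explicitly; the dates here are placeholders:

sacct -S 2020-05-01 -E now -X --format=JobID,JobName,Partition,State,Elapsed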
Hello,
Yesterday I instituted job accounting via mysql on my (FreeBSD 11.3)
test cluster. The cluster consists of a machine running
slurmctld+slurmdbd and two running slurmd (slurm version 20.02.1).
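For context, the pieces involved look roughly like this (hostnames and
credentials are placeholders):

# slurm.conf on the controller
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=ctlhost
# slurmdbd.conf
DbdHost=ctlhost
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=changeme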
After experiencing a slurmdbd core dump when using mysql-5.7.30
(reported on this list on May 5) I
Hello,
Michael Di Domenico writes:
> did you copy the mca parameters file to all the compute nodes as well?
>
No need: my home directory is shared between the submit machine & the
nodes.
Cheers,
Roger
Hello Gilles,
gil...@rist.or.jp writes:
> is the home directory mounted at the same place regardless this is a
> frontend or a compute node ?
One host serves as both a frontend and compute node and is used to PXE
boot the other compute nodes. On the frontend machine (192.168.0.100) I
have:
mo
Hello Paul,
Paul Edmon writes:
> So the recommendation I've gotten in the past is to use option number 4
> from this FAQ:
>
> https://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>
> This works for both mpirun and srun in slurm because it's a flat file
> that is read rather than options t
Hello,
I've run into a problem passing MCA parameters to openmpi2. This runs
fine on the command-line:
/usr/local/mpi/openmpi2/bin/mpirun --mca btl_tcp_if_include \
192.168.0.0/24 -np 10 -hostfile ~/ompi.hosts \
~/Software/Gulp/gulp-5.0/gulp.ompi example2
If I put the MCA parameters in ~/op
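For what it's worth, the per-user flat file Open MPI reads is conventionally
$HOME/.openmpi/mca-params.conf, so the command-line option above would
translate to a line like:

btl_tcp_if_include = 192.168.0.0/24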