Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello, "Groner, Rob" writes: > A quick test to see if it's a configuration error is to set > config_overrides in your slurm.conf and see if the node then responds > to scontrol update. Thanks to all who helped. It turned out that memory was the issue. I have now reseated the RAM in the offend

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Ole Holm Nielsen writes:
> 1. Is slurmd running on the node?

Yes.

> 2. What's the output of "slurmd -C" on the node?

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097

> 3. Define State=UP in slurm.conf instead of UNKNOWN

Will do.

> 4. Why h…
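
A node definition matching the "slurmd -C" output above would look
roughly like this in slurm.conf (a sketch; only values quoted in the
thread are used):

    NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=6097 State=UP

Keeping RealMemory at or slightly below the value slurmd -C reports
avoids the "Low RealMemory" drain reason seen later in this thread.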

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

Davide DelVento writes:
> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?

This is what top shows:

last pid: 45688;  load averages: 0.00, 0.00, 0.00…
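
A non-interactive way to run that check (a sketch assuming FreeBSD's top
and that node012 is reachable over ssh):

    # one batch-mode snapshot, sorted by resident memory size
    ssh node012 top -b -o res | head -n 20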

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

Doug Meyer writes:
> Could also review the node log in /var/log/slurm/. Often sinfo -lR will
> tell you the cause, for example mem not matching the config.

REASON               USER        TIMESTAMP            STATE  NODELIST
Low RealMemory       slurm(468)  2023-05-25T09:26:59  drai…
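
When sinfo -lR reports "Low RealMemory", the usual follow-up is to
compare the configured value with what the node actually detects (a
sketch, using the node name and install paths from these threads):

    scontrol show node node012 | grep RealMemory   # configured value
    ssh node012 /usr/local/sbin/slurmd -C          # detected value
    # if they disagree, lower RealMemory in slurm.conf, then:
    scontrol reconfigure
    scontrol update nodename=node012 state=resume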

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Ole Holm Nielsen writes:
> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!

Yes. It is what is available in ports.

> What's the output of "scontrol show node node012"?

NodeName=node012 CoresPerSocket…

[slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

slurm 20.02.7 on FreeBSD.

I have a couple of nodes stuck in the drain state. I have tried

scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume

without success. I then tried

/usr/local/sbin/slurmctld -c
scontrol update…
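
For anyone hitting the same symptom: besides the down/resume pair above,
scontrol also accepts an UNDRAIN state that clears the drain flag without
cycling the node through DOWN (a sketch, same node name as above):

    scontrol update nodename=node012 state=undrain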

Re: [slurm-users] Jobs fail on specific nodes.

2022-05-25 Thread Roger Mason
Gerhard Strangar writes:
> Run getent hosts node012 on all hosts to see which one can't resolve
> it.

Thank you, that located a problem with the hosts file on some nodes.
Fixed.

Best wishes,
Roger
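
A loop version of that check for a small cluster (a sketch; the node
list is hypothetical):

    for h in node001 node002 node011 node012; do
        printf '%s: ' "$h"; ssh "$h" getent hosts node012
    done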

Re: [slurm-users] Jobs fail on specific nodes.

2022-05-25 Thread Roger Mason
Roger Mason writes:
> I have a small cluster of 4 nodes. I'm seeing jobs fail on two nodes

I forgot some information: slurm 20.02.7 on FreeBSD 12.2.

New information: Running this from the controller succeeds on both
machines:

srun -w node[002,012] hostname

[slurm-users] Jobs fail on specific nodes.

2022-05-24 Thread Roger Mason
Hello,

I have a small cluster of 4 nodes. I'm seeing jobs fail on two nodes with
this written to slurm-*.out:

less 1x1x1_220524_121358/slurm-1368_1.out
srun: error: Unable to resolve "node012": Unknown server error
srun: error: fwd_tree_thread: can't find address for host node012, check slurm.…
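
That "Unable to resolve" error usually means the node name is missing
from DNS or /etc/hosts on the host doing the forwarding. A minimal
/etc/hosts sketch (the address is hypothetical, though the 192.168.0.0/24
subnet matches the one mentioned elsewhere in these threads):

    192.168.0.112   node012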

Re: [slurm-users] Slurm and MPICH

2022-01-12 Thread Roger Mason
Hello, "Mccall, Kurt E. (MSFC-EV41)" writes: > MPICH uses the PMI 1 interface by default, but for our 20.02.3 Slurm > installation, “srun –mpi=list yields” > > > > $ srun --mpi=list > > srun: MPI types are... > > srun: cray_shasta > > srun: pmi2 > > srun: none > > > > PMI 2 is there, but no

Re: [slurm-users] sacct returns nothing after reboot

2020-05-13 Thread Roger Mason
Hello,

Marcus Boden writes:
> the default time window starts at 00:00:00 of the current day:
>
> -S, --starttime
>     Select jobs in any state after the specified time. Default
>     is 00:00:00 of the current day, unless the '-s' or '-j'
>     options are used. If the '…
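
So to see jobs from before midnight, the start time has to be given
explicitly (a sketch using the date from this thread):

    sacct -S 2020-05-12 -E now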

[slurm-users] sacct returns nothing after reboot

2020-05-12 Thread Roger Mason
Hello,

Yesterday I instituted job accounting via mysql on my (FreeBSD 11.3) test
cluster. The cluster consists of a machine running slurmctld+slurmdbd and
two running slurmd (slurm version 20.02.1). After experiencing a slurmdbd
core dump when using mysql-5.7.30 (reported on this list on May 5) I…
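
For context, accounting via slurmdbd is wired up with a few slurm.conf
lines like these (a minimal sketch; the host name is hypothetical):

    # slurm.conf
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=localhost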

Re: [slurm-users] slurm does not pass mca params toopenmpi?

2018-07-19 Thread Roger Mason
Hello,

Michael Di Domenico writes:
> did you copy the mca parameters file to all the compute nodes as well?

No need: my home directory is shared between the submit machine & the
nodes.

Cheers,
Roger

Re: [slurm-users] slurm does not pass mca params toopenmpi?

2018-07-19 Thread Roger Mason
Hello Gilles,

gil...@rist.or.jp writes:
> is the home directory mounted at the same place regardless of whether
> this is a frontend or a compute node ?

One host serves as both a frontend and compute node and is used to PXE
boot the other compute nodes. On the frontend machine (192.168.0.100) I
have: mo…

Re: [slurm-users] slurm does not pass mca params to openmpi?

2018-07-19 Thread Roger Mason
Hello Paul,

Paul Edmon writes:
> So the recommendation I've gotten in the past is to use option number 4
> from this FAQ:
>
> https://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>
> This works for both mpirun and srun in slurm because it's a flat file
> that is read rather than options t…
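
Option 4 from that FAQ is the per-user parameter file; with the value
used elsewhere in this thread it would look like this (a sketch):

    # ~/.openmpi/mca-params.conf
    btl_tcp_if_include = 192.168.0.0/24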

[slurm-users] slurm does not pass mca params to openmpi?

2018-07-19 Thread Roger Mason
Hello,

I've run into a problem passing MCA parameters to openmpi2. This runs
fine on the command-line:

/usr/local/mpi/openmpi2/bin/mpirun --mca btl_tcp_if_include \
    192.168.0.0/24 -np 10 -hostfile ~/ompi.hosts \
    ~/Software/Gulp/gulp-5.0/gulp.ompi example2

If I put the MCA parameters in ~/op…
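
An equivalent way to pass the same parameter without editing the command
line or a file is Open MPI's environment-variable convention, where
OMPI_MCA_<name> maps onto the --mca name shown above (a sketch):

    export OMPI_MCA_btl_tcp_if_include=192.168.0.0/24
    /usr/local/mpi/openmpi2/bin/mpirun -np 10 -hostfile ~/ompi.hosts \
        ~/Software/Gulp/gulp-5.0/gulp.ompi example2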