Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello, "Groner, Rob" writes: > A quick test to see if it's a configuration error is to set > config_overrides in your slurm.conf and see if the node then responds > to scontrol update. Thanks to all who helped. It turned out that memory was the issue. I have now reseated the RAM in the offend

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Groner, Rob
Brian Andrus writes:
> That output of slurmd -C is your answer. Slurmd only sees 6GB of memory
> and you are claiming it has 10GB. I would run some memtests, look at
> meminfo on the node, etc. Maybe even check that the type/size of memory
> in there is what you think it is. ...

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Brian Andrus
That output of slurmd -C is your answer. Slurmd only sees 6GB of memory and you are claiming it has 10GB. I would run some memtests, look at meminfo on the node, etc. Maybe even check that the type/size of memory in there is what you think it is.

Brian Andrus

On 5/25/2023 7:30 AM, Roger Mason wrote: ...
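The comparison Brian describes can be made explicit. A sketch, assuming node012 and a FreeBSD ports config path (adjust to your install):

    # on node012: what slurmd actually detects at startup
    slurmd -C

    # on the controller: what the configuration claims for the node
    grep -i node012 /usr/local/etc/slurm.conf
    scontrol show node node012 | grep RealMemory

Slurm drains a node whenever a detected value (here RealMemory) falls below the configured one, which is exactly the "Low RealMemory" reason seen later in the thread.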

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Ole Holm Nielsen writes:
> 1. Is slurmd running on the node?

Yes.

> 2. What's the output of "slurmd -C" on the node?

NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097

> 3. Define State=UP in slurm.conf instead of UNKNOWN

Will do.

> 4. Why h...
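With slurmd detecting RealMemory=6097 while the config claims 10193 (see the scontrol output below), one way to reconcile them until the hardware is sorted out is to lower the configured value. A sketch of the slurm.conf node line, built from the detected values above (the exact line in Roger's config may differ):

    NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=6000 State=UP

Setting RealMemory slightly below the detected 6097 leaves headroom so the node is not re-drained if the detected figure fluctuates; follow with scontrol reconfigure.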

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

Davide DelVento writes:
> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?

This is what top shows:

last pid: 45688;  load averages: 0.00, 0.00, 0.00 ...
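top's header only goes so far; since this cluster runs FreeBSD (per the start of the thread), the installed memory can also be read straight from the kernel. A sketch, assuming stock FreeBSD tools:

    # total physical memory as the kernel sees it, in bytes
    sysctl hw.physmem hw.realmem

    # the boot-time "real memory" / "avail memory" lines
    dmesg | grep -i memory

If these report roughly 6GB on a machine that should have 10GB, a DIMM has likely dropped out, which matches the reseating fix reported at the top of the thread.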

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen
On 5/25/23 15:23, Roger Mason wrote:

NodeName=node012 CoresPerSocket=2
   CPUAlloc=0 CPUTot=4 CPULoad=N/A
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node012 NodeHostName=node012
   RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   State=UNKN...
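Placed next to the slurmd -C output shown earlier in the thread, the mismatch is plain (a summary of numbers already quoted, not new data):

    RealMemory configured (slurm.conf / scontrol) : 10193
    RealMemory detected   (slurmd -C)             : 6097

CPULoad=N/A and FreeMem=N/A in the same output also suggest the controller has no fresh report from slurmd, which is worth checking before trusting either number.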

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Davide DelVento
Can you ssh into the node and check the actual availability of memory? Maybe there is a zombie process (or a healthy one with a memory leak bug) that's hogging all the memory?

On Thu, May 25, 2023 at 7:31 AM Roger Mason wrote:
> Hello,
>
> Doug Meyer writes:
>
> > Could also review the node log ...
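A sketch of the check Davide suggests, assuming the node name from this thread and FreeBSD userland (the ps sorting differs from Linux habits):

    ssh node012
    top -o res                        # interactive: processes sorted by resident memory
    ps aux | sort -rnk 4 | head       # top %MEM consumers, one-shot

A box with all its memory present but consumed by a leaking process would still look healthy to slurmd -C (which reports physical memory, not free memory), so this check mainly rules that scenario in or out.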

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Hello,

Doug Meyer writes:
> Could also review the node log in /var/log/slurm/ . Often sinfo -lR will
> tell you the cause, for example mem not matching the config.

REASON               USER        TIMESTAMP            STATE    NODELIST
Low RealMemory       slurm(468)  2023-05-25T09:26:59  drained ...
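Once the configured RealMemory matches what the node really has, the drain flag still has to be cleared by hand. A sketch, assuming node012 (recent Slurm also accepts state=undrain; state=resume works on 20.02):

    sinfo -R --nodes=node012                        # confirm the recorded reason
    scontrol update nodename=node012 state=resume

If the reason ("Low RealMemory") immediately reappears, the config and the detected hardware still disagree and the node will keep draining.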

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Roger Mason
Ole Holm Nielsen writes:
> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!

Yes. It is what is available in ports.

> What's the output of "scontrol show node node012"?

NodeName=node012 CoresPerSocket=2
   CPUAlloc=0 CPUTot=4 CPULoad=N/A
   AvailableFeatures=(null) ...

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Doug Meyer
Could also review the node log in /var/log/slurm/ . Often sinfo -lR will tell you the cause, for example mem not matching the config.

Doug

On Thu, May 25, 2023 at 5:32 AM Ole Holm Nielsen wrote:
> On 5/25/23 13:59, Roger Mason wrote:
> > slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
>
> > I hav...
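Where that node log actually lives depends on SlurmdLogFile in slurm.conf, so a sketch of the review Doug suggests (the log path here is an assumption; query the config for the real one):

    scontrol show config | grep -i SlurmdLogFile    # find the real log path
    grep -i -e error -e memory /var/log/slurm/slurmd.log
    sinfo -lR --nodes=node012                       # drain reason, long format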

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen
On 5/25/23 13:59, Roger Mason wrote:
> slurm 20.02.7 on FreeBSD.

Uh, that's old!

> I have a couple of nodes stuck in the drain state. I have tried
>
> scontrol update nodename=node012 state=down reason="stuck in drain state"
> scontrol update nodename=node012 state=resume
>
> without success. I then tr...
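The down/resume sequence quoted above is the right shape, but it only sticks if the underlying reason is gone; the controller re-drains a node whose detected RealMemory is below the configured value. A sketch of checking whether the drain came back, assuming node012:

    scontrol update nodename=node012 state=resume
    scontrol show node node012 | grep -E 'State|Reason'

A recurring Reason line (here "Low RealMemory") means the fix has to happen in slurm.conf or in the hardware, as the rest of the thread works out.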