Hello,
"Groner, Rob" writes:
> A quick test to see if it's a configuration error is to set
> config_overrides in your slurm.conf and see if the node then responds
> to scontrol update.
Thanks to all who helped. It turned out that memory was the issue. I
have now reseated the RAM in the offending node.
users@lists.schedmd.com
Subject: Re: [slurm-users] Nodes stuck in drain state
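For reference, the config_overrides suggestion above maps to a SlurmdParameters entry in slurm.conf (a sketch; verify the exact option name against the slurm.conf man page for your release, since 20.02 is in play in this thread):

```
# slurm.conf: let slurmd register even when the detected hardware does not
# match the node definition, so you can test whether a config/hardware
# mismatch is what is driving the node into drain.
SlurmdParameters=config_overrides
```

With this set, a node whose real memory falls short of its RealMemory claim should stop draining, which confirms (but does not fix) the mismatch.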
That output of slurmd -C is your answer.
Slurmd only sees 6GB of memory and you are claiming it has 10GB.
I would run some memtests, look at meminfo on the node, etc.
Maybe even check that the type/size of memory in there is what you think
it is.
Brian Andrus
On 5/25/2023 7:30 AM, Roger Mason wrote:
Ole Holm Nielsen writes:
> 1. Is slurmd running on the node?
Yes.
> 2. What's the output of "slurmd -C" on the node?
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2
ThreadsPerCore=1 RealMemory=6097
> 3. Define State=UP in slurm.conf instead of UNKNOWN
Will do.
> 4. Why h
Hello,
Davide DelVento writes:
> Can you ssh into the node and check the actual availability of memory?
> Maybe there is a zombie process (or a healthy one with a memory leak
> bug) that's hogging all the memory?
This is what top shows:
last pid: 45688; load averages: 0.00, 0.00, 0.00
On 5/25/23 15:23, Roger Mason wrote:
NodeName=node012 CoresPerSocket=2
CPUAlloc=0 CPUTot=4 CPULoad=N/A
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=node012 NodeHostName=node012
RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
State=UNKN
Can you ssh into the node and check the actual availability of memory?
Maybe there is a zombie process (or a healthy one with a memory leak bug)
that's hogging all the memory?
On Thu, May 25, 2023 at 7:31 AM Roger Mason wrote:
> Hello,
>
> Doug Meyer writes:
>
> > Could also review the node log
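A quick way to look for the memory hog Davide describes (a sketch; `ps aux` output is close enough between FreeBSD and GNU ps that %MEM is column 4 in both):

```shell
# List the top memory consumers; a zombie or a leaking process should
# surface at the head of this list. Column 4 of "ps aux" is %MEM.
ps aux | sort -rnk 4 | head -n 5
```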
Hello,
Doug Meyer writes:
> Could also review the node log in /var/log/slurm/ . Often sinfo -lR will tell
> you the cause, for example mem not matching the config.
>
REASON          USER        TIMESTAMP            STATE  NODELIST
Low RealMemory  slurm(468)  2023-05-25T09:26:59  drai
Ole Holm Nielsen writes:
> On 5/25/23 13:59, Roger Mason wrote:
>> slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
Yes. It is what is available in ports.
> What's the output of "scontrol show node node012"?
NodeName=node012 CoresPerSocket=2
CPUAlloc=0 CPUTot=4 CPULoad=N/A
AvailableFeat
Could also review the node log in /var/log/slurm/ . Often sinfo -lR will
tell you the cause, for example mem not matching the config.
Doug
On Thu, May 25, 2023 at 5:32 AM Ole Holm Nielsen wrote:
> On 5/25/23 13:59, Roger Mason wrote:
> > slurm 20.02.7 on FreeBSD.
>
> Uh, that's old!
>
> > I hav
On 5/25/23 13:59, Roger Mason wrote:
slurm 20.02.7 on FreeBSD.
Uh, that's old!
I have a couple of nodes stuck in the drain state. I have tried
scontrol update nodename=node012 state=down reason="stuck in drain state"
scontrol update nodename=node012 state=resume
without success.
I then tr
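Given where the thread ends up (slurm.conf claiming RealMemory=10193 while slurmd -C reports 6097), the usual resolution is a slurm.conf change along these lines (a sketch; node012's full definition is assumed from the slurmd -C output quoted earlier):

```
# slurm.conf: set RealMemory to at most what "slurmd -C" reports on the
# node (6097 MB here), not the 10193 MB currently claimed.
NodeName=node012 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=6097
# Then push the change and clear the drain:
#   scontrol reconfigure
#   scontrol update nodename=node012 state=resume
```

The reverse also works, as it did for Roger: fix the hardware (reseat or replace the RAM) so the node really does have the configured memory, and then resume it.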