After you restart slurmctld, do "scontrol reconfigure".
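For example, a minimal sketch of the sequence (assuming systemd-managed daemons and that the nodes are still drained with the "Low RealMemory" reason shown later in this thread):

systemctl restart slurmctld                             # on the head node
scontrol reconfigure                                    # tell running daemons to re-read slurm.conf
scontrol update NodeName=node[001-003] State=RESUME     # clear the DRAIN state once the config is consistent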
Brian Andrus
On 8/30/2019 6:57 AM, Robert Kudyba wrote:
I had set RealMemory to a really high number because I misinterpreted the recommendation.
NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
But now I set it to:
RealMemory=191000
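(For what it's worth, a quick way to see the value slurmd itself detects is to run slurmd -C on one of the compute nodes; the output below is illustrative only, not taken from these nodes:)

slurmd -C
# NodeName=node001 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191840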
I restarted slurmctld. And according to the Bright Cluster support team:
"Unless it has been overridden in the image, the nodes will have a
symlink directly to the slurm.conf on the head node. This means that
any changes made to the file on the head node will automatically be
available to the compute nodes. All they would need in that case is to
have slurmd restarted"
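A quick sanity check of that claim might look like this (a sketch, assuming pdsh and systemd-managed slurmd, as used elsewhere in this thread):

pdsh -w node00[1-3] "ls -l /etc/slurm/slurm.conf"    # confirm the symlink is what Bright describes
pdsh -w node00[1-3] "systemctl restart slurmd"       # restart slurmd on the compute nodes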
But now I see these errors:
mcs: MCSParameters = (null). ondemand set.
[2019-08-30T09:22:41.700] error: Node node001 appears to have a
different slurm.conf than the slurmctld. This could cause issues with
communication and functionality. Please review both files and make
sure they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-08-30T09:22:41.700] error: Node node002 appears to have a
different slurm.conf than the slurmctld. This could cause issues with
communication and functionality. Please review both files and make
sure they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
[2019-08-30T09:22:41.701] error: Node node003 appears to have a
different slurm.conf than the slurmctld. This could cause issues with
communication and functionality. Please review both files and make
sure they are the same. If this is expected ignore, and set
DebugFlags=NO_CONF_HASH in your slurm.conf.
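One way to check whether the files genuinely differ, rather than silencing the warning with NO_CONF_HASH, is to compare checksums (a sketch, assuming pdsh):

md5sum /etc/slurm/slurm.conf                         # on the head node
pdsh -w node00[1-3] "md5sum /etc/slurm/slurm.conf"   # on the compute nodes; all sums should match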
[2019-08-30T09:23:16.347] update_node: node node001 state set to IDLE
[2019-08-30T09:23:16.347] got (nil)
[2019-08-30T09:23:16.766]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2019-08-30T09:23:19.082] update_node: node node002 state set to IDLE
[2019-08-30T09:23:19.082] got (nil)
[2019-08-30T09:23:20.929] update_node: node node003 state set to IDLE
[2019-08-30T09:23:20.929] got (nil)
[2019-08-30T09:45:46.314] _slurm_rpc_submit_batch_job: JobId=449
InitPrio=4294901759 usec=355
[2019-08-30T09:45:46.430] sched: Allocate JobID=449
NodeList=node[001-003] #CPUs=30 Partition=defq
[2019-08-30T09:45:46.670] prolog_running_decr: Configuration for
JobID=449 is complete
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x1 NodeCnt=3
WEXITSTATUS 127
[2019-08-30T09:45:46.772] _job_complete: JobID=449 State=0x8005
NodeCnt=3 done
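An exit status of 127 from a shell normally means "command not found", so one possible check is whether mpirun and hello resolve on the nodes (a sketch; assumes the module command is set up by a login shell):

pdsh -w node00[1-3] 'bash -lc "module add shared openmpi/gcc/64/1.10.7 slurm; which mpirun hello"'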
Is this another option that needs to be set?
On Thu, Aug 29, 2019 at 3:27 PM Alex Chekholko <a...@calicolabs.com
<mailto:a...@calicolabs.com>> wrote:
Sounds like maybe you didn't correctly roll out / update your
slurm.conf everywhere, since your RealMemory value is back to the
large, wrong number. You need to update your slurm.conf everywhere
and restart all the Slurm daemons.
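Roughly, a sketch of that (assuming the file is not already symlinked from the head node, and that pdcp/pdsh and systemd are available):

pdcp -w node00[1-3] /etc/slurm/slurm.conf /etc/slurm/slurm.conf   # copy the updated file to the nodes
pdsh -w node00[1-3] "systemctl restart slurmd"                    # restart slurmd on every node
systemctl restart slurmctld                                       # restart slurmctld on the head node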
I recommend the "safe procedure" from here:
https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes
Your Bright manual may have a similar process for updating SLURM
config "the Bright way".
On Thu, Aug 29, 2019 at 12:20 PM Robert Kudyba
<rkud...@fordham.edu <mailto:rkud...@fordham.edu>> wrote:
I thought I had taken care of this a while back, but it appears
the issue has returned. A very simple sbatch script, slurmhello.sh:
cat slurmhello.sh
#!/bin/sh
#SBATCH -o my.stdout
#SBATCH -N 3
#SBATCH --ntasks=16
module add shared openmpi/gcc/64/1.10.7 slurm
mpirun hello
sbatch slurmhello.sh
Submitted batch job 419
squeue
JOBID PARTITION     NAME USER ST  TIME NODES NODELIST(REASON)
  419      defq slurmhel root PD  0:00     3 (Resources)
In /etc/slurm/slurm.conf:
# Nodes
NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092
Sockets=2 Gres=gpu:1
Logs show:
[2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
node=node001: Invalid argument
[2019-08-29T14:24:40.025] error: Node node002 has low
real_memory size (191840 < 196489092)
[2019-08-29T14:24:40.025] error: _slurm_rpc_node_registration
node=node002: Invalid argument
[2019-08-29T14:24:40.026] error: Node node003 has low
real_memory size (191840 < 196489092)
[2019-08-29T14:24:40.026] error: _slurm_rpc_node_registration
node=node003: Invalid argument
scontrol show jobid -dd 419
JobId=419 JobName=slurmhello.sh
UserId=root(0) GroupId=root(0) MCS_label=N/A
Priority=4294901759 Nice=0 Account=root QOS=normal
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2019-08-28T09:54:22 EligibleTime=2019-08-28T09:54:22
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-08-28T09:57:22
Partition=defq AllocNode:Sid=ourcluster:194152
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3-3 NumCPUs=16 NumTasks=16 CPUs/Task=1
ReqB:S:C:T=0:0:*:*
TRES=cpu=16,node=3
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
Command=/root/slurmhello.sh
WorkDir=/root
StdErr=/root/my.stdout
StdIn=/dev/null
StdOut=/root/my.stdout
Power=
scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.06
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:1
NodeAddr=node001 NodeHostName=node001 Version=17.11
OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9
18:05:47 UTC 2018
RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2
Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1
Owner=N/A MCS_label=N/A
Partitions=defq
BootTime=2019-07-18T12:08:41
SlurmdStartTime=2019-07-18T12:09:44
CfgTRES=cpu=24,mem=196489092M,billing=24
AllocTRES=
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
[root@ciscluster ~]# scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=99923 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=180969 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=178999 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
sinfo -R
REASON USER TIMESTAMP NODELIST
Low RealMemory slurm 2019-07-18T10:17:24 node[001-003]
sinfo -N
NODELIST NODES PARTITION STATE
node001 1 defq* drain
node002 1 defq* drain
node003 1 defq* drain
pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node002: Thread(s) per core: 1
node002: Core(s) per socket: 12
node002: Socket(s): 2
node001: Thread(s) per core: 1
node001: Core(s) per socket: 12
node001: Socket(s): 2
node003: Thread(s) per core: 2
node003: Core(s) per socket: 12
node003: Socket(s): 2
scontrol show nodes| grep -i mem
RealMemory=196489092 AllocMem=0 FreeMem=100054 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=181101 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory [slurm@2019-07-18T10:17:24]
RealMemory=196489092 AllocMem=0 FreeMem=179004 Sockets=2
Boards=1
CfgTRES=cpu=24,mem=196489092M,billing=24
Reason=Low RealMemory
Does anything look off?