Re: [slurm-users] temporary SLURM directories

2022-05-26 Thread Diego Zuccato

On 25/05/2022 14:42, Mark Dixon wrote:

> https://slurm.schedmd.com/faq.html#tmpfs_jobcontainer
> https://slurm.schedmd.com/job_container.conf.html
> I would be interested in hearing how well it works - it's so buried in
> the documentation that unfortunately I didn't see it until after I
> rolled a solution similar to Diego's

Well, I found it, but IIUC it only handles tmpfs (RAM-backed), while I 
needed to use actual disk space: the RAM is needed for the job :)


> (which can be extended such that TaskProlog sets the TMPDIR environment
> variable appropriately, and limit the disk space used by the job).

Still can't
export TMPDIR=...
from the TaskProlog script. Surely I'm missing something important. Maybe 
TaskProlog is called as a subshell? In that case it can't alter the caller's 
env... But IIUC someone made it work, and that confuses me...


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] temporary SLURM directories

2022-05-26 Thread Diego Zuccato

On 26/05/2022 11:48, Diego Zuccato wrote:


> Still can't
> export TMPDIR=...
> from TaskProlog script. Surely missing something important. Maybe
> TaskProlog is called as a subshell? In that case it can't alter caller's
> env... But IIUC someone made it work, and that confuses me...


It seems I finally managed to understand the TaskProlog script! It's more 
involved than I thought. :(


The script is run (on the first allocated node, IIUC) in a subshell (so 
a direct export can't work), and *its output* is processed in the job 
shell. Please correct me if I'm wrong.

That's why the FAQ https://slurm.schedmd.com/faq.html uses lines like
echo "print ..."

Changing my TaskProlog.sh from
export TMPDIR=...
to
echo "export TMPDIR=..."
fixed it. Now I'm much happier :)
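
For the record, a minimal sketch of what the TaskProlog.sh ends up doing 
(the scratch path below is just an example, not my real layout):

  #!/bin/bash
  # TaskProlog: slurmd runs this script and parses its *stdout*; a plain
  # "export" here only changes this subshell, so the assignment has to be
  # printed instead. Directory creation is assumed to happen elsewhere
  # (e.g. in the node Prolog).
  echo "export TMPDIR=/local/scratch/${SLURM_JOB_ID}"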


--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



[slurm-users] New slurm configuration - multiple jobs per host

2022-05-26 Thread Jake Jellinek
Hi

I am just building my first Slurm setup and have got everything running - well, 
almost.

I have a two-node configuration. All of my setup exists on a single Hyper-V 
server, and I have divided up the resources to create my VMs.

One node I will use for heavy duty work; this is called compute001
One node I will use for normal work; this is called compute002

My compute node specification in slurm.conf is
NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
NodeName=compute001 CPUs=32
NodeName=compute002 CPUs=2

The partition specification is
PartitionName=DEFAULT State=UP
PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE
PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE


I have added the OverSubscribe=FORCE option as I want more than one job to be 
able to land on my interactive/simulation queues.

All of the nodes and cluster master start up fine and they all talk to each 
other but no matter what I do, I cannot get my cluster to accept more than one 
job per node.


Can you help me determine where I am going wrong?
Thanks a lot
Jake


The entire slurm.conf is pasted below
# slurm.conf file generated by configurator.html.
ClusterName=pm-slurm
SlurmctldHost=slurm-master
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/home/slurm/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
#
# LOGGING AND ACCOUNTING
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log

# COMPUTE NODES
NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
NodeName=compute001 CPUs=32
NodeName=compute002 CPUs=2

PartitionName=DEFAULT State=UP
PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE
PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE




Re: [slurm-users] New slurm configuration - multiple jobs per host

2022-05-26 Thread Ole Holm Nielsen

Hi Jake,

Firstly, which Slurm version and which OS do you use?

Next, try simplifying by removing the oversubscribe configuration.  Read 
the slurm.conf manual page about oversubscribe; it looks a bit tricky.


The RealMemory=1000 is extremely low and might prevent jobs from 
starting!  Run "slurmd -C" on the nodes to read appropriate node 
parameters for slurm.conf.
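
For example (values here are illustrative, not from your machines), the 
line printed by "slurmd -C" can be copied, minus the trailing UpTime line, 
straight into slurm.conf:

  NodeName=compute001 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000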


I hope this helps.

/Ole


On 26-05-2022 21:12, Jake Jellinek wrote:
> Hi
>
> I am just building my first Slurm setup and have got everything running
> – well, almost.
>
> I have a two node configuration. All of my setup exists on a single
> HyperV server and I have divided up the resources to create my VMs
>
> One node I will use for heavy duty work; this is called compute001
> One node I will use for normal work; this is called compute002
>
> My compute node specification in slurm.conf is
> NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
> NodeName=compute001 CPUs=32
> NodeName=compute002 CPUs=2
>
> The partition specification is
> PartitionName=DEFAULT State=UP
> PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE
> PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE
>
> I have added the OverSubscribe=FORCE option as I want more than one job
> to be able to land on my interactive/simulation queues.
>
> All of the nodes and cluster master start up fine and they all talk to
> each other but no matter what I do, I cannot get my cluster to accept
> more than one job per node.
>
> Can you help me determine where I am going wrong?
> Thanks a lot
> Jake
>
> The entire slurm.conf is pasted below
> # slurm.conf file generated by configurator.html.
> ClusterName=pm-slurm
> SlurmctldHost=slurm-master
> MpiDefault=none
> ProctrackType=proctrack/cgroup
> ReturnToService=2
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=slurm
> StateSaveLocation=/home/slurm/var/spool/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/cgroup
> #
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> #
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory
> #
> # LOGGING AND ACCOUNTING
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/cgroup
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=info
> SlurmdLogFile=/var/log/slurmd.log
>
> # COMPUTE NODES
> NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
> NodeName=compute001 CPUs=32
> NodeName=compute002 CPUs=2
>
> PartitionName=DEFAULT State=UP
> PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE
> PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE






Re: [slurm-users] New slurm configuration - multiple jobs per host

2022-05-26 Thread Jake Jellinek
Hi Ole

I only added the oversubscribe option because without it, it didn’t work - so 
in fact, it appears not to have made any difference

I thought the RealMemory option just said not to offer any jobs to a node that 
didn't have AT LEAST that amount of RAM.
My large node has more than 64GB RAM (and more will be allocated later), but I 
have yet to hit a memory issue…still working on cores


jake@compute001:~$ slurmd -C
NodeName=compute001 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 
ThreadsPerCore=2 RealMemory=64359
UpTime=0-06:58:54


Thanks
Jake

> On 26 May 2022, at 21:11, Ole Holm Nielsen  wrote:
> 
> Hi Jake,
> 
> Firstly, which Slurm version and which OS do you use?
> 
> Next, try simplifying by removing the oversubscribe configuration.  Read the 
> slurm.conf manual page about oversubscribe, it looks a bit tricky.
> 
> The RealMemory=1000 is extremely low and might prevent jobs from starting!  
> Run "slurmd -C" on the nodes to read appropriate node parameters for 
> slurm.conf.
> 
> I hope this helps.
> 
> /Ole
> 
> 
>> On 26-05-2022 21:12, Jake Jellinek wrote:
>> Hi
>> I am just building my first Slurm setup and have got everything running – 
>> well, almost.
>> I have a two node configuration. All of my setup exists on a single HyperV 
>> server and I have divided up the resources to create my VMs
>> One node I will use for heavy duty work; this is called compute001
>> One node I will use for normal work; this is called compute002
>> My compute node specification in slurm.conf is
>> NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
>> NodeName=compute001 CPUs=32
>> NodeName=compute002 CPUs=2
>> The partition specification is
>> PartitionName=DEFAULT State=UP
>> PartitionName=interactive Nodes=compute002 MaxTime=INFINITE 
>> OverSubscribe=FORCE
>> PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE
>> I have added the OverSubscribe=FORCE option as I want more than one job to 
>> be able to land on my interactive/simulation queues.
>> All of the nodes and cluster master start up fine and they all talk to each 
>> other but no matter what I do, I cannot get my cluster to accept more than 
>> one job per node.
>> Can you help me determine where I am going wrong?
>> Thanks a lot
>> Jake
>> The entire slurm.conf is pasted below
>> # slurm.conf file generated by configurator.html.
>> ClusterName=pm-slurm
>> SlurmctldHost=slurm-master
>> MpiDefault=none
>> ProctrackType=proctrack/cgroup
>> ReturnToService=2
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/spool/slurmd
>> SlurmUser=slurm
>> StateSaveLocation=/home/slurm/var/spool/slurmctld
>> SwitchType=switch/none
>> TaskPlugin=task/cgroup
>> #
>> # TIMERS
>> InactiveLimit=0
>> KillWait=30
>> MinJobAge=300
>> SlurmctldTimeout=120
>> SlurmdTimeout=300
>> Waittime=0
>> #
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core_Memory
>> #
>> # LOGGING AND ACCOUNTING
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/cgroup
>> SlurmctldDebug=info
>> SlurmctldLogFile=/var/log/slurmctld.log
>> SlurmdDebug=info
>> SlurmdLogFile=/var/log/slurmd.log
>> # COMPUTE NODES
>> NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
>> NodeName=compute001 CPUs=32
>> NodeName=compute002 CPUs=2
>> PartitionName=DEFAULT State=UP
>> PartitionName=interactive Nodes=compute002 MaxTime=INFINITE 
>> OverSubscribe=FORCE
>> PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE
> 
> 


[slurm-users] Slurm version 22.05 is now available

2022-05-26 Thread Tim Wickberg

We are pleased to announce the availability of Slurm release 22.05.0.

To highlight some new features in 22.05:

- Support for dynamic node addition and removal
  (https://slurm.schedmd.com/dynamic_nodes.html)
- Support for native Linux cgroup v2 operation
- Newly added plugins to support HPE Slingshot 11 networks
  (switch/hpe_slingshot), and Intel Xe GPUs (gpu/oneapi)
- Added new acct_gather_interconnect/sysfs plugin to collect statistics
  from arbitrary network interfaces.
- Expanded and synced set of environment variables available in the
  Prolog/Epilog/PrologSlurmctld/EpilogSlurmctld scripts.
- New "--prefer" option to job submissions to allow for a "soft
  constraint" request to influence node selection (see the sketch
  after this list).
- Optional support for license planning in the backfill scheduler with
  "bf_licenses" option in SchedulerParameters.

The main Slurm documentation site at https://slurm.schedmd.com/ has been 
updated now as well.


Slurm can be downloaded from https://www.schedmd.com/downloads.php .

- Tim

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support