from:"Dan Healy via slurm\-users"

[slurm-users] Question about IB and Ethernet networks

2024-02-25 Thread Dan Healy via slurm-users

Hi Fellow Slurm Users,

This question is not slurm-specific, but it might develop into that.

My question relates to understanding how *typical* HPCs are designed in
terms of networking. To start, is it typical for there to be a high speed
Ethernet *and* Infiniband networks (meaning separate switches, NICs)? I
know you can easily setup IP over IB, but is IB usually fully reserved for
MPI messages? I’m tempted to spec all new HPCs with only a high speed
(200Gbps) IB network, and use IPoIB for all slurm comms with compute nodes.
I plan on using BeeGFS for the file system with RDMA.

Just looking for some feedback, please. Is this OK? Is there a better way?
If yes, please share why it’s better.

Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Question about IB and Ethernet networks

2024-02-26 Thread Dan Healy via slurm-users

I’m very appreciative for each person who’s provided some feedback,
especially the lengthy replies.

Sounds like RoCE capable Ethernet backbone may be the default way to go
*unless* the end users have some specific requirements that might need IB.
At this point, we wouldn’t be interested in anything slower than 200Gbps.
So perhaps Eth and IB are equivalent in terms of latency and RDMA
capabilities, except one is an open standard.

Thanks,

Daniel Healy


On Mon, Feb 26, 2024 at 3:40 AM Cutts, Tim 
wrote:

> My view is that it depends entirely on the workload, and the systems with
> which your compute needs to interact.  A few things I’ve experienced before.
>
>
>
>1. Modern ethernet networks have pretty good latency these days, and
>so MPI codes can run over them.   Whether IB is worth the money is a
>cost/benefit calculation for the codes you want to run.  The ethernet
>network we put in at Sanger in 2016 or so we measured as having similar
>latency, in practice, as FDR infiniband, if I remember correctly.  So it
>wasn’t as good as state-of-the-art IB at the time, but not bad.  Certainly
>good enough for our purposes, and we gained a lot of flexibility through
>software-defined networking, important if you have workloads which require
>better security boundaries than just a big shared network.
>2. If your workload is predominantly single node, embarrassingly
>parallel, you might do better to go with ethernet and invest the saved
>money in more compute nodes.
>3. If you only have ethernet, your cluster will be simpler, and
>require less specialised expertise to run
>4. If your parallel filesystem is Lustre, IB seems to be the more
>well-worn path than ethernet.  We encountered a few Lustre bugs early on
>because of that.
>5. On the other hand, if you need to talk to Weka, ethernet is the
>well-worn path.  Weka’s IB implementation requires the dedication of some
>cores on every client node, so you lose some compute capacity, which you
>don’t need to do if you’re using ethernet.
>
>
>
> So, as any lawyer would say “it depends”.  Most of my career has been in
> genomics, where IB definitely wasn’t necessary.  Now that I’m in pharma,
> there’s more MPI code, so there’s more of a case for it.
>
>
>
> Ultimately, I think you need to run the real benchmarks with real code,
> and as Jason says, work out whether the additional complexity and cost of
> the IB network is worth it for your particular workload.  I don’t think the
> mantra “It’s HPC so it has to be Infiniband” is a given.
>
>
>
> Tim
>
>
>
> --
>
> *Tim Cutts*
>
> Scientific Computing Platform Lead
>
> AstraZeneca
>
>
>
> Find out more about R&D IT Data, Analytics & AI and how we can support you
> by visiting our Service Catalogue
> <https://azcollaboration.sharepoint.com/sites/CMU993> |
>
>
>
>
>
> *From: *Jason Simms via slurm-users 
> *Date: *Monday, 26 February 2024 at 01:13
> *To: *Dan Healy 
> *Cc: *slurm-users@lists.schedmd.com 
> *Subject: *[slurm-users] Re: Question about IB and Ethernet networks
>
> Hello Daniel,
>
>
>
> In my experience, if you have a high-speed interconnect such as IB, you
> would do IPoIB. You would likely still have a "regular" Ethernet connection
> for management purposes, and yes that means both an IB switch and an
> Ethernet switch, but that switch doesn't have to be anything special. Any
> "real" traffic is routed over IB, everything is mounted via IB, etc. That's
> how the last two clusters I've worked with have been configured, and the
> next one will be the same (but will use Omnipath rather than IB). We
> likewise use BeeGFS.
>
>
>
> These next comments are perhaps more likely to encounter differences of
> opinion, but I would say that sufficiently fast Ethernet is often "good
> enough" for most workloads (e.g., MPI). I'd wager that for all but the most
> demanding of workloads, it's entirely acceptable. You'll also save a bit of
> money, of course. HOWEVER, I do think there is, shall we say, an
> expectation from many researchers that any cluster worth its salt will have
> some kind of fast interconnect, even if at the scale of most on-prem work,
> you might be hard-pressed in real-world conditions to notice much of a
> difference. If you're running jobs that take weeks and hundreds of nodes,
> the time (and other) savings may add up, but if we're talking the
> difference between a job running on 5 nodes taking 48 hours vs. slightly
> less, then?? Your mileage may vary, as they say...
>
>
>
> Warmest regards,
>
> Jason
>
>
>
>

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Dan Healy via slurm-users

Are most of us using HAProxy or something else?

On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Magnus,
>
> That is a feature of the load balancer. Most of them have that these days.
>
> Brian Andrus
>
> On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote:
> > On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote:
> >> for us, we put a load balancer in front of the login nodes with
> >> session
> >> affinity enabled. This makes them land on the same backend node each
> >> time.
> > Hi Brian,
> > that sounds interesting - how did you implement session affinity?
> > cheers
> > magnus
> >
> >
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>


-- 
Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Convergence of Kube and Slurm?

2024-05-04 Thread Dan Healy via slurm-users

Bright Cluster Manager has some verbiage on their marketing site that they
can manage a cluster running both Kubernetes and Slurm. Maybe I
misunderstood it. But nevertheless, I am encountering groups more
frequently that want to run a stack of containers that need private
container networking.

What’s the current state of using the same HPC cluster for both Slurm and
Kube?

Note: I’m aware that I can run Kube on a single node, but we need more
resources. So ultimately we need a way to have Slurm and Kube exist in the
same cluster, both sharing the full amount of resources and both being
fully aware of resource usage.

Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Executing srun -n X where X is greater than total CPU in entire cluster

2024-05-16 Thread Dan Healy via slurm-users

Hi there, SLURM community,

I swear I've done this before, but now it's failing on a new cluster I'm
deploying. We have 6 compute nodes with 64 cpu each (384 CPU total). When I
run `srun -n 500 hostname`, the task gets queued since there's not 500
available CPU.

Wasn't there an option that allows for this to be run where the first 384
tasks execute, and then the remaining execute when resources free up?

Here's my conf:

# Slurm Cgroup Configs used on controllers and
workersslurm_cgroup_config:  CgroupAutomount: yes  ConstrainCores: yes
 ConstrainRAMSpace: yes  ConstrainSwapSpace: yes  ConstrainDevices:
yes# Slurm conf file settingsslurm_config:  AccountingStorageType:
"accounting_storage/slurmdbd"  AccountingStorageEnforce: "limits"
AuthAltTypes: "auth/jwt"  ClusterName: "cluster"
AccountingStorageHost : "{{
hostvars[groups['controller'][0]].ansible_hostname }}"  DefMemPerCPU:
1024  InactiveLimit: 120  JobAcctGatherType: "jobacct_gather/cgroup"
JobCompType: "jobcomp/none"  MailProg: "/usr/bin/mail"  MaxArraySize:
4  MaxJobCount: 10  MinJobAge: 3600  ProctrackType:
"proctrack/cgroup"  ReturnToService: 2  SelectType: "select/cons_tres"
 SelectTypeParameters: "CR_Core_Memory"  SlurmctldTimeout: 30
SlurmctldLogFile: "/var/log/slurm/slurmctld.log"  SlurmdLogFile:
"/var/log/slurm/slurmd.log"  SlurmdSpoolDir: "/var/spool/slurm/d"
SlurmUser: "{{ slurm_user.name }}"  SrunPortRange: "6-61000"
StateSaveLocation: "/var/spool/slurm/ctld"  TaskPlugin:
"task/affinity,task/cgroup"  UnkillableStepTimeout: 120


-- 
Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Executing srun -n X where X is greater than total CPU in entire cluster

2024-05-30 Thread Dan Healy via slurm-users

Following up on this in case anyone can provide some insight, please.

On Thu, May 16, 2024 at 8:32 AM Dan Healy  wrote:

> Hi there, SLURM community,
>
> I swear I've done this before, but now it's failing on a new cluster I'm
> deploying. We have 6 compute nodes with 64 cpu each (384 CPU total). When I
> run `srun -n 500 hostname`, the task gets queued since there's not 500
> available CPU.
>
> Wasn't there an option that allows for this to be run where the first 384
> tasks execute, and then the remaining execute when resources free up?
>
> Here's my conf:
>
> # Slurm Cgroup Configs used on controllers and workersslurm_cgroup_config:  
> CgroupAutomount: yes  ConstrainCores: yes  ConstrainRAMSpace: yes  
> ConstrainSwapSpace: yes  ConstrainDevices: yes# Slurm conf file 
> settingsslurm_config:  AccountingStorageType: "accounting_storage/slurmdbd"  
> AccountingStorageEnforce: "limits"  AuthAltTypes: "auth/jwt"  ClusterName: 
> "cluster"  AccountingStorageHost : "{{ 
> hostvars[groups['controller'][0]].ansible_hostname }}"  DefMemPerCPU: 1024  
> InactiveLimit: 120  JobAcctGatherType: "jobacct_gather/cgroup"  JobCompType: 
> "jobcomp/none"  MailProg: "/usr/bin/mail"  MaxArraySize: 4  MaxJobCount: 
> 10  MinJobAge: 3600  ProctrackType: "proctrack/cgroup"  ReturnToService: 
> 2  SelectType: "select/cons_tres"  SelectTypeParameters: "CR_Core_Memory"  
> SlurmctldTimeout: 30  SlurmctldLogFile: "/var/log/slurm/slurmctld.log"  
> SlurmdLogFile: "/var/log/slurm/slurmd.log"  SlurmdSpoolDir: 
> "/var/spool/slurm/d"  SlurmUser: "{{ slurm_user.name }}"  SrunPortRange: 
> "6-61000"  StateSaveLocation: "/var/spool/slurm/ctld"  TaskPlugin: 
> "task/affinity,task/cgroup"  UnkillableStepTimeout: 120
>
>
> --
> Thanks,
>
> Daniel Healy
>


-- 
Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Can SLURM queue different jobs to start concurrently?

2024-07-08 Thread Dan Healy via slurm-users

Hi there,

I've received a question from an end user, which I presume the answer is
"No", but would like to ask the community first.

Scenario: The user wants to create a series of jobs that all need to start
at the same time. Example: there are 10 different executable applications
which have varying CPU and RAM constraints, all of which need to
communicate via TCP/IP. Of course the user could design some type of
idle/statusing mechanism to wait until all jobs are *randomly *started,
then begin execution, but this feels like a waste of resources. The
complete execution of these 10 applications would be considered a single
simulation. The goal would be to distribute these 10 applications across
the cluster and not necessarily require them all to execute on a single
node.

Is there a good architecture for this using SLURM? If so, please kindly
point me in the right direction.

-- 
Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: getting slurm going

2024-12-08 Thread Dan Healy via slurm-users

sinfo
srun hostname

Thanks,

Daniel Healy


On Sun, Dec 8, 2024 at 2:30 PM Steven Jones via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> What tests can I do to prove that slurm is talking to the nodes pls?
>
>
>
>
> regards
>
> Steven
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Unexpected node got allocation

2025-01-09 Thread Dan Healy via slurm-users

Hello there and good morning from Baltimore.

I have a small cluster with 100 nodes. When the cluster is completely empty
of all jobs, the first job gets allocated to node 41. In other clusters,
the first job gets allocated to mode 01. If I specify node 01, the
allocation works perfectly. I have my partition NodeName set as
node[01-99], so having node41 used first is a surprise to me. We also have
many other partitions which start with node41, but the partition being used
for the allocation starts with node01.

Does anyone know what would cause this?

Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Unexpected node got allocation

2025-01-09 Thread Dan Healy via slurm-users

No, sadly there’s no topology.conf in use.

Thanks,

Daniel Healy


On Thu, Jan 9, 2025 at 8:28 AM Steffen Grunewald <
steffen.grunew...@aei.mpg.de> wrote:

> On Thu, 2025-01-09 at 07:51:40 -0500, Slurm users wrote:
> > Hello there and good morning from Baltimore.
> >
> > I have a small cluster with 100 nodes. When the cluster is completely
> empty
> > of all jobs, the first job gets allocated to node 41. In other clusters,
> > the first job gets allocated to mode 01. If I specify node 01, the
> > allocation works perfectly. I have my partition NodeName set as
> > node[01-99], so having node41 used first is a surprise to me. We also
> have
> > many other partitions which start with node41, but the partition being
> used
> > for the allocation starts with node01.
> >
> > Does anyone know what would cause this?
>
> Just a wild guess, but do you have a topology.conf file that somehow makes
> this node look most reasonable to use for a single-node job?
> (Topology attempts to assign, or hold back, sections of your network to
> maximize interconnect bandwidth for multi-node jobs. Your node41 might be
> one - or the first one of a series - that would leave bigger chunks unused
> for bigger tasks.)
>
> HTH,
>  Steffen
>
> --
> Steffen Grunewald, Cluster Administrator
> Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
> Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
> ~~~
> Fon: +49-331-567 7274
> Mail: steffen.grunewald(at)aei.mpg.de
> ~~~
>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Priority/Top seems to not be working

2025-03-24 Thread Dan Healy via slurm-users

Hi Slurm Users,

I have a newer install (23.11.3) and the priority of most all jobs is 1 and
other small numbers. In previous versions, I would see numbers like 2^32. I
have the multifactor plugin configured and confirmed it's in-use when I
show the config.

When I run `scontrol top` for a given job, the priority number doesn't
change from 1. I have lots of jobs running at the moment and many queued as
well. I've been using `scontrol top` for years and have seen the priority
successfully change. I'm repeating this now with the newer version of Slurm
and now seeing it work the same.

Are you seeing this too?


-- 
Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] slurmrestd equivalent to "srun -n 10 echo HELLO"

2025-03-24 Thread Dan Healy via slurm-users

Hi Slurm Community,

I'm starting to experiment with slurmrestd for a new app we're writing. I'm
having trouble understanding one aspect of submitting jobs.

When I run something like `srun -n 10 echo HELLO', I get HELLO returned to
my console/stdout 10x.
When I submit this command as a script to the /jobs/submit route, I get
success/200, but *I cannot determine how to get the console output of HELLO
10x in any form*. It's not in my stdout log for that job even though I can
verify that the job ran successfully.

Any suggestions?

-- 
Thanks,

Daniel Healy

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Question about IB and Ethernet networks

[slurm-users] Re: Question about IB and Ethernet networks

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

[slurm-users] Convergence of Kube and Slurm?

[slurm-users] Executing srun -n X where X is greater than total CPU in entire cluster

[slurm-users] Re: Executing srun -n X where X is greater than total CPU in entire cluster

[slurm-users] Can SLURM queue different jobs to start concurrently?

[slurm-users] Re: getting slurm going

[slurm-users] Unexpected node got allocation

[slurm-users] Re: Unexpected node got allocation

[slurm-users] Priority/Top seems to not be working

[slurm-users] slurmrestd equivalent to "srun -n 10 echo HELLO"

12 matches

Site Navigation

Mail list logo

Footer information