[slurm-users] Question about IB and Ethernet networks
Hi Fellow Slurm Users,

This question is not Slurm-specific, but it might develop into that. My question relates to understanding how *typical* HPC clusters are designed in terms of networking. To start: is it typical to have both a high-speed Ethernet network *and* an InfiniBand network (meaning separate switches and NICs)? I know you can easily set up IP over IB, but is IB usually reserved entirely for MPI traffic?

I'm tempted to spec all new HPC clusters with only a high-speed (200 Gbps) IB network and use IPoIB for all Slurm communication with the compute nodes. I plan on using BeeGFS for the file system, with RDMA.

Just looking for some feedback, please. Is this OK? Is there a better way? If yes, please share why it's better.

Thanks,

Daniel Healy
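P.S. For concreteness, here is a minimal sketch of what I mean by "IPoIB for all Slurm comms" on the slurm.conf side. The hostnames, addresses and node sizes below are all made up; the idea is just that NodeAddr resolves to each node's ib0 (IPoIB) address:

    SlurmctldHost=head01(10.20.0.250)      # hypothetical controller address on the IPoIB subnet
    NodeName=node[01-10] NodeAddr=nodeib[01-10] CPUs=64 RealMemory=256000 State=UNKNOWN
    PartitionName=main Nodes=node[01-10] Default=YES State=UP

With that, slurmctld/slurmd and srun traffic would ride the IB fabric, while an ordinary 1GbE network could remain for provisioning and BMC access.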
[slurm-users] Re: Question about IB and Ethernet networks
I'm very appreciative of each person who's provided feedback, especially the lengthy replies. It sounds like a RoCE-capable Ethernet backbone may be the default way to go *unless* the end users have specific requirements that call for IB. At this point, we wouldn't be interested in anything slower than 200 Gbps. So perhaps Ethernet and IB are roughly equivalent in terms of latency and RDMA capability, except that one is an open standard.

Thanks,

Daniel Healy

On Mon, Feb 26, 2024 at 3:40 AM Cutts, Tim wrote:

> My view is that it depends entirely on the workload, and the systems with which your compute needs to interact. A few things I've experienced before:
>
> 1. Modern Ethernet networks have pretty good latency these days, so MPI codes can run over them. Whether IB is worth the money is a cost/benefit calculation for the codes you want to run. The Ethernet network we put in at Sanger in 2016 or so we measured as having similar latency, in practice, to FDR InfiniBand, if I remember correctly. So it wasn't as good as state-of-the-art IB at the time, but not bad. Certainly good enough for our purposes, and we gained a lot of flexibility through software-defined networking, important if you have workloads which require better security boundaries than just a big shared network.
> 2. If your workload is predominantly single-node, embarrassingly parallel, you might do better to go with Ethernet and invest the saved money in more compute nodes.
> 3. If you only have Ethernet, your cluster will be simpler and require less specialised expertise to run.
> 4. If your parallel filesystem is Lustre, IB seems to be the more well-worn path than Ethernet. We encountered a few Lustre bugs early on because of that.
> 5. On the other hand, if you need to talk to Weka, Ethernet is the well-worn path. Weka's IB implementation requires dedicating some cores on every client node, so you lose some compute capacity, which you don't need to do if you're using Ethernet.
>
> So, as any lawyer would say, "it depends". Most of my career has been in genomics, where IB definitely wasn't necessary. Now that I'm in pharma, there's more MPI code, so there's more of a case for it.
>
> Ultimately, I think you need to run the real benchmarks with real code and, as Jason says, work out whether the additional complexity and cost of the IB network is worth it for your particular workload. I don't think the mantra "It's HPC so it has to be InfiniBand" is a given.
>
> Tim
>
> --
>
> *Tim Cutts*
> Scientific Computing Platform Lead
> AstraZeneca
>
> Find out more about R&D IT Data, Analytics & AI and how we can support you by visiting our Service Catalogue <https://azcollaboration.sharepoint.com/sites/CMU993>
>
> *From: *Jason Simms via slurm-users
> *Date: *Monday, 26 February 2024 at 01:13
> *To: *Dan Healy
> *Cc: *slurm-users@lists.schedmd.com
> *Subject: *[slurm-users] Re: Question about IB and Ethernet networks
>
> Hello Daniel,
>
> In my experience, if you have a high-speed interconnect such as IB, you would do IPoIB. You would likely still have a "regular" Ethernet connection for management purposes, and yes, that means both an IB switch and an Ethernet switch, but that switch doesn't have to be anything special. Any "real" traffic is routed over IB, everything is mounted via IB, etc. That's how the last two clusters I've worked with have been configured, and the next one will be the same (but will use Omni-Path rather than IB).
> We likewise use BeeGFS.
>
> These next comments are perhaps more likely to encounter differences of opinion, but I would say that sufficiently fast Ethernet is often "good enough" for most workloads (e.g., MPI). I'd wager that for all but the most demanding of workloads, it's entirely acceptable. You'll also save a bit of money, of course. HOWEVER, I do think there is, shall we say, an expectation from many researchers that any cluster worth its salt will have some kind of fast interconnect, even if at the scale of most on-prem work, you might be hard-pressed in real-world conditions to notice much of a difference. If you're running jobs that take weeks and hundreds of nodes, the time (and other) savings may add up, but if we're talking the difference between a job running on 5 nodes taking 48 hours vs. slightly less, then?? Your mileage may vary, as they say...
>
> Warmest regards,
>
> Jason
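Tim's "run the real benchmarks with real code" point is the part I can act on immediately. The plan is to compare fabrics with something like the OSU point-to-point micro-benchmarks across two nodes; a sketch only, assuming the binaries are built against our MPI stack and that srun's PMIx plugin is available:

    srun -N 2 -n 2 --mpi=pmix ./osu_latency   # small-message latency between two nodes
    srun -N 2 -n 2 --mpi=pmix ./osu_bw        # point-to-point bandwidth

Running the same pair of tests over IPoIB, native IB verbs and RoCE should make the latency comparison concrete for our own codes.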
[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?
Are most of us using HAProxy, or something else?

On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote:

> Magnus,
>
> That is a feature of the load balancer. Most of them have that these days.
>
> Brian Andrus
>
> On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote:
> > On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote:
> >> For us, we put a load balancer in front of the login nodes with session affinity enabled. This makes them land on the same backend node each time.
> > Hi Brian,
> > that sounds interesting - how did you implement session affinity?
> > cheers
> > magnus

--
Thanks,

Daniel Healy
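P.S. In case it helps anyone searching the archive later, the kind of thing I had in mind with HAProxy is plain source-IP stickiness in TCP mode. A sketch only; the names, addresses and timeouts below are placeholders, not taken from this thread:

    defaults
        mode tcp
        timeout connect 5s
        timeout client  12h
        timeout server  12h

    frontend login_ssh
        bind 192.0.2.10:22            # the VIP users ssh to
        default_backend login_nodes

    backend login_nodes
        balance source                # same client IP -> same login node
        stick-table type ip size 200k expire 8h
        stick on src                  # remember the mapping across reconnects
        server login1 10.0.0.11:22 check
        server login2 10.0.0.12:22 check

The stick table keeps a returning user (and their screen/tmux session) on the same login node, which sounds like the affinity behaviour Brian describes.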
[slurm-users] Convergence of Kube and Slurm?
Bright Cluster Manager has some verbiage on its marketing site saying it can manage a cluster running both Kubernetes and Slurm. Maybe I misunderstood it. Nevertheless, I am more frequently encountering groups that want to run a stack of containers that needs private container networking. What's the current state of using the same HPC cluster for both Slurm and Kubernetes?

Note: I'm aware that I can run Kubernetes on a single node, but we need more resources than that. So ultimately we need a way for Slurm and Kubernetes to coexist in the same cluster, both sharing the full set of resources and both being fully aware of resource usage.

Thanks,

Daniel Healy
[slurm-users] Executing srun -n X where X is greater than total CPU in entire cluster
Hi there, SLURM community,

I swear I've done this before, but now it's failing on a new cluster I'm deploying. We have 6 compute nodes with 64 CPUs each (384 CPUs total). When I run `srun -n 500 hostname`, the job gets queued since there aren't 500 available CPUs.

Wasn't there an option that allows this to run anyway, where the first 384 tasks execute and then the remainder execute as resources free up?

Here's my conf:

# Slurm Cgroup Configs used on controllers and workers
slurm_cgroup_config:
  CgroupAutomount: yes
  ConstrainCores: yes
  ConstrainRAMSpace: yes
  ConstrainSwapSpace: yes
  ConstrainDevices: yes

# Slurm conf file settings
slurm_config:
  AccountingStorageType: "accounting_storage/slurmdbd"
  AccountingStorageEnforce: "limits"
  AuthAltTypes: "auth/jwt"
  ClusterName: "cluster"
  AccountingStorageHost: "{{ hostvars[groups['controller'][0]].ansible_hostname }}"
  DefMemPerCPU: 1024
  InactiveLimit: 120
  JobAcctGatherType: "jobacct_gather/cgroup"
  JobCompType: "jobcomp/none"
  MailProg: "/usr/bin/mail"
  MaxArraySize: 4
  MaxJobCount: 10
  MinJobAge: 3600
  ProctrackType: "proctrack/cgroup"
  ReturnToService: 2
  SelectType: "select/cons_tres"
  SelectTypeParameters: "CR_Core_Memory"
  SlurmctldTimeout: 30
  SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
  SlurmdLogFile: "/var/log/slurm/slurmd.log"
  SlurmdSpoolDir: "/var/spool/slurm/d"
  SlurmUser: "{{ slurm_user.name }}"
  SrunPortRange: "6-61000"
  StateSaveLocation: "/var/spool/slurm/ctld"
  TaskPlugin: "task/affinity,task/cgroup"
  UnkillableStepTimeout: 120

--
Thanks,

Daniel Healy
[slurm-users] Re: Executing srun -n X where X is greater than total CPU in entire cluster
Following up on this in case anyone can provide some insight, please.

On Thu, May 16, 2024 at 8:32 AM Dan Healy wrote:

> Hi there, SLURM community,
>
> I swear I've done this before, but now it's failing on a new cluster I'm deploying. We have 6 compute nodes with 64 CPUs each (384 CPUs total). When I run `srun -n 500 hostname`, the job gets queued since there aren't 500 available CPUs.
>
> Wasn't there an option that allows this to run anyway, where the first 384 tasks execute and then the remainder execute as resources free up?
>
> Here's my conf:
>
> # Slurm Cgroup Configs used on controllers and workers
> slurm_cgroup_config:
>   CgroupAutomount: yes
>   ConstrainCores: yes
>   ConstrainRAMSpace: yes
>   ConstrainSwapSpace: yes
>   ConstrainDevices: yes
>
> # Slurm conf file settings
> slurm_config:
>   AccountingStorageType: "accounting_storage/slurmdbd"
>   AccountingStorageEnforce: "limits"
>   AuthAltTypes: "auth/jwt"
>   ClusterName: "cluster"
>   AccountingStorageHost: "{{ hostvars[groups['controller'][0]].ansible_hostname }}"
>   DefMemPerCPU: 1024
>   InactiveLimit: 120
>   JobAcctGatherType: "jobacct_gather/cgroup"
>   JobCompType: "jobcomp/none"
>   MailProg: "/usr/bin/mail"
>   MaxArraySize: 4
>   MaxJobCount: 10
>   MinJobAge: 3600
>   ProctrackType: "proctrack/cgroup"
>   ReturnToService: 2
>   SelectType: "select/cons_tres"
>   SelectTypeParameters: "CR_Core_Memory"
>   SlurmctldTimeout: 30
>   SlurmctldLogFile: "/var/log/slurm/slurmctld.log"
>   SlurmdLogFile: "/var/log/slurm/slurmd.log"
>   SlurmdSpoolDir: "/var/spool/slurm/d"
>   SlurmUser: "{{ slurm_user.name }}"
>   SrunPortRange: "6-61000"
>   StateSaveLocation: "/var/spool/slurm/ctld"
>   TaskPlugin: "task/affinity,task/cgroup"
>   UnkillableStepTimeout: 120
>
> --
> Thanks,
>
> Daniel Healy

--
Thanks,

Daniel Healy
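P.S. For anyone who finds this thread later: I haven't found the exact option I was remembering, but two things that are easy to test (a sketch only, and subject to the MaxArraySize/MaxJobCount limits in slurm.conf):

    srun -n 500 --overcommit hostname            # oversubscribe: more than one task per allocated CPU
    sbatch --array=1-500%384 --wrap='hostname'   # throttled array: at most 384 elements run at once

Neither is quite "launch 384 tasks now and the rest as CPUs free up" within a single srun, but the throttled array gets a similar effect, at the cost of one array task (a separate job) per command.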
[slurm-users] Can SLURM queue different jobs to start concurrently?
Hi there,

I've received a question from an end user, to which I presume the answer is "no", but I'd like to ask the community first.

Scenario: The user wants to create a series of jobs that all need to start at the same time. Example: there are 10 different executable applications with varying CPU and RAM constraints, all of which need to communicate via TCP/IP. Of course, the user could design some kind of idle/status mechanism to wait until all the jobs have *randomly* started before beginning execution, but this feels like a waste of resources. The complete execution of these 10 applications would be considered a single simulation. The goal is to distribute the 10 applications across the cluster rather than requiring them all to execute on a single node.

Is there a good architecture for this using Slurm? If so, please kindly point me in the right direction.

--
Thanks,

Daniel Healy
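P.S. One pattern I'm considering proposing back to the user (a sketch, not something we've run yet; the application names and sizes are invented): put the whole simulation inside a single job allocation sized for the sum of the 10 applications, then launch each application as its own job step in the background. The steps only start once the one job has all of its resources, so they effectively start together, and each step can still get its own CPU/memory slice:

    #!/bin/bash
    #SBATCH --job-name=coupled-sim
    #SBATCH --ntasks=10
    #SBATCH --cpus-per-task=4
    #SBATCH --mem-per-cpu=2G

    # one step per application; the per-step --cpus-per-task/--mem can differ if the apps do
    for i in $(seq -w 1 10); do
        srun --ntasks=1 --cpus-per-task=4 --exact ./app_${i} &
    done
    wait   # the simulation is over when every application has exited

Slurm's heterogeneous-job support (components separated by "hetjob") looks like the other obvious route if the 10 resource shapes differ too much to express as uniform steps.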
[slurm-users] Re: getting slurm going
sinfo
srun hostname

Thanks,

Daniel Healy

On Sun, Dec 8, 2024 at 2:30 PM Steven Jones via slurm-users <slurm-users@lists.schedmd.com> wrote:

> What tests can I do to prove that slurm is talking to the nodes pls?
>
> regards
>
> Steven
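A slightly longer checklist than the two commands above, in case it's useful (nothing exotic, just standard client commands; the node name is a placeholder):

    scontrol ping                                # is slurmctld answering at all?
    sinfo -Nl                                    # per-node state as the controller sees it
    srun -N2 -l hostname                         # run a trivial task on two nodes; -l labels output per task
    scontrol show node node01 | grep -i reason   # why a node is DOWN or DRAINED, if it is

If `srun hostname` hangs, in my experience it's usually name resolution or a firewall between slurmctld, slurmd and the submit host rather than Slurm itself.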
[slurm-users] Unexpected node got allocation
Hello there, and good morning from Baltimore.

I have a small cluster with 100 nodes. When the cluster is completely empty of jobs, the first job gets allocated to node41. In other clusters, the first job gets allocated to node01. If I explicitly specify node01, the allocation works perfectly. I have my partition's NodeName set as node[01-99], so having node41 used first is a surprise to me. We also have many other partitions which start with node41, but the partition being used for this allocation starts with node01.

Does anyone know what would cause this?

Thanks,

Daniel Healy
[slurm-users] Re: Unexpected node got allocation
No, sadly there's no topology.conf in use.

Thanks,

Daniel Healy

On Thu, Jan 9, 2025 at 8:28 AM Steffen Grunewald <steffen.grunew...@aei.mpg.de> wrote:

> On Thu, 2025-01-09 at 07:51:40 -0500, Slurm users wrote:
> > Hello there and good morning from Baltimore.
> >
> > I have a small cluster with 100 nodes. When the cluster is completely empty of jobs, the first job gets allocated to node41. In other clusters, the first job gets allocated to node01. If I specify node01, the allocation works perfectly. I have my partition NodeName set as node[01-99], so having node41 used first is a surprise to me. We also have many other partitions which start with node41, but the partition being used for the allocation starts with node01.
> >
> > Does anyone know what would cause this?
>
> Just a wild guess, but do you have a topology.conf file that somehow makes this node look most reasonable to use for a single-node job?
> (Topology attempts to assign, or hold back, sections of your network to maximize interconnect bandwidth for multi-node jobs. Your node41 might be one - or the first one of a series - that would leave bigger chunks unused for bigger tasks.)
>
> HTH,
> Steffen
>
> --
> Steffen Grunewald, Cluster Administrator
> Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
> Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
> ~~~
> Fon: +49-331-567 7274
> Mail: steffen.grunewald(at)aei.mpg.de
> ~~~
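Since topology.conf is ruled out, the other things I plan to rule out (not suggestions from the thread, just standard knobs that affect pick order) are per-node Weight values, since lower-weight nodes are allocated first, and anything like CR_LLN in SelectTypeParameters:

    scontrol show node node01,node41 | grep -E 'NodeName|Weight|State|Partitions'
    grep -Ei 'weight|topology|CR_LLN' /etc/slurm/slurm.conf

The node names above just follow this thread's naming; the slurm.conf path may differ on your install.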
[slurm-users] Priority/Top seems to not be working
Hi Slurm Users,

I have a newer install (23.11.3) and the priority of almost all jobs is 1, with a few other small numbers. In previous versions, I would see numbers like 2^32. I have the multifactor plugin configured and have confirmed it's in use when I show the config. When I run `scontrol top` for a given job, the priority number doesn't change from 1. I have lots of jobs running at the moment and many queued as well. I've been using `scontrol top` for years and have seen the priority change successfully. I'm repeating this now with the newer version of Slurm and not seeing it work the same. Are you seeing this too?

--
Thanks,

Daniel Healy
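P.S. For anyone comparing notes, this is what I'm looking at while debugging (a sketch; <jobid> is a placeholder):

    scontrol show config | grep -i '^Priority'   # PriorityType plus the PriorityWeight* values
    sprio -l -j <jobid>                          # per-factor priority breakdown for a pending job
    scontrol top <jobid>; sprio -j <jobid>       # check whether top actually moved anything

In particular, I want to confirm the PriorityWeight* factors aren't all zero, since with nothing weighted the multifactor plugin has nothing to add up.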
[slurm-users] slurmrestd equivalent to "srun -n 10 echo HELLO"
Hi Slurm Community,

I'm starting to experiment with slurmrestd for a new app we're writing, and I'm having trouble understanding one aspect of submitting jobs. When I run something like `srun -n 10 echo HELLO`, I get HELLO returned to my console/stdout 10 times. When I submit this command as a script to the /jobs/submit route, I get success/200, but *I cannot determine how to get the console output of HELLO 10x in any form*. It's not in my stdout log for that job, even though I can verify that the job ran successfully.

Any suggestions?

--
Thanks,

Daniel Healy
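P.S. In case someone else hits the same wall: as far as I can tell, the REST API only hands back the job id, not the job's stdout; the output of the embedded srun lands in whatever file the job's standard_output points at (or the default slurm-<jobid>.out in the working directory), so the file has to be fetched separately. A sketch of the submission I'm testing, assuming the v0.0.40 OpenAPI plugin and JWT auth; the host, port, and /home/daniel paths are placeholders, and the field names should be checked against the schema your slurmrestd reports:

    curl -s -X POST "http://head01:6820/slurm/v0.0.40/job/submit" \
      -H "X-SLURM-USER-NAME: ${USER}" \
      -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" \
      -H "Content-Type: application/json" \
      -d '{
            "job": {
              "name": "hello",
              "tasks": 10,
              "current_working_directory": "/home/daniel",
              "standard_output": "/home/daniel/hello-%j.out",
              "environment": ["PATH=/bin:/usr/bin"]
            },
            "script": "#!/bin/bash\nsrun echo HELLO"
          }'

With current_working_directory and standard_output set explicitly, the ten HELLO lines should show up in the named file once the job completes.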