[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-03 Thread Renfro, Michael via slurm-users
We only do isolated on the students’ VirtualBox setups because it’s simpler for them to get started with. Our production HPC with OpenHPC is definitely integrated with our Active Directory (directly via sssd, not with an intermediate product), etc. Not everyone does it that way, but our scale is…

[slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld

2025-02-03 Thread Renfro, Michael via slurm-users
Late to the party here, but depending on how much time you have invested, how much you can tolerate reformats or other more destructive work, etc., you might consider OpenHPC and its install guide ([1] for RHEL 8 variants, [2] or [3] for RHEL 9 variants, depending on which version of Warewulf yo…

[slurm-users] Re: How can I make sure my user have only one job per node (Job array --exclusive=user,)

2024-12-03 Thread Renfro, Michael via slurm-users
As Thomas had mentioned earlier in the thread, there is --exclusive with no extra additions. But that’d prevent *every* other job from running on that node, which unless this is a cluster for you and you alone, sounds like wasting 90% of the resources. I’d be most perturbed at a user doing that…

[slurm-users] Re: How can I make sure my user have only one job per node (Job array --exclusive=user,)

2024-12-03 Thread Renfro, Michael via slurm-users
I’ve never done this myself, but others probably have. At the end of [1], there’s an example of making a generic resource for bandwidth. You could set that to any convenient units (bytes/second or bits/second, most likely), and assign your nodes a certain amount. Then any network-intensive job c…
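The generic-resource approach described above can be sketched as follows. The GRES name, node names, and unit counts are illustrative assumptions, not values from the thread; the pattern follows the count-only GRES example in the Slurm gres.conf documentation:

```
# slurm.conf (fragment) -- register the GRES and give each node a budget
GresTypes=bandwidth
NodeName=node[01-10] Gres=bandwidth:1000

# gres.conf (fragment) -- no device files back this resource, so count-only
NodeName=node[01-10] Name=bandwidth Count=1000 Flags=CountOnly
```

A network-intensive job would then claim part of a node's budget, e.g. `sbatch --gres=bandwidth:500 job.sh`, and Slurm would avoid co-scheduling jobs whose combined requests exceed the node's count.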

[slurm-users] Re: How can I make sure my user have only one job per node (Job array --exclusive=user,)

2024-12-03 Thread Renfro, Michael via slurm-users
I’ll start with the question of “why spread the jobs out more than required?” and move on to why the other items didn’t work: 1. exclusive only ensures that others’ jobs don’t run on a node with your jobs, and does nothing about other jobs you own. 2. spread-job distributes the work of on…

[slurm-users] Re: How does --nodes=min[-max] determine number of nodes to allocate?

2024-10-08 Thread Renfro, Michael via slurm-users
Not so much about the source, but in the sbatch documentation [1], I think the --begin and --nodes parameters might interact. And yes, this is semi-educated speculation on my part. From the nodes= section, “The job will be allocated as many nodes as possible within the range specified and witho…

[slurm-users] Re: Jobs pending with reason "priority" but nodes are idle

2024-09-25 Thread Renfro, Michael via slurm-users
Since nobody replied after this: if the nodes are incapable of running the jobs due to insufficient resources, the default “EnforcePartLimits=No” [1] might be the issue. That setting can allow a job to stay queued even if it’s impossible to run. [1] https://slurm.schedmd.com/slurm.conf.…
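A minimal sketch of the setting in question; the option names are from the slurm.conf man page, and whether ALL or ANY is appropriate depends on the site:

```
# slurm.conf (fragment)
# Default is NO: a job exceeding partition limits stays queued indefinitely.
# ALL rejects the job at submit time unless it fits every requested partition;
# ANY requires it to fit at least one of them.
EnforcePartLimits=ALL
```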

[slurm-users] Re: Jobs pending with reason "priority" but nodes are idle

2024-09-24 Thread Renfro, Michael via slurm-users
Do you have backfill scheduling [1] enabled? If so, what settings are in place? Lower-priority jobs are eligible for backfill only if they don’t delay the start of the higher-priority jobs. So what kind of resources and time does a given array job require? Odds are, they…
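For reference, a hedged sketch of what a backfill configuration can look like; the parameter values are illustrative, not a recommendation:

```
# slurm.conf (fragment)
SchedulerType=sched/backfill
# bf_window: how far ahead (in minutes) the backfill planner looks;
#            it should cover the partition's longest time limit
# bf_continue: keep scanning the queue after periodically releasing locks
SchedulerParameters=bf_window=4320,bf_continue,bf_max_job_test=1000
```

Backfill also depends on jobs carrying accurate --time limits: without them, the scheduler can't prove a short job would finish before the blocked job's planned start.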

[slurm-users] Re: Jobs pending with reason "priority" but nodes are idle

2024-09-24 Thread Renfro, Michael via slurm-users
In theory, if jobs are pending with “Priority”, one or more other jobs will be pending with “Resources”. So a few questions: 1. What are the “Resources” jobs waiting on, resource-wise? 2. When are they scheduled to start? 3. Can your array jobs backfill into the idle resources and fini…
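The three questions above can be answered from the command line; the format string uses standard squeue output specifiers, and the job ID is a placeholder:

```shell
# Questions 1 and 2: list pending jobs with their reason and expected start time
squeue --state=PENDING --format="%.12i %.9P %.20j %.10r %.20S"

# Question 3: compare an array job's time and resource request against the gap
scontrol show job <PENDING_JOB_ID> | grep -E 'TimeLimit|ReqTRES|StartTime'
```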

[slurm-users] Re: FairShare if there's only one account?

2024-08-09 Thread Renfro, Michael via slurm-users
..etc etc etc... Does that look right? On Aug 9, 2024, at 4:05 PM, Renfro, Michael via slurm-users wrote: I don’t have any 21.08 systems to verify with, but that’s how I remember it. Use “sshare -a -A mic” to verify. You should see both a RawShares…

[slurm-users] Re: FairShare if there's only one account?

2024-08-09 Thread Renfro, Michael via slurm-users
I don’t have any 21.08 systems to verify with, but that’s how I remember it. Use “sshare -a -A mic” to verify. You should see both a RawShares and a NormShares column for each user. By default they’ll all have the same value, but they can be adjusted if needed. From: Drucker, Daniel via slurm-u…
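A sketch of the verification step; the account name “mic” comes from the message, and the column names are from current sshare output, so they may differ slightly on 21.08:

```shell
# -a includes user-level rows, -A restricts output to the named account
sshare -a -A mic
# Expected columns include: Account, User, RawShares, NormShares,
# RawUsage, EffectvUsage, FairShare
```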

[slurm-users] Re: The issue in the distribution of job

2024-08-09 Thread Renfro, Michael via slurm-users
It may be difficult to narrow down the problem without knowing what commands you're running inside the salloc session. For example, if it's a pure OpenMP program, it can't use more than one node. From: Sundaram Kumaran via slurm-users Sent: Friday, August 9, 2024…

[slurm-users] Re: Software builds using slurm

2024-06-10 Thread Renfro, Michael via slurm-users
At a certain point, you’re talking about workflow orchestration. Snakemake [1] and its slurm executor plugin [2] may be a starting point, especially since Snakemake is a local-by-default tool. I wouldn’t try reproducing your entire “make” workflow in Snakemake. Instead, I’d define the roughly 60…
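A minimal Snakefile of that shape might look like the sketch below; the package list, build script name, and resource values are all hypothetical:

```
# Snakefile (sketch) -- one rule instance per top-level build target
PACKAGES = ["pkg_a", "pkg_b"]  # hypothetical list of independent targets

rule all:
    input: expand("build/{pkg}.done", pkg=PACKAGES)

rule build_package:
    output: "build/{pkg}.done"
    resources: cpus_per_task=4, mem_mb=8000, runtime=120  # per-job Slurm request
    shell: "./build_one.sh {wildcards.pkg} && touch {output}"
```

With the executor plugin installed, something like `snakemake --executor slurm --jobs 60` would submit each target as its own Slurm job while Snakemake tracks the dependency graph locally.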

[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Renfro, Michael via slurm-users
Forgot to add that Debian/Ubuntu packages are pretty much whatever version was stable at the time of the Debian/Ubuntu .0 release. They’ll backport security fixes to those older versions as needed, but they never change versions unless absolutely required. The backports repositories may have lo…

[slurm-users] Re: Location of Slurm source packages?

2024-05-15 Thread Renfro, Michael via slurm-users
Debian/Ubuntu sources can always be found in at least two ways: 1. Pages like https://packages.ubuntu.com/jammy/slurm-wlm (see the .dsc, .orig.tar.gz, and .debian.tar.xz links there). 2. Commands like ‘apt-get source slurm-wlm’ (may require ‘dpkg-dev’ or other packages – probably easiest…
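The second route can be sketched as follows; this assumes deb-src lines are enabled in the apt sources list:

```shell
sudo apt-get install dpkg-dev    # provides dpkg-source for unpacking
apt-get source slurm-wlm         # fetches and unpacks the .dsc, .orig.tar.gz,
                                 # and .debian.tar.xz into the current directory
```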

[slurm-users] Re: [EXT] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
Yep, from your scontrol show node output: CfgTRES=cpu=64,mem=2052077M,billing=64 AllocTRES=cpu=1,mem=2052077M The running job (77) has allocated 1 CPU and all the memory on the node. That’s probably due to the partition using the default DefMemPerCPU value [1], which is unlimited. Since all ou…
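A hedged sketch of the fix; the partition name is a placeholder and the per-CPU figure is illustrative, chosen so that 64 single-CPU jobs roughly account for the node’s memory:

```
# slurm.conf (fragment) -- stop single-CPU jobs from defaulting to all memory
# 2052077 MB / 64 CPUs ~= 32000 MB per CPU (other partition options omitted)
PartitionName=batch DefMemPerCPU=32000
```

DefMemPerCPU can also be set globally in slurm.conf rather than per partition.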

[slurm-users] Re: SLURM configuration help

2024-04-04 Thread Renfro, Michael via slurm-users
What does “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID” show? On one job we currently have that’s pending due to Resources, that job has requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but the node it wants to run on only has 37 CPUs available (seen by…
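The comparison described above can be done directly with those two commands; the job ID is a placeholder:

```shell
# Node side: configured vs. currently allocated trackable resources
scontrol show node cusco | grep -E 'CfgTRES|AllocTRES'

# Job side: what the pending job asks for, and why it is still waiting
scontrol show job <PENDING_JOB_ID> | grep -E 'ReqTRES|Reason'
```

Free CPUs are the CfgTRES cpu count minus the AllocTRES cpu count, and likewise for memory; the job can only start once its ReqTRES fits inside that difference.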

[slurm-users] Re: SLURM configuration for LDAP users

2024-02-04 Thread Renfro, Michael via slurm-users
“An LDAP user can login to the login, slurmctld and compute nodes, but when they try to submit jobs, slurmctld logs an error about invalid account or partition for user.” Since I don’t think it was mentioned below, does a non-LDAP user get the same error, or does it work by default? We don’t u…
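If the accounting database turns out to be the difference between the two kinds of users, one common check is whether the user has an association at all; when AccountingStorageEnforce includes associations, a user missing from the database gets exactly this kind of submit error. The account and user names below are hypothetical:

```shell
# Does the accounting database know this user at all?
sacctmgr show user someuser withassoc

# If not, create the association that job submission requires
sacctmgr add user someuser account=someaccount
```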