We only do the isolated setup on the students’ VirtualBox installations because it’s simpler for
them to get started with. Our production HPC with OpenHPC is definitely
integrated with our Active Directory (directly via sssd, not with an
intermediate product), etc. Not everyone does it that way, but our scale is
Late to the party here, but depending on how much time you have invested, how
much you can tolerate reformats or other more destructive work, etc., you might
consider OpenHPC and its install guide ([1] for RHEL 8 variants, [2] or [3] for
RHEL 9 variants, depending on which version of Warewulf yo
As Thomas had mentioned earlier in the thread, there is --exclusive with no
extra additions. But that’d prevent *every* other job from running on that
node, which, unless this is a cluster for you and you alone, sounds like wasting
90% of the resources. I’d be most perturbed at a user doing that
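For completeness, that node-wide reservation is just the bare flag in a job script (script contents are placeholders):

    #!/bin/bash
    #SBATCH --exclusive      # no other job, mine or anyone else’s, may share the node
    srun ./my_program        # hypothetical binary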
I’ve never done this myself, but others probably have. At the end of [1],
there’s an example of making a generic resource for bandwidth. You could set
that to any convenient units (bytes/second or bits/second, most likely), and
assign your nodes a certain amount. Then any network-intensive job c
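A rough sketch of that approach, loosely following the bandwidth example in [1] (node names, counts, and units are assumptions):

    # slurm.conf
    GresTypes=bandwidth
    NodeName=node[01-10] Gres=bandwidth:10G

    # gres.conf on each node
    Name=bandwidth Count=10G Flags=CountOnly

    # a network-intensive job then reserves part of a node's total
    sbatch --gres=bandwidth:5G netjob.sh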
I’ll start with the question of “why spread the jobs out more than required?”
and move on to why the other items didn’t work:
1. --exclusive only ensures that others’ jobs don’t run on a node with your
jobs, and does nothing about other jobs you own.
2. --spread-job distributes the work of on
Not so much about the source, but in the sbatch documentation [1], I think the
--begin and --nodes parameters might interact. And yes, this is semi-educated
speculation on my part.
From the nodes= section, “The job will be allocated as many nodes as possible
within the range specified and witho
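As an illustration of the combination in question (values are only examples):

    sbatch --begin=now+1hour --nodes=2-4 job.sh
    # with a node range, Slurm may settle for the low end rather than delay the start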
Since nobody replied after this: if the nodes are incapable of running the jobs
due to insufficient resources, the default “EnforcePartLimits=NO” [1] might be
the issue. That setting can allow a job to stay queued even if it’s impossible
to run.
[1] https://slurm.schedmd.com/slurm.conf.html
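If that turns out to be the cause, a possible change (assuming you’d rather have such jobs rejected at submission) is:

    # slurm.conf
    EnforcePartLimits=ALL    # reject jobs at submit time if they exceed any partition limit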
Do you have backfill scheduling [1] enabled? If so, what settings are in place?
Lower-priority jobs will be eligible for backfill only if they don’t delay the
start of the higher-priority jobs.
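For context, a minimal sketch of the relevant slurm.conf lines (parameter values are assumptions; your defaults may already be fine):

    SchedulerType=sched/backfill
    SchedulerParameters=bf_window=1440,bf_continue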
So what kind of resources and time does a given array job require? Odds are,
they
In theory, if jobs are pending with “Priority”, one or more other jobs will be
pending with “Resources”.
So a few questions:
1. What are the “Resources” jobs waiting on, resource-wise?
2. When are they scheduled to start?
3. Can your array jobs backfill into the idle resources and fini
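One way to answer the first two questions (the format string is just an example):

    # pending jobs with their reason and expected start time
    squeue --states=PENDING --format="%.12i %.9P %.10u %.12r %.20S"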
I don’t have any 21.08 systems to verify with, but that’s how I remember it.
Use “sshare -a -A mic” to verify. You should see both a RawShares and a
NormShares column for each user. By default they’ll all have the same value,
but they can be adjusted if needed.
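If they do need adjusting, something along these lines should work (user name and value are placeholders):

    sshare -a -A mic                                                  # check RawShares/NormShares
    sacctmgr modify user name=someuser account=mic set fairshare=2    # give one user a larger share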
It may be difficult to narrow down the problem without knowing what commands
you're running inside the salloc session. For example, if it's a pure OpenMP
program, it can't use more than one node.
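For instance, a pure OpenMP run can only use the cores of a single node, so an allocation like this is as wide as it gets (binary name is a placeholder):

    salloc --nodes=1 --cpus-per-task=16
    OMP_NUM_THREADS=16 srun --cpus-per-task=16 ./openmp_program   # all 16 threads land on one node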
At a certain point, you’re talking about workflow orchestration. Snakemake [1]
and its slurm executor plugin [2] may be a starting point, especially since
Snakemake is a local-by-default tool. I wouldn’t try reproducing your entire
“make” workflow in Snakemake. Instead, I’d define the roughly 60
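As a starting point, the plugin install and a hedged invocation might look like this (the job count is an assumption based on the roughly-60 figure):

    pip install snakemake-executor-plugin-slurm
    snakemake --executor slurm --jobs 60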
Forgot to add that Debian/Ubuntu packages are pretty much whatever version was
stable at the time of the Debian/Ubuntu .0 release. They’ll backport security
fixes to those older versions as needed, but they never change versions unless
absolutely required.
The backports repositories may have lo
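To see what your particular release and its backports offer, something like:

    apt-cache policy slurm-wlm    # shows the candidate version and which repo (including backports) provides it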
Debian/Ubuntu sources can always be found in at least two ways:
1. Pages like https://packages.ubuntu.com/jammy/slurm-wlm (see the .dsc,
.orig.tar.gz, and .debian.tar.xz links there).
2. Commands like ‘apt-get source slurm-wlm’ (may require ‘dpkg-dev’ or other
packages – probably easiest
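For the second route, a typical sequence (assuming deb-src entries are enabled in your sources list) is:

    sudo apt-get install dpkg-dev    # provides dpkg-source for unpacking
    apt-get source slurm-wlm         # fetches and unpacks the .dsc, .orig, and .debian pieces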
Yep, from your scontrol show node output:
CfgTRES=cpu=64,mem=2052077M,billing=64
AllocTRES=cpu=1,mem=2052077M
The running job (77) has allocated 1 CPU and all the memory on the node. That’s
probably due to the partition using the default DefMemPerCPU value [1], which
is unlimited.
Since all ou
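A possible fix, assuming you’d rather have memory scale with the CPUs requested (the value is an arbitrary example; size it to your nodes):

    # slurm.conf, globally or on the partition line
    DefMemPerCPU=4000    # MB per allocated CPU when a job doesn't specify --mem or --mem-per-cpu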
What does “scontrol show node cusco” and “scontrol show job PENDING_JOB_ID”
show?
On one job we currently have that’s pending due to Resources, that job has
requested 90 CPUs and 180 GB of memory as seen in its ReqTRES= value, but the
node it wants to run on only has 37 CPUs available (seen by
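The relevant fields can be pulled out with something like:

    scontrol show node cusco | grep -E 'CfgTRES|AllocTRES'
    scontrol show job PENDING_JOB_ID | grep -E 'ReqTRES|JobState|Reason'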
“An LDAP user can login to the login, slurmctld and compute nodes, but when
they try to submit jobs, slurmctld logs an error about invalid account or
partition for user.”
Since I don’t think it was mentioned below, does a non-LDAP user get the same
error, or does it work by default?
We don’t u
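If it does turn out to be LDAP-specific, one hedged guess is a missing association in the accounting database; a check along these lines might help (user and account names are placeholders):

    sacctmgr show assoc user=ldapuser format=cluster,account,user,partition
    # if nothing comes back, the user likely needs something like:
    sacctmgr add user ldapuser account=someaccount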