8,
"mode": "backup"
}
Other commands fail with:
"error_number": 1007,
"error": "Protocol authentication error",
I'll admit, I don't usually use sockets, so I could easily be
overlooking something there. Permissions on the socket look right. I am
getting JSON back, so it is connecting. Note: slurmrestd is running
under its own user (not root and not the SlurmUser).
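For example, a minimal socket test looks like this (socket path and API
version are illustrative):
curl -s --unix-socket /var/run/slurmrestd.socket 'http://localhost/slurm/v0.0.39/ping'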
Any ideas?
Thanks in advance,
Brian Andrus
, this generally gives ample
time to recover without issue.
Brian Andrus
On 2/20/2025 6:45 PM, hermes via slurm-users wrote:
Thank you for your insightful suggestions. Placing both slurmdbd and
slurmctld on the same node is indeed a new structure that we hadn’t
considered before, and it
daemons are down, then start the first. Once it is
up (you can run scontrol show config) start the second. Run 'scontrol
show config' again and you should see both daemons listed as 'up' at the
end of the output.
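For example (the exact status line varies by version):
systemctl start slurmctld          # on the primary controller
scontrol show config | tail -n 2   # should report the primary as UP
systemctl start slurmctld          # on the backup controller
scontrol show config | tail -n 2   # should now report both as UP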
-Brian Andrus
On 2/3/2025 7:29 PM, Steven Jones via slurm-users wrote:
ts to see what it is asking for that
does not exist 'scontrol show job xxx'
Brian Andrus
On 1/4/2025 3:41 AM, John Hearns via slurm-users wrote:
Output of sinfo and squeue
Look at slurmd log in an example node also
Tail -f is your friend
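For example (log path depends on SlurmdLogFile in your slurm.conf):
tail -f /var/log/slurm/slurmd.log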
On Sat, Jan 4, 2025, 8:13 AM sportlecon spor
Ensure cgroups is working and configured to limit access to devices
(which includes gpus).
Check your cgroup.conf to see that there is an entry for:
ConstrainDevices=yes
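A minimal sketch (the TaskPlugin line goes in slurm.conf and may already
be set on your cluster):
# cgroup.conf
ConstrainDevices=yes
# slurm.conf
TaskPlugin=task/cgroup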
Brian Andrus
On 1/3/2025 10:49 AM, Groner, Rob via slurm-users wrote:
I'm not entirely sure, and I can't
light about that.
Brian Andrus
You only have one partition named 'default'
You are not allowed to name it that. Name it something else and you
should be good.
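For example (node list and name illustrative):
PartitionName=main Nodes=node[01-08] Default=YES State=UP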
Brian Andrus
On 11/28/2024 6:52 AM, Patrick Begou via slurm-users wrote:
Hi Kent,
on your management node could you run:
systemctl status slurmctld
and
socket, numa, board, and node.
Brian Andrus
On 11/3/2024 12:06 AM, Bhaskar Chakraborty wrote:
Hi Brian,
Thanks for the response!
However, this particular approach where we need to accept whatever
slurm gives us as starting node
and deal with it accordingly doesn’t work for us.
I think there
stuff here
*)
Run all other stuff here
esac
Takes some coding effort but keeps control of the processes within your
own code.
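Fleshed out, the idea is roughly this (node name and commands are
illustrative):
#!/bin/bash
# branch on which node this task landed on
case "$(hostname -s)" in
    node01)
        # run the primary-node stuff here
        ;;
    *)
        # run all other stuff here
        ;;
esac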
Brian Andrus
On 10/30/24 09:35, Bhaskar Chakraborty via slurm-users wrote:
Hi,
Is there a way to change/control the primary node (i.e. where the
initial task start
files
that map GPUs to HPC jobs.
It does go on to show the conventions/format of the files.
I imagine you could have some bits in a prologue script that creates
that as the job starts on the node and point dcgm-exporter there.
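A rough prolog sketch, assuming the convention is one file per GPU index
containing the job id (check the dcgm-exporter docs for the exact format;
the directory is illustrative):
#!/bin/bash
MAPDIR=/run/dcgm-job-maps
mkdir -p "$MAPDIR"
# SLURM_JOB_GPUS holds the allocated GPU indices in the prolog environment
IFS=',' read -ra GPUS <<< "$SLURM_JOB_GPUS"
for g in "${GPUS[@]}"; do
    echo "$SLURM_JOB_ID" > "$MAPDIR/$g"
done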
Brian Andrus
On 10/16/24 06:10, Sylvain MARET via slurm-users
IIRC, you need to ensure reverse lookup for DNS matches your nodename
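For example (address and name illustrative):
host 10.1.1.15   # the PTR answer should match the NodeName, e.g. node01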
Brian Andrus
On 9/20/2024 4:55 PM, Jakub Szarlat via slurm-users wrote:
Hi
I'm using dynamic nodes with "slurmd -Z" with slurm 23.11.1.
Firstly I find that when you do "scontrol show node" it
Folks have addressed the obvious config settings, but also check your
prolog/epilog scripts/settings as well as the .bashrc/.bash_profile and
stuff in /etc/profile.d/
That may be hanging it up.
Brian Andrus
On 9/5/2024 5:17 AM, Loris Bennett via slurm-users wrote:
Hi,
With
$ salloc
others did nothing.
Brian Andrus
On 9/4/2024 1:37 AM, Angel de Vicente via slurm-users wrote:
Hello,
we found an issue with Slurm 24.05.1 and the MaxMemPerNode
setting. Slurm is installed in a single workstation, and thus, the
number of nodes is just 1.
The relevant sections in slurm.conf read
They
are more than happy to do that.
Brian Andrus
On 8/29/2024 11:48 PM, Matteo Guglielmi via slurm-users wrote:
I'm sorry, but I still don't get it.
Isn't --nodes=2,4 telling slurm to allocate 2 OR 4 nodes and nothing else?
So, if:
--nodes=2 allocates only two nodes
. Slurm does not give you 4 nodes because you only want 3 tasks.
You see the result in your variables:
SLURM_NNODES=3
SLURM_JOB_CPUS_PER_NODE=1(x3)
If you only want 2 nodes, make --nodes=2.
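For example:
#SBATCH --nodes=2
#SBATCH --ntasks=3   # the 3 tasks are packed onto the 2 requested nodes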
Brian Andrus
On 8/29/24 08:00, Matteo Guglielmi via slurm-users wrote:
Hi,
On sbatch's manpage
logs and check your conf see what your
defaults are.
Brian Andrus
On 8/29/2024 5:04 AM, Matteo Guglielmi via slurm-users wrote:
Hello,
I have a cluster with four Intel nodes (node[01-04], Feature=intel) and four
AMD nodes (node[05-08], Feature=amd).
# job file
#SBATCH --ntasks=3
#SBATCH
d, which would help.
If they are all exiting with exit code 9, you need to look at the code
for your a.out to see what code 9 means, as that is who is raising that
error. Slurm sees that and if it is non-zero, it interprets it as a
failed job.
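For example (job id illustrative):
sacct -j 12345 --format=JobID,State,ExitCode
# ExitCode is reported as <exit>:<signal>, e.g. 9:0 for an a.out that exited 9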
Brian Andrus
On 8/19/2024 12:50 AM, Arko Roy v
and ensure slurmd is happier.
Brian Andrus
On 8/14/24 17:52, Sid Young via slurm-users wrote:
G'Day all,
I've been upgrading my cluster from 20.11.0 in small steps to get to
24.05.2. Currently I have all nodes on 23.02.8, the controller on
24.05.2 and a single test node on 24.05.
If you need it, you could add it to either prologue or epilogue to store
the info somewhere.
I do that for the scripts themselves and keep the past two weeks backed
up so we can debug if/when there is an issue.
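A minimal epilog sketch ('scontrol write batch_script' fetches the job's
script; the backup path is illustrative):
#!/bin/bash
scontrol write batch_script "$SLURM_JOB_ID" "/backup/jobscripts/job${SLURM_JOB_ID}.sh"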
Brian Andrus
On 8/7/2024 6:29 AM, Steffen Grunewald via slurm-users wrote:
On
to be from a front-end system that interfaces with slurm
and does not seem to show the actual slurm jobid, unless those are the
274398, 274399, and 274400 numbers. If so, you could look in the
slurmctld logs for the jobs to see what may have happened.
Brian Andrus
On 8/6/2024 5:57 AM, Felix via
Generally speaking, when the batch script exits, slurm will clean up
(i.e. kill) any stray processes.
So, I would expect that executable to be killed at cleanup.
Brian Andrus
On 7/26/2024 2:45 AM, Steffen Grunewald via slurm-users wrote:
On Fri, 2024-07-26 at 10:42:45 +0300, Slurm users wrote
Martin,
In a nutshell, when slurmd starts, it tells that info to slurmctld. That
is the "registration" event mentioned.
Brian Andrus
On 7/19/2024 5:44 AM, Martin Lee via slurm-users wrote:
I've read the following in the slurm power saving docs:
https://slurm.schedmd.com/
You probably want to look at scontrol show node and scontrol show job
for that node and the jobs on it.
Compare those and you may find someone requested nearly all the resources
but is not actually using them. Look at the job itself to see what it
is trying to do.
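For example (node and job id illustrative):
scontrol show node node01 | grep -E 'CfgTRES|AllocTRES'
scontrol show job 12345 | grep -E 'NumNodes|NumCPUs|MinMemory'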
Brian Andrus
On 7/11
Jack,
To make sure things are set right, run 'slurmd -C' on the node and use
that output in your config.
It can also give you insight as to what is being seen on the node versus
what you may expect.
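Its output is a ready-to-paste node definition, shaped roughly like this
(numbers illustrative):
NodeName=node01 CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=257000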
Brian Andrus
On 7/10/2024 1:25 AM, jack.mellor--- via slurm-users wrote:
H
Just a thought.
Try specifying some memory. It looks like the running jobs do that, and
by default, if not specified, a job gets "all the memory on the node", so
this one can't start because some of it is taken.
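For example:
sbatch --mem=4G job.sh    # or put '#SBATCH --mem=4G' in the script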
Brian Andrus
On 7/4/2024 9:54 AM, Ricardo Cruz wrote:
Dear Brian,
Curre
at that moment.
Brian Andrus
On 7/4/2024 8:43 AM, Ricardo Cruz via slurm-users wrote:
Greetings,
There are not many questions regarding GPU sharding here, and I am
unsure if I am using it correctly... I have configured it according to
the instructions <https://slurm.schedmd.com/gres.html>
going to run in. Because there are multiple possible dependencies/uses,
this is best.
Brian Andrus
On 6/20/2024 1:38 PM, Carl Ponder via slurm-users wrote:
We're seeing SLURM misbehaving on one of our clusters, which runs
Ubuntu 22.04.
Among other problems, we see an error-me
Well, if I am reading this right, it makes sense.
Every job will need at least 1 core just to run and if there are only 4
cores on the machine, one would expect a max of 4 jobs to run.
Brian Andrus
On 6/20/2024 5:24 AM, Arnuld via slurm-users wrote:
I have a machine with a quad-core CPU and
That SIGTERM message means something is telling slurmdbd to quit.
Check your cron jobs, maintenance scripts, etc. Slurmdbd is being told
to shutdown. If you are running in the foreground, a ^C does that. If
you run a kill or killall on it, you will get that same message.
Brian Andrus
On 5
Oh, to address the passed train:
Restore the archive data with "sacctmgr archive load", then you can do
as you need.
From man sacctmgr:
archive {dump|load}
Write database information to a flat file or load information that
has previously been written to a file.
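For example (file path illustrative):
sacctmgr archive load File=/var/spool/slurm/archive/job_archive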
Brian Andrus
Instead of using the archive files, couldn't you query the db directly
for the info you need?
I would recommend sacct/sreport if those can get the info you need.
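For example (date range illustrative):
sreport cluster AccountUtilizationByUser start=2024-01-01 end=2024-02-01 -t hours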
Brian Andrus
On 5/28/2024 9:59 AM, O'Neal, Doug (NIH/NCI) [C] via slurm-users wrote:
My organization needs to access hi
I would guess either you install GPU drivers on the non-GPU nodes or
build slurm without GPU support for that to work due to package
dependencies.
Both viable options. I have done installs where we just don't compile
GPU support in and that is left to the users to manage.
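For example, with recent spec files NVML support is opt-in at build time
(version illustrative):
rpmbuild -ta slurm-24.05.2.tar.bz2               # no GPU/NVML support
rpmbuild -ta --with nvml slurm-24.05.2.tar.bz2   # builds against the NVIDIA libraries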
Brian Andrus
versions are compatible, they can work together.
You will need to be aware of differences for jobs and configs, but it is
possible.
Brian Andrus
On 5/22/2024 12:45 AM, Arnuld via slurm-users wrote:
We have several nodes, most of which have different Linux
distributions (distro for short
Rike,
Assuming the data, scripts and other dependencies are already on the
cluster, you could just ssh and execute the sbatch command in a single
shot: ssh submitnode sbatch some_script.sh
It will ask for a password if appropriate and could use ssh keys to
bypass that need.
Brian Andrus
added an override file, that
will affect things.
Brian Andrus
On 4/19/2024 10:15 AM, Jeffrey Layton wrote:
I like it, however, it was working before without a slurm.conf in
/etc/slurm.
Plus the environment variable SLURM_CONF is pointing to the correct
slurm.conf file (the one in /cm
nf in /etc/slurm/ on the node(s).
Brian Andrus
On 4/19/2024 9:56 AM, Jeffrey Layton via slurm-users wrote:
Good afternoon,
I'm working on a cluster of NVIDIA DGX A100's that is using BCM 10
(Base Command Manager which is based on Bright Cluster Manager). I ran
into an error and only jus
will want to sync the config across all nodes and then 'scontrol
reconfigure'
You may want to look into configless if you can set DNS entries and your
config is basically monolithic or all parts are in /etc/slurm/
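For configless, a DNS SRV record along these lines is what slurmd looks
up (names illustrative):
_slurmctld._tcp 3600 IN SRV 10 0 6817 ctl.example.com.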
Brian Andrus
On 4/15/2024 2:55 AM, Xaver Stiensmeier via slurm-use
Yes. You can build the 8 rpms on 9. Look at 'mock' to do so. I did
similar when I still had to support EL7
Fairly generic plan, the devil is in the details and verifying each
step, but those are the basic bases you need to touch.
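For example (chroot config name varies with your mock install; version
illustrative):
rpmbuild -ts slurm-23.11.5.tar.bz2        # build the source rpm on the EL9 box
mock -r rocky-8-x86_64 --rebuild ~/rpmbuild/SRPMS/slurm-*.src.rpm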
Brian Andrus
On 4/10/2024 1:48 PM, Steve Berg
path to look at.
Brian Andrus
On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote:
Dear slurm user list,
we make use of elastic cloud computing i.e. node instances are created
on demand and are destroyed when they are not used for a certain amount
of time. Created instances are set up
name it has in the
slurm.conf file.
Also, a quick way to do the failover check is to run (from the backup
controller): scontrol takeover
Brian Andrus
On 3/25/2024 1:39 PM, Miriam Olmi wrote:
Hi Brian,
Thanks for replying.
In my first message I forgot to specify that the primary and the
Quick correction, it is StateSaveLocation, not SlurmSaveState.
Brian Andrus
On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
Dear all,
I am having trouble finalizing the configuration of the backup
controller for my slurm cluster.
In principle, if no job is running everything seems
Miriam,
You need to ensure the SlurmSaveState directory is the same for both.
And by 'the same', I mean all contents are exactly the same.
This is usually achieved by using a shared drive or replication.
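For example, in slurm.conf on both controllers (mount point illustrative):
StateSaveLocation=/shared/slurm/state   # e.g. an NFS export mounted on both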
Brian Andrus
On 3/25/2024 8:11 AM, Miriam Olmi via slurm-users wrote:
Dear
Wow, snazzy!
Looks very good. My compliments.
Brian Andrus
On 3/12/2024 11:24 AM, Victoria Hobson via slurm-users wrote:
Our website has gone through some much needed change and we'd love for
you to explore it!
The new SchedMD.com is equipped with the latest information about
Slurm,
Chip,
I use 'sacct' rather than sreport and get individual job data. That is
ingested into a db and PowerBI, which can then aggregate as needed.
sreport is pretty general and likely not the best for accurate
chargeback data.
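For example (date range illustrative):
sacct -a -P -S 2024-03-01 -E 2024-04-01 -o JobID,User,Account,Elapsed,AllocTRES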
Brian Andrus
On 3/4/2024 6:09 AM, Chip Seraphine via s
Joseph,
You will likely get many perspectives on this. I disable swap completely
on our compute nodes. I can be draconian that way. For the workflow
supported, this works and is a good thing.
Other workflows may benefit from swap.
Brian Andrus
On 3/3/2024 11:04 PM, John Joseph via slurm
oxy>
Brian Andrus
On 2/28/2024 12:54 PM, Dan Healy wrote:
Are most of us using HAProxy or something else?
On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users
wrote:
Magnus,
That is a feature of the load balancer. Most of them have that
these days.
Brian Andrus
Magnus,
That is a feature of the load balancer. Most of them have that these days.
Brian Andrus
On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote:
On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote:
for us, we put a load balancer in front of the
disconnection
for any reason even for X-based apps.
Personally, I don't care much for interactive sessions in HPC, but there
is a large body of users that only knows how to do things that way, so it is there.
Brian Andrus
On 2/26/2024 12:27 AM, Josef Dvoracek via slurm-users wrote:
What is the recommende
I imagine you could create a reservation for the node and then when you
are completely done, remove the reservation.
Each helper could then target the reservation for the job.
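For example (names illustrative):
scontrol create reservation ReservationName=helpers Nodes=node01 StartTime=now Duration=7-00:00:00 Users=alice,bob
sbatch --reservation=helpers job.sh
scontrol delete ReservationName=helpers   # when completely done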
Brian Andrus
On 2/9/2024 5:52 PM, Alan Stange via slurm-users wrote:
Chip,
Thank you for your prompt response. We
look elsewhere.
Brian Andrus
On 1/26/2024 6:38 AM, Michael Lewis wrote:
Hi All,
I’m trying to get slurm-23.11.3 running on Ubuntu 20.04 on a standalone
system. I’m running into an issue I cannot find the answer to. After
compiling and installing, when I fire up slurmctld
While I am not sure of your specifics, you could easily add lines to
your suspend/resume scripts to check/wait/etc if there are tasks waiting.
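A rough sketch for the front of a suspend script (slurm passes the
hostlist as $1):
#!/bin/bash
# leave nodes up while anything is still pending
if squeue --noheader --states=PENDING | grep -q .; then
    exit 0
fi
# ... normal power-down of the nodes in "$1" continues here ...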
Brian Andrus
On 1/15/2024 12:22 AM, 김종록 wrote:
Hello.
I'm going to use Slurm's cloud feature in private cloud.
The problem is that the
a submit/login node.
Brian Andrus
On 12/15/2023 2:00 AM, Felix wrote:
Hello
we are installing a new server with slurm on ALMA Linux 9.2
we did the following:
dnf install slurm
The result is
rpm -qa | grep slurm
slurm-libs-22.05.9-1.el9.x86_64
slurm-22.05.9-1.el9.x86_64
Now when trying to
filled on the node. You can run 'df -h' and see
some info that would get you started.
Brian Andrus
On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote:
Dear slurm-user list,
during a larger cluster run (the same I mentioned earlier 242 nodes), I
got the error "SlurmdSpoolDir full". T
While the functionality of being able to have "keep at
least X nodes up and idle" would be nice, that is not how I see this
documented or working.
Brian Andrus
On 11/23/2023 5:12 AM, Davide DelVento wrote:
Thanks for confirming, Brian. That was my understanding as well. Do
you have it working that
As I understand it, that setting means "Always have at least X nodes
up", which includes running jobs. So it stops any wait time for the
first X jobs being submitted, but any jobs after that will need to wait
for the power_up sequence.
Brian Andrus
On 11/22/2023 6:58 AM, David
Eg,
Could you be more specific as to what you want?
Is there a specific user you want to control, or no user should get more
than x cpus in the partition? Or no single job should get more than x cpus?
The details matter to determine the right approach and right settings.
Brian Andrus
On 11
your slurm users
belong to and add them to slurmdbd. Once they are in there, you can set
defaults with exceptions for specific users.
If you are only looking to have settings apply to all users, you don't
have to import the users. Set the QoS for the partition.
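For example (names and limit illustrative):
sacctmgr add qos interactive set MaxTRESPerUser=cpu=8
# then attach it in slurm.conf:
PartitionName=interactive Nodes=node[01-04] QOS=interactive State=UP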
Brian Andrus
On 11/20/2023 1:
How do you 'manually create a directory'? That is where the root
ownership would be coming from. After creating it, you can
chown/chmod it as well.
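For example (path and user illustrative; match your SlurmdUser/SlurmUser):
mkdir -p /var/spool/slurmd
chown slurm:slurm /var/spool/slurmd
chmod 0755 /var/spool/slurmd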
Brian Andrus
On 11/18/2023 7:35 AM, Arsene Marian Alain wrote:
Dear slurm community,
I run slurm 21.08.1 under Rocky Linux
Vlad,
Actually, it looks like it is working. You are using v0.39 for the
parser, which is trying to use OpenAPI calls. Unless you compiled with
OpenAPI, that won't work.
Try using the 0.37 version and you may see a simpler result that is
successful.
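For example (socket path illustrative):
curl -s --unix-socket /var/run/slurmrestd.socket 'http://localhost/slurm/v0.0.37/diag'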
Brian Andrus
On 6/28/2023 11:
can squeeze in before the additional node for Job B is expected to
be available, so it runs on the idle node.
Brian Andrus
On 6/26/2023 3:48 PM, Reed Dier wrote:
Hoping this will be an easy one for the community.
The priority schema was recently reworked for our cluster, with only
Priori
delineate what node can do what (a node-locked
license, for example). Then you can send a job to a specific subset of
nodes.
Quite a few other ways to design the ability you describe, but separate
clusters is not one of them.
Brian Andrus
On 6/26/2023 6:11 AM, mohammed shambakey wrote:
Hi
Jus
Second that.
Prometheus+slurm exporter+grafana works great.
Brian Andrus
On 6/12/2023 8:20 AM, Josef Dvoracek wrote:
> But I'd be interested to see what other places do.
we installed this: https://github.com/vpenso/prometheus-slurm-exporter
and scrape this exporter with "inpu
Make sure you have configured the RebootProgram in slurm.conf, that it
exists on the nodes and is executable by the user.
This is usually /sbin/reboot
Brian Andrus
On 6/7/2023 7:50 AM, Heinz, Michael wrote:
Hey, all.
So I added slurmdbd to our slurm-23.02 install and made my account an
That output of slurmd -C is your answer.
Slurmd only sees 6GB of memory and you are claiming it has 10GB.
I would run some memtests, look at meminfo on the node, etc.
Maybe even check that the type/size of memory in there is what you think
it is.
Brian Andrus
On 5/25/2023 7:30 AM, Roger
Defaulting to a shell for salloc is a newer feature.
For your version, you should:
srun -n 1 -t 00:10:00 --mem=1G --pty bash
Brian Andrus
On 5/19/2023 8:24 AM, Ryan Novosielski wrote:
I’m not at a computer, and we run an older version of Slurm yet so I
can’t say with 100% confidence that
jobs are
running on.
Brian Andrus
On 5/17/2023 10:49 AM, Groner, Rob wrote:
I'm not sure what you mean by "if they have the permissions". I'm
talking about someone who is specifically designated as "coordinator"
of an account in slurm. With that designation
you need to preempt running jobs, that would take a bit more effort
to set up, but is an alternative.
Brian Andrus
On 5/17/2023 6:40 AM, Groner, Rob wrote:
I was asked to see if coordinators could do anything in this scenario:
* Within the account that they coordinated, User A submitted 1000s
ut counts in a comma separated list (e.g
"nid[10-20]:4,nid[80-90]:2"). By default no nodes are excluded. This
value may be updated with scontrol. See
ReconfigFlags=KeepPowerSaveSettings for setting persistence.
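For example, to keep at least 2 of a node range exempt from suspension:
SuspendExcNodes=node[01-10]:2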
Brian Andrus
On 5/12/2023 2:35 AM, Xaver Stiensmeier wrote:
D
Something I have been impressed with is Netdata.
It is in the standard repositories and will auto-detect quite a bit of
things on a node. It is great for real-time monitoring of a node/job.
I also use Prometheus and Grafana for historic data (anything over 5
minutes).
Brian Andrus
On 5/5
ente wrote:
Hello,
Brian Andrus writes:
Ole is spot on with his federated suggestion. That is exactly what fits the bill
for you, given your requirements. You can have everything you want, but you
don't get to have it how you want (separate databases).
When/If you looked deeper into it, you wi
different part of the world and trying to federate them
in a performant manner was prohibitively expensive.
Brian Andrus
On 4/29/2023 10:53 PM, Angel de Vicente wrote:
Hi Ole,
Ole Holm Nielsen writes:
Maybe you want to use Slurm federated clusters with a single database
thanks for
the HA database. One would be primary and the other a
failover (AccountingStorageBackupHost). Although, technically, they
would both be able to be active at the same time.
Brian Andrus
On 4/13/2023 2:49 AM, Shaghuf Rahman wrote:
Hi,
I am setting up Slurmdb in my system and I need some inputs
user exists on the node, however you are propagating
the uids.
Brian Andrus
On 4/11/2023 9:48 AM, Jason Simms wrote:
Hello all,
Regularly I'm seeing array jobs fail, and the only log info from the
compute node is this:
[2023-04-11T11:41:12.336] error: /opt/slurm/prolog.sh: exited
few things:
[Unit]
After=autofs.service getty.target sssd.service
That makes it wait for all of those before trying to start.
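For example, as a drop-in so package updates don't clobber it:
# /etc/systemd/system/slurmd.service.d/override.conf
[Unit]
After=autofs.service getty.target sssd.service
# then: systemctl daemon-reload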
Brian Andrus
On 3/10/2023 7:41 AM, Tristan LEFEBVRE wrote:
Hello to all,
I'm trying to do an installation of Slurm with cgroupv2 activated.
But I'm facing
the
node ensure the shared filesystems are mounted before allowing jobs.
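A rough prolog check (mount point illustrative; a non-zero exit keeps the
job from starting on the node):
#!/bin/bash
mountpoint -q /shared || exit 1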
-Brian Andrus
On 3/6/2023 1:15 AM, Niels Carl W. Hansen wrote:
Hi all
Seems there still are some issues with the autofs -
job_container/tmpfs functionality in Slurm 23.02.
If the required directories aren't mounted o
do as well.
I would be interested in any alternatives. Could you point me to some docs?
Best wishes
Gizo
Brian Andrus
On 2/28/2023 7:44 AM, Gizo Nanava wrote:
Hello,
it seems that if a slurm power saving is enabled then the parameter
"Weight" seem to be ignored for nodes tha
get (resource-wise) and how do you want
to use them?
Brian Andrus
On 2/28/2023 9:49 AM, Jake Jellinek wrote:
Hi all
I come from a SGE/UGE background and am used to the convention that I can qrsh
to a node and, from there, start a new qrsh to a different node with different
parameters.
I
You may be able to use the alternate approach that I was able to do as well.
Brian Andrus
On 2/28/2023 7:44 AM, Gizo Nanava wrote:
Hello,
it seems that if a slurm power saving is enabled then the parameter
"Weight" seem to be ignored for nodes that are in a power down state.
Is
most jobs.
Perhaps there is some additional lines that could be added to the job
that would do a call to a snakemake API and report itself? Or maybe such
an API could be created/expanded.
Just a quick 2 cents (We may be up to a few dollars with all of those so
far).
Brian Andrus
On 2/27/202
formance answer lies in how any of the processes
work, which is why some of us do so many experimental runs of jobs and
gather timings. We have yet to see a 100% efficient process, but folks
are improving things all the time.
Brian Andrus
On 2/13/2023 9:56 PM, Diego Zuccato wrote:
I think tha
efficient HPC jobs. The goal is that every process is utilizing the CPU
as close to 100% as possible, which would render hyper-threading moot.
Brian Andrus
On 2/13/2023 12:15 AM, Hermann Schwärzler wrote:
Hi Sebastian,
I am glad I could help (although not exactly as expected :-).
With
commands
are xterm, a shell script containing srun commands, and srun (see the
EXAMPLES section). If no command is specified, then salloc runs the
user's default shell.
Brian Andrus
On 2/8/2023 7:01 AM, Jeffrey T Frey wrote:
You may need srun to allocate a pty for the command.
Then cluster_run.sh would call sbatch along with the appropriate commands.
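A minimal sketch of such a wrapper (name from the thread; options
illustrative):
#!/bin/bash
# cluster_run.sh: wrap the user's script in an sbatch submission
exec sbatch --partition=batch "$@"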
Brian Andrus
On 2/7/2023 9:31 AM, Groner, Rob wrote:
I'm trying to setup the capability where a user can execute:
$: sbatch script_to_run.sh
and the end result is that a job is created on a node, and that job
wi
y with the new (known good) config.
Brian Andrus
On 1/17/2023 12:36 PM, Groner, Rob wrote:
So, you have two equal sized clusters, one for test and one for
production? Our test cluster is a small handful of machines compared
to our production.
We have a test slurm control node on a test cl
ready.
Brian Andrus
On 1/4/2023 9:22 AM, Groner, Rob wrote:
We currently have a test cluster and a production cluster, all on the
same network. We try things on the test cluster, and then we gather
those changes and make a change to the production cluster. We're
doing that through two diffe
lurm/slurm.conf"*/
You can change those as needed. This made it listen on port 8081 only
(no socket and not 6820)
I was then able to just use curl on port 8081 to test things.
Hope that helps.
Brian Andrus
On 12/29/2022 6:49 AM, Chris Stackpole wrote:
Greetings,
Thanks for responding
I suspect if you delete /var/lib/slurmrestd.socket and then start
slurmrestd, it will create it as the user you need it to be.
Or just change the owner of it to the slurmrestd owner.
I have been running slurmrestd as a separate user for some time.
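For example, as a systemd drop-in (user/group names illustrative):
# /etc/systemd/system/slurmrestd.service.d/override.conf
[Service]
User=slurmrestd
Group=slurmrestd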
Brian Andrus
On 12/28/2022 3:20 PM, Chris
Seems like the time may have been off on the db server at the insert/update.
You may want to dump the database, find what tables/records need to be
updated and try updating them. If anything goes south, you could restore from
the dump.
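For example (slurm_acct_db is the default database name; adjust to your
StorageLoc):
mysqldump slurm_acct_db > slurm_acct_db.sql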
Brian Andrus
On 12/20/2022 11:51 AM, Reed Dier wrote:
Just to
Try:
sacctmgr list runawayjobs
Brian Andrus
On 12/20/2022 7:54 AM, Reed Dier wrote:
Hoping this is a fairly simple one.
This is a small internal cluster that we’ve been using for about 6
months now, and we’ve had some infrastructure instability in that
time, which I think may be the
the many articles, wikis and videos
out there.
TLDR; If you are going to be running efficient HPC jobs, you are indeed
better off with HT turned off.
Brian Andrus
On 12/13/2022 8:03 AM, Gary Mansell wrote:
Hi, thanks for getting back to me.
I have been doing some more experimenting, and I
assigned to it. Also check the state of the nodes with 'sinfo'
It would also be good to ensure the node settings are right. Run 'slurmd
-C' on a node and see if the output matches what is in the config.
Brian Andrus
On 12/13/2022 1:38 AM, Gary Mansell wrote:
Dear Slurm Us
You may want to look here:
https://slurm.schedmd.com/heterogeneous_jobs.html
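A minimal heterogeneous job sketch on recent versions (component sizes
and binaries illustrative):
#!/bin/bash
#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=16
#SBATCH hetjob
#SBATCH --nodes=2 --ntasks=8 --cpus-per-task=2
srun --het-group=0 ./leader : --het-group=1 ./worker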
Brian Andrus
On 12/7/2022 12:42 AM, Le, Viet Duc wrote:
Dear slurm community,
I am encountering a unique situation where I need to allocate jobs to
nodes with different numbers of CPU cores. For instance
I successfully built it for Rocky straight from the tgz file as usual
with rpmbuild -ta
Brian Andrus
On 12/2/2022 9:21 AM, David Thompson wrote:
Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8
Slurm cluster. We would like to be able to use the sbatch –prefer
option
ed to submit at all? The reservation method can cause an sbatch
command to be rejected, if that is what you are looking for.
Brian Andrus
On 11/30/2022 6:29 AM, Richard Ems wrote:
Hi all,
I have to change our set up to be able to update the total number of
available licenses due to users che
Steve,
I suspect you did not install the packages.
You need to install slurm-slurmctld to get the slurmctld systemd files:
# rpm -qlp slurm-slurmctld-20.11.9-1.el7.x86_64.rpm
/run/slurm/slurmctld.pid
/usr/lib/systemd/system/slurmctld.service
/usr/sbin/slurmctld
reset/recreate it.
That addresses even a botched software change.
Brian Andrus
On 11/23/2022 5:11 AM, Xaver Stiensmeier wrote:
Hello slurm-users,
The question can be found in a similar fashion here:
https://stackoverflow.com/questions/74529491/slurm-handling-nodes-that-fail-to-power-up-in-a
processing
data. There are many ways to do that, but those designs fall under
MariaDB and not Slurm.
Brian Andrus
On 11/1/2022 6:49 PM, Richard Chang wrote:
Does it mean it is best to use a single slurmdbd host in my case?
My primary slurmctld is the backup slurmdbd host, and my worry is if
t
Ole,
Fair enough, it is actually slurmctld that does the caching. Technical
typo on my part there.
Just trying to let the user know, there is a window that they have to
ensure no information is lost during a database outage.
Brian Andrus
On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote:
Hi
It caches up to a point. As I understand it, that is about an hour
(depending on size and how busy the cluster is, as well as available
memory, etc).
Brian Andrus
On 10/31/2022 9:20 PM, Richard Chang wrote:
Hi,
Just for my info, I would like to know what happens when SlurmDBD
loses
YMMV, but if you aren't having excessive traffic to the
share, you should be good. I have yet to discover what would be
excessive enough to impact things.
The only use I have had for the HA is being able to keep the cluster
running/happy during maintenance.
Brian Andrus
On 10/24/2022 1: