Hi.
I'm no expert, but it seems ChatGPT is confusing "queued" and "running"
jobs. I'm assuming you are interested in temporarily shutting down the
slurmctld node for maintenance.
If the jobs are still queued (i.e. not yet running), what do you need to
save? The queue order is dynamically adjusted by slurmctld based on the
selected factors, so there's nothing special to save.
For the running jobs, OTOH, you have multiple solutions:
1) drain the cluster: safest but often impractical
2) checkpoint: seems fragile, especially if jobs span multiple nodes
3) have a second slurmctld node (a small VM is sufficient) that takes over
cluster management when the master node is down (be *sure* the state
dir is shared and quite fast!)
4) just hope you'll be able to recover the slurmctld node before a job
completes *and* the timeouts expire
While 4 is relatively risky (you could end up with runaway jobs that
you'll have to fix afterwards), it does not directly impact users: their
jobs will run and complete/fail regardless of slurmctld state. At most
the users won't receive a completion mail and they will be billed less
than expected.
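
To get a feel for how risky option 4 is for a given maintenance window,
here is a minimal sketch, not a definitive tool. It assumes you already
have, for each running job, the time left before its time limit (for
example from something like `squeue -h -t RUNNING -o "%i %L"`, converted
to seconds). The names RunningJob and jobs_at_risk and the sample
numbers are made up for illustration. Since jobs can end before their
time limit, the result is only a lower bound on the jobs affected, and
it does not model the slurm.conf timeouts at all.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    job_id: int
    remaining_s: int  # seconds left before the job's time limit (assumed input)


def jobs_at_risk(jobs, downtime_s):
    """Return IDs of jobs whose time limit falls inside the planned outage.

    These jobs are certain to end while slurmctld is down; others may
    still end early, so treat this as a lower bound.
    """
    return [j.job_id for j in jobs if j.remaining_s <= downtime_s]


# Hypothetical sample data: three running jobs, a 30-minute outage.
jobs = [RunningJob(101, 600), RunningJob(102, 7200), RunningJob(103, 1800)]
print(jobs_at_risk(jobs, downtime_s=1800))
```

If the list is empty and the outage is short, option 4 is probably
tolerable; otherwise draining or a backup controller looks safer.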
Diego
On 10/02/2023 20:06, Analabha Roy wrote:
Hi,
I'm having some complex issues coordinating OpenMPI, SLURM, and DMTCP in
my cluster. On a whim, I logged into ChatGPT and asked the AI about it.
It told me things that I couldn't find in the current version of the
SLURM docs (I looked). Since ChatGPT is not always reliable, I reproduce
the contents of my chat session in my GitHub repository for peer review
and commentary by you fine folks.
https://github.com/hariseldon99/buparamshavak/blob/main/chatgpt.md
I apologize for the poor formatting. I did this in a hurry, and my
knowledge of markdown is rudimentary.
Please do comment on the veracity and reliability of the AI's response.
AR
--
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: dan...@utexas.edu, a...@phys.buruniv.ac.in,
hariseldo...@gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786