[slurm-users] Re: single node configuration
On Tue, 2024-04-09 at 11:07:32 -0700, Slurm users wrote:

> Hi everyone, I'm conducting some tests. I've just set up SLURM on the head
> node and haven't added any compute nodes yet. I'm trying to test it to
> ensure it's working, but I'm encountering an error: 'Nodes required for the
> job are DOWN, DRAINED, or reserved for jobs in higher priority partitions.'
>
> [stsadmin@head ~]$ squeue
>   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>       6       lab test_slu stsadmin PD  0:00     1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

What does "sinfo" tell you? Is there a running slurmd?

- S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
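For completeness, the kind of quick check I had in mind (assuming a systemd-managed setup; adjust the service handling to your site):

```
# on the controller: how does slurmctld see the nodes, and why?
sinfo -N -l
scontrol show node

# on the node that should run the job: is slurmd alive at all?
systemctl status slurmd
# if not, run it in the foreground with verbose logging to see what's wrong:
slurmd -D -vvv
```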
[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64
On Mon, 2024-05-06 at 11:38:30 +0100, Slurm users wrote:

> Hello,
>
> I instructed the port to use binutils from ports (version 2.40 native)
> instead of base:
>
> `/usr/local/bin/ld: unrecognised emulation mode: elf_aarch64`
>
> ```
> /usr/local/bin/ld -V | grep aarch64
>    aarch64cloudabi
>    aarch64cloudabib
>    aarch64elf
>    aarch64elf32
>    aarch64elf32b
>    aarch64elfb
>    aarch64fbsd
>    aarch64fbsdb
>    aarch64haiku
>    aarch64linux
>    aarch64linux32
>    aarch64linux32b
>    aarch64linuxb
>    aarch64pe
> ```
>
> Any clues about the "elf_aarch64" vs. "aarch64elf" mismatch?

This looks (I admit, I haven't UTSL) as if the emulation mode name is
constructed from an "elf_" prefix plus the architecture nickname. That works
for "x86_64" and "i386", since the "ld" for the Intel/AMD architectures
indeed provides the emulations "elf_x86_64" and "elf_i386", while for 64-bit
ARM "elf" is used as a suffix. So this is mainly an ld inconsistency, I'm
afraid (which might be fixed by adding alias names - but I wouldn't hold my
breath).

Non-emulated builds shouldn't be affected by the issue you found, right?
(There is Slurm built for ARM64 Debian. Maybe they have patched the source?)

I can imagine two ways to get this fixed:
(a) find the place where the emulation mode name is assembled, and teach it
    about possible exceptions to the implemented rule (there may be more
    than just ARM - what about RISC-V, PPC64*, ...?)
(b) interrupt the build in a reasonable place, find all occurrences of the
    wrong emulation string, and replace it with its existing counterpart
    (see the sketch below)

There should be no doubt which one I'd prefer - I'll go and read TS ;)

Cheers,
 Steffen
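P.S. To make option (b) concrete, something along these lines might work after the failing link step (the paths, and whether "aarch64elf" really is the right counterpart, are assumptions to be verified against your build tree):

```
# find the files that carry the non-existent emulation name ...
grep -rl 'elf_aarch64' .
# ... and replace it with the emulation this ld actually offers
# (BSD sed wants an explicit backup-suffix argument for -i):
grep -rl 'elf_aarch64' . | xargs sed -i '' 's/elf_aarch64/aarch64elf/g'
```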
[slurm-users] Re: How to exclude master from computing? Set to DRAINED?
On Mon, 2024-06-24 at 13:54:43 +0200, Slurm users wrote:

> Dear Slurm users,
>
> in our project we exclude the master from computing before starting
> slurmctld. We used to exclude the master from computing by simply not
> mentioning it in the configuration, i.e. just not having:
>
>    PartitionName=SomePartition Nodes=master
>
> or something similar. Apparently, this is not the way to do this, as it
> now results in a fatal error
>
>    fatal: Unable to determine this slurmd's NodeName

You're attempting to start the slurmd - which isn't required on this
machine, as you say. Disable it. Keep slurmctld enabled (and declared in
the config).

> therefore, my question:
>
>    What is the best practice for excluding the master node from work?

Not defining it as a worker node.

> I personally primarily see the option to set the node into DOWN, DRAINED
> or RESERVED.

These states are slurmd states, and therefore meaningless for a machine
that doesn't have a running slurmd. (It's the nodes that are defined in the
config that are supposed to be able to run slurmd.)

> So is DRAINED the correct setting in such a case?

Since this only applies to a node that has been defined in the config, and
you (correctly) didn't do so, there's no need (and no means) to "drain" it.

Best
 Steffen
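P.S. Reduced to commands and config lines, that boils down to something like this (node names, counts and sizes are placeholders, not a recommendation for your site):

```
# on the master/head node: run the controller, but no worker daemon
systemctl enable --now slurmctld
systemctl disable --now slurmd

# in slurm.conf: declare only the real compute nodes, never the master
NodeName=node[01-10] CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=SomePartition Nodes=node[01-10] Default=YES MaxTime=INFINITE State=UP
```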
[slurm-users] Background tasks in Slurm scripts?
Good morning,

yesterday I came across a Slurm (sbatch) script that, after doing some
stuff in the foreground, runs another executable in the background - and
doesn't "wait" for it to finish. Literally the last line of the script is

    executable &

(and that executable is supposed to take several tens of seconds or more
to finish).

How would Slurm handle this? Will the end of the script immediately trigger
the job epilog, and what would happen to the leftover task?

This is certainly discussed somewhere in the manual pages and other
documentation, but so far I have failed to find that place...

Thanks,
 Steffen
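For reference, this is the pattern in question, reduced to a sketch (the job options and executable names are made up); adding a final "wait" would of course make the question moot:

```
#!/bin/bash
#SBATCH --job-name=bg-test
#SBATCH --time=00:10:00

./do_foreground_stuff

# the script currently ends like this - no "wait":
./long_running_executable &

# with a "wait" here, the batch step would not finish before the
# background process does, and the question wouldn't arise
```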
[slurm-users] Re: Background tasks in Slurm scripts?
On Fri, 2024-07-26 at 10:42:45 +0300, Slurm users wrote:

> Good Morning;
>
> This is not a Slurm issue, this is default shell behaviour. If you want
> the script to wait until all background processes have finished, you
> should add a "wait" command at the end.

Thank you - I already knew this in principle, and I also know that a login
shell will complain at an attempt to exit when there are leftover
background jobs. I was wondering, though, how Slurm's task control would
react... I'll have to try it myself, I guess...

Best,
 S
[slurm-users] Re: Slurm fails before nvidia-smi command
On Mon, 2024-07-29 at 11:23:12 +0300, Slurm users wrote:

> Hi there all,
>
> We have a Dell server with 2 x Nvidia H100 and are running Slurm on it.
> After restarting the server, Slurm fails unless we run the nvidia-smi
> command first. When we run "nvidia-smi && systemctl restart slurmd &&
> systemctl restart slurmctld", the Slurm queue starts working. Do you have
> any idea about this error and what we can do about it?

Apparently the nvidia driver doesn't get loaded on reboot?

There are multiple ways to fix that - add the module to /etc/modules, run
"modprobe nvidia" via a @reboot crontab entry (or even run nvidia-smi that
way)...

Best,
 Steffen
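P.S. Sketched out, the two variants I mean (file locations may differ between distributions):

```
# variant 1: have the module loaded at boot time
echo nvidia >> /etc/modules            # Debian-style; systemd systems
                                       # use /etc/modules-load.d/*.conf

# variant 2: root crontab entry that initialises the devices at reboot,
# before slurmd needs them (add via "crontab -e"):
@reboot /usr/bin/nvidia-smi
```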
[slurm-users] Find out submit host of past job?
Hello everyone,

I've grepped the manual pages and crawled the 'net, but couldn't find any
answer to the following problem: how do I find out which host a job that
has already finished was submitted from?

I can see that the ctld keeps a record of it below /var/spool/slurm - as
long as the job is running or waiting (and shown by "squeue") - and that
this record stores the environment that contains SLURM_SUBMIT_HOST - but
this information seems to be lost when the job finishes.

Is there a way to find out what the value of SLURM_SUBMIT_HOST was? I'd be
interested in a few more env variables, but this one would be sufficient
for a start...

Is "sacct" just lacking a job field, or is this info indeed dropped and not
stored in the DB?

Thanks,
 Steffen
[slurm-users] Re: Find out submit host of past job?
On Wed, 2024-08-07 at 08:55:21 -0400, Slurm users wrote:

> Warning on that one, it can eat up a ton of database space (depending on
> size of environment, uniqueness of environment between jobs, and number of
> jobs). We had it on and it nearly ran us out of space on our database host.
> That said, the data can be really useful depending on the situation.
>
> -Paul Edmon-
>
> On 8/7/2024 8:51 AM, Juergen Salk via slurm-users wrote:
> > Hi Steffen,
> >
> > not sure if this is what you are looking for, but with
> > `AccountingStoreFlags=job_env` set in slurm.conf, the batch job
> > environment will be stored in the accounting database and can later be
> > retrieved with the `sacct -j <jobid> --env-vars` command.

On Wed, 2024-08-07 at 14:56:30 +0200, Slurm users wrote:

> What you're looking for might be doable simply by setting the
> AccountingStoreFlags parameter in slurm.conf. [1]
>
> Be aware, though, that job_env has sometimes been reported to grow quite
> large.

I see - I can't have my cake and eat it too. Given the size of our users'
typical environment, I'm dropping the idea for now - maybe this will come
up again in the not-so-far future. (Maybe it's worth a feature request?)

Thanks everyone!

- Steffen
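P.S. For the archives, the suggested combination in one place, should we pick the idea up again later (the jobid is a placeholder):

```
# slurm.conf - mind the database growth Paul warned about:
AccountingStoreFlags=job_env

# afterwards, for a finished job:
sacct -j <jobid> --env-vars | grep SLURM_SUBMIT_HOST
```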
[slurm-users] Re: error: Unable to contact slurm controller (connect failure)
Hi Daniel,

> error: Unable to contact slurm controller (connect failure)
>
> I appreciate any insight on what could be the cause.

Can you check that the slurmctld is up and running, and that the said
commands work on the controller machine itself?

If the slurmctld cannot be started as a service, try to run it in verbose
debug mode (-D -vvv) and find out what might be wrong with it. If it runs
in the foreground, check the systemd service again.

Proceed to the compute nodes only when you are sure that the ctld is OK.

(IIRC there was a flag in the systemd service definition that had to be
adjusted after an upgrade - maybe you're hitting the same?)

Best,
 Steffen
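P.S. Spelled out, the debug cycle I was suggesting (assuming a systemd-managed controller):

```
# stop the service and run the controller in the foreground, verbosely:
systemctl stop slurmctld
slurmctld -D -vvv

# once it stays up, check from the controller host itself:
scontrol ping
sinfo
```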
[slurm-users] Re: slurm nodes showing down*
Hi,

On Sun, 2024-12-08 at 21:57:11 +0000, Slurm users wrote:

> I have just rebuilt all my nodes and I see

Did they ever work before with Slurm? (Which version?)

> Only 1 & 2 seem available?
> While 3~6 are not

Either you didn't wait long enough (5 minutes should be sufficient), or the
"down*" nodes don't have a slurmd that talks to the slurmctld. The reasons
for the latter can only be speculated about.

> 3's log:
>
> [root@node3 log]# tail slurmd.log
> [2024-12-08T21:45:51.250] CPU frequency setting not configured for this node
> [2024-12-08T21:45:51.251] slurmd version 20.11.9 started
> [2024-12-08T21:45:51.252] slurmd started on Sun, 08 Dec 2024 21:45:51 +0000
> [2024-12-08T21:45:51.252] CPUs=20 Boards=1 Sockets=20 Cores=1 Threads=1 Memory=48269 TmpDisk=23324 Uptime=30 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Does this match (or exceed, for Memory and TmpDisk) the node declaration
known by the slurmctld?

> And 7 doesn't want to talk to the controller.
>
> [root@node7 slurm]# sinfo
> slurm_load_partitions: Zero Bytes were transmitted or received

Does it have munge running, with the right key? I've seen this message when
authorization was lost.

> These are all rebuilt, and 1~3 are identical and 4~7 are identical.

Are the node declarations also identical, respectively? Do they show the
same features in slurmd.log?

> [root@vuwunicoslurmd1 slurm]# sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      2  idle* node[1-2]
> debug*       up   infinite      4  down* node[3-6]

What you see here is what the slurmctld sees.

The usual procedure to debug this is to run the daemons that don't
cooperate in debug mode. Stop their services, start them manually one by
one (ctld first), then watch whether they talk to each other, and if they
don't, learn what stops them from doing so - then iterate: edit the config,
"scontrol reconfig", lather, rinse, repeat.

You're the only one who knows your node configuration lines (NodeName=...),
so we can't help any further. Ole's pages perhaps can.

Best,
 S
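P.S. In concrete commands, the iteration described above (node3 stands in for any "down*" node):

```
# on the controller:
systemctl stop slurmctld
slurmctld -D -vvv

# on a "down*" node:
systemctl stop slurmd
slurmd -D -vvv

# compare what the hardware reports with what the config declares:
slurmd -C                 # prints a NodeName=... line matching this node
scontrol show node node3  # what the slurmctld believes about it

# after fixing slurm.conf (identically on all machines):
scontrol reconfigure
```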
[slurm-users] Re: Permission denied for slurmdbd.conf
On Sat, 2024-12-28 at 22:59:45 -0000, Slurm users wrote:

> ls -ls /usr/local/slurm/etc/slurmdbd.conf
> 4 -rw------- 1 slurm slurm 497 Dec 28 16:34 /usr/local/slurm/etc/slurmdbd.conf
>
> sudo -u slurm /usr/local/slurm/sbin/slurmdbd -Dvvv
>
> slurmdbd: error: s_p_parse_file: unable to read "/usr/local/slurm/etc/slurmdbd.conf": Permission denied
> slurmdbd: fatal: Could not open/read/parse slurmdbd.conf file /usr/local/slurm/etc/slurmdbd.conf

What are the permissions of the directory hosting the file (and of the
full tree leading there)?

    ls -ld /usr/local/slurm/etc

Best,
 Steffen
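P.S. A quick way to check every component of the path in one go (namei is part of util-linux):

```
namei -l /usr/local/slurm/etc/slurmdbd.conf

# or, step by step:
ls -ld /usr /usr/local /usr/local/slurm /usr/local/slurm/etc
```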
[slurm-users] Re: formatting node names
On Mon, 2025-01-06 at 12:55:12 -0700, Slurm users wrote:

> Hi all,
> I remember seeing on this list a slurm command to change a slurm-friendly
> list such as
>
>    gpu[01-02],node[03-04,12-22,27-32,36]
>
> into a bash-friendly list such as
>
>    gpu01
>    gpu02
>    node03
>    node04
>    node12
>    etc

I always forget that one as well ("scontrol show hostlist" works in the
opposite direction), but I have a workaround at hand:

    pdsh -w gpu[01-02],node[03-04,12-22,27-32,36] -N -R exec echo %h

You may add "-f 1" if you prefer sorted output. (I tend to pipe the output
through "xargs" most of the time, too.)

Best,
 Steffen
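P.S. For the record, the built-in command I keep forgetting, next to its counterpart (the hostlist below is the one from your mail):

```
# expand a hostlist expression into one name per line:
scontrol show hostnames 'gpu[01-02],node[03-04,12-22,27-32,36]'

# the opposite direction, folding names back into a hostlist:
scontrol show hostlist gpu01,gpu02,node03,node04
```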
[slurm-users] Re: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
On Sat, 2025-01-04 at 08:11:21 -0000, Slurm users wrote:

> JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>    26       cpu myscript    user1 PD  0:00     4 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
>
> Can anyone help fix this?

Not without a little bit of extra information, e.g. the output of
"sinfo -p cpu" and maybe "scontrol show job=26".

Best,
 Steffen
[slurm-users] Re: Unexpected node got allocation
On Thu, 2025-01-09 at 07:51:40 -0500, Slurm users wrote:

> Hello there and good morning from Baltimore.
>
> I have a small cluster with 100 nodes. When the cluster is completely empty
> of all jobs, the first job gets allocated to node 41. In other clusters,
> the first job gets allocated to node 01. If I specify node 01, the
> allocation works perfectly. I have my partition NodeName set as
> node[01-99], so having node41 used first is a surprise to me. We also have
> many other partitions which start with node41, but the partition being used
> for the allocation starts with node01.
>
> Does anyone know what would cause this?

Just a wild guess, but do you have a topology.conf file that somehow makes
this node look most reasonable to use for a single-node job?

(Topology attempts to assign, or hold back, sections of your network to
maximize interconnect bandwidth for multi-node jobs. Your node41 might be
one - or the first one of a series - whose use would leave bigger chunks
untouched for bigger jobs.)

HTH,
 Steffen
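P.S. Purely to illustrate the guess - a topology/tree setup with invented switch names and node ranges; with such a layout the scheduler may well steer a single-node job towards whichever leaf leaves the larger blocks intact:

```
# slurm.conf
TopologyPlugin=topology/tree

# topology.conf - hypothetical layout, three leaf switches under one spine
SwitchName=leaf1 Nodes=node[01-40]
SwitchName=leaf2 Nodes=node[41-80]
SwitchName=leaf3 Nodes=node[81-99]
SwitchName=spine Switches=leaf[1-3]
```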
[slurm-users] Re: setting up slurmdbd (fail)
On Tue, 2025-03-04 at 01:03:00 +0000, Slurm users wrote:

> I am trying to add slurmdbd to my first attempt at slurmctld.
>
> I have mariadb 10.11 running and permissions set.
>
> MariaDB [(none)]> CREATE DATABASE slurm_acct_db;
> Query OK, 1 row affected (0.000 sec)
>
> MariaDB [(none)]> show databases;
> +--------------------+
> | Database           |
> +--------------------+
> | information_schema |
> | slurm_acct_db      |
> +--------------------+
>
> Following the setup at
> https://slurm.schedmd.com/accounting.html#mysql-configuration
>
> When I try to start slurmdbd it fails.
>
> [root@vuwunicoslurmd3 ~]# systemctl status slurmdbd
> ○ slurmdbd.service - Slurm DBD accounting daemon
>      Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; disabled; preset: disabled)
>      Active: inactive (dead)
> [root@vuwunicoslurmd3 ~]# systemctl enable --now slurmdbd
> Created symlink /etc/systemd/system/multi-user.target.wants/slurmdbd.service → /usr/lib/systemd/system/slurmdbd.service.
> [root@vuwunicoslurmd3 ~]# systemctl status slurmdbd
> ○ slurmdbd.service - Slurm DBD accounting daemon
>      Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; preset: disabled)
>      Active: inactive (dead)
>   Condition: start condition failed at Tue 2025-03-04 00:54:38 UTC; 1s ago
>              └─ ConditionPathExists=/etc/slurm/slurmdbd.conf was not met

TIL about the "--now" option to "systemctl enable"... thanks for this one! ;)
Although I admit I prefer a step-by-step approach (and I'd only enable a
unit once it has been started successfully, to avoid complaints at
reboot)...

You wrote that you configured MySQL but didn't mention the slurmdbd config.
Does the file that is being complained about exist (on that machine)?

> So there seems to be a hole in the guide. Some config is needed?

To be honest, I've been following Ole's detailed setup instructions since
Adam and Eve - not the ones directly from the horse's mouth. Whatever, I'd
first try to track down that ConditionPathExists issue...

Best,
 Steffen
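P.S. In case it helps: the rough shape of a minimal slurmdbd.conf (all values are placeholders - the slurmdbd.conf man page and Ole's wiki have the authoritative details). It has to exist where the service unit expects it, be owned by the Slurm user, and be mode 600:

```
# /etc/slurm/slurmdbd.conf
DbdHost=localhost
SlurmUser=slurm
AuthType=auth/munge
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=change_me
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurmdbd.pid
```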