[slurm-users] Re: single node configuration
On Tue, 2024-04-09 at 11:07:32 -0700, Slurm users wrote:

> Hi everyone, I'm conducting some tests. I've just set up SLURM on the head
> node and haven't added any compute nodes yet. I'm trying to test it to
> ensure it's working, but I'm encountering an error: 'Nodes required for the
> job are DOWN, DRAINED, or reserved for jobs in higher priority partitions.'
>
> [stsadmin@head ~]$ squeue
>   JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>       6       lab test_slu stsadmin PD  0:00     1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

What does "sinfo" tell you? Is there a running slurmd?

- S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
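For completeness, the kind of quick check I had in mind (assuming a systemd-managed setup; adjust the service handling to your site):

```
# on the controller: how does slurmctld see the nodes, and why?
sinfo -N -l
scontrol show node

# on the node that should run the job: is slurmd alive at all?
systemctl status slurmd
# if not, run it in the foreground with verbose logging to see what's wrong:
slurmd -D -vvv
```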
[slurm-users] Re: FreeBSD/aarch64: ld: error: unknown emulation: elf_aarch64
On Mon, 2024-05-06 at 11:38:30 +0100, Slurm users wrote:

> Hello,
>
> I instructed the port to use binutils from ports (version 2.40 native)
> instead of base:
>
> `/usr/local/bin/ld: unrecognised emulation mode: elf_aarch64`
>
> ```
> /usr/local/bin/ld -V | grep aarch64
>    aarch64cloudabi
>    aarch64cloudabib
>    aarch64elf
>    aarch64elf32
>    aarch64elf32b
>    aarch64elfb
>    aarch64fbsd
>    aarch64fbsdb
>    aarch64haiku
>    aarch64linux
>    aarch64linux32
>    aarch64linux32b
>    aarch64linuxb
>    aarch64pe
> ```
>
> Any clues about the "elf_aarch64" vs. "aarch64elf" mismatch?

This looks (I admit, I haven't UTSL) as if the emulation mode name is
constructed from an "elf_" prefix plus the architecture nickname. That works
for "x86_64" and "i386", since the "ld" for the Intel/AMD architectures
indeed provides the emulations "elf_x86_64" and "elf_i386", while for 64-bit
ARM "elf" is used as a suffix. So this is mainly an ld inconsistency, I'm
afraid (which might be fixed by adding alias names - but I wouldn't hold my
breath).

Non-emulated builds shouldn't be affected by the issue you found, right?
(There is Slurm built for ARM64 Debian. Maybe they have patched the source?)

I can imagine two ways to get this fixed:
(a) find the place where the emulation mode name is assembled, and teach it
    about possible exceptions to the implemented rule (there may be more
    than just ARM - what about RISC-V, PPC64*, ...?)
(b) interrupt the build in a reasonable place, find all occurrences of the
    wrong emulation string, and replace it with its existing counterpart
    (see the sketch below)

There should be no doubt which one I'd prefer - I'll go and read TS ;)

Cheers,
 Steffen
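P.S. To make option (b) concrete, something along these lines might work after the failing link step (the paths, and whether "aarch64elf" really is the right counterpart, are assumptions to be verified against your build tree):

```
# find the files that carry the non-existent emulation name ...
grep -rl 'elf_aarch64' .
# ... and replace it with the emulation this ld actually offers
# (BSD sed wants an explicit backup-suffix argument for -i):
grep -rl 'elf_aarch64' . | xargs sed -i '' 's/elf_aarch64/aarch64elf/g'
```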
[slurm-users] Re: How to exclude master from computing? Set to DRAINED?
On Mon, 2024-06-24 at 13:54:43 +0200, Slurm users wrote:

> Dear Slurm users,
>
> in our project we exclude the master from computing before starting
> slurmctld. We used to exclude the master from computing by simply not
> mentioning it in the configuration, i.e. just not having:
>
>    PartitionName=SomePartition Nodes=master
>
> or something similar. Apparently, this is not the way to do this, as it
> now results in a fatal error
>
>    fatal: Unable to determine this slurmd's NodeName

You're attempting to start the slurmd - which isn't required on this
machine, as you say. Disable it. Keep slurmctld enabled (and declared in
the config).

> therefore, my question:
>
>    What is the best practice for excluding the master node from work?

Not defining it as a worker node.

> I personally primarily see the option to set the node into DOWN, DRAINED
> or RESERVED.

These states are slurmd states, and therefore meaningless for a machine
that doesn't have a running slurmd. (It's the nodes that are defined in the
config that are supposed to be able to run slurmd.)

> So is DRAINED the correct setting in such a case?

Since this only applies to a node that has been defined in the config, and
you (correctly) didn't do so, there's no need (and no means) to "drain" it.

Best
 Steffen
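P.S. Reduced to commands and config lines, that boils down to something like this (node names, counts and sizes are placeholders, not a recommendation for your site):

```
# on the master/head node: run the controller, but no worker daemon
systemctl enable --now slurmctld
systemctl disable --now slurmd

# in slurm.conf: declare only the real compute nodes, never the master
NodeName=node[01-10] CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=SomePartition Nodes=node[01-10] Default=YES MaxTime=INFINITE State=UP
```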
[slurm-users] Background tasks in Slurm scripts?
Good morning,

yesterday I came across a Slurm (sbatch) script that, after doing some
stuff in the foreground, runs another executable in the background - and
doesn't "wait" for it to finish. Literally the last line of the script is

    executable &

(and that executable is supposed to take several tens of seconds or more
to finish).

How would Slurm handle this? Will the end of the script immediately trigger
the job epilog, and what would happen to the leftover task?

This is certainly discussed somewhere in the manual pages and other
documentation, but so far I have failed to find that place...

Thanks,
 Steffen
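For reference, this is the pattern in question, reduced to a sketch (the job options and executable names are made up); adding a final "wait" would of course make the question moot:

```
#!/bin/bash
#SBATCH --job-name=bg-test
#SBATCH --time=00:10:00

./do_foreground_stuff

# the script currently ends like this - no "wait":
./long_running_executable &

# with a "wait" here, the batch step would not finish before the
# background process does, and the question wouldn't arise
```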
[slurm-users] Re: Background tasks in Slurm scripts?
On Fri, 2024-07-26 at 10:42:45 +0300, Slurm users wrote:

> Good Morning;
>
> This is not a Slurm issue, this is default shell behaviour. If you want
> the script to wait until all background processes have finished, you
> should add a "wait" command at the end.

Thank you - I already knew this in principle, and I also know that a login
shell will complain at an attempt to exit when there are leftover
background jobs. I was wondering, though, how Slurm's task control would
react... I'll have to try it myself, I guess...

Best,
 S
[slurm-users] Re: Slurm fails before nvidia-smi command
On Mon, 2024-07-29 at 11:23:12 +0300, Slurm users wrote:

> Hi there all,
>
> We have a Dell server with 2 x Nvidia H100 and are running Slurm on it.
> After restarting the server, Slurm fails unless we run the nvidia-smi
> command first. When we run "nvidia-smi && systemctl restart slurmd &&
> systemctl restart slurmctld", the Slurm queue starts working. Do you have
> any idea about this error and what we can do about it?

Apparently the nvidia driver doesn't get loaded on reboot?

There are multiple ways to fix that - add the module to /etc/modules, run
"modprobe nvidia" via a @reboot crontab entry (or even run nvidia-smi that
way)...

Best,
 Steffen
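P.S. Sketched out, the two variants I mean (file locations may differ between distributions):

```
# variant 1: have the module loaded at boot time
echo nvidia >> /etc/modules            # Debian-style; systemd systems
                                       # use /etc/modules-load.d/*.conf

# variant 2: root crontab entry that initialises the devices at reboot,
# before slurmd needs them (add via "crontab -e"):
@reboot /usr/bin/nvidia-smi
```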
[slurm-users] Find out submit host of past job?
Hello everyone,

I've grepped the manual pages and crawled the 'net, but couldn't find any
answer to the following problem: how do I find out which host a job that
has already finished was submitted from?

I can see that the ctld keeps a record of it below /var/spool/slurm - as
long as the job is running or waiting (and shown by "squeue") - and that
this record stores the environment that contains SLURM_SUBMIT_HOST - but
this information seems to be lost when the job finishes.

Is there a way to find out what the value of SLURM_SUBMIT_HOST was? I'd be
interested in a few more env variables, but this one would be sufficient
for a start...

Is "sacct" just lacking a job field, or is this info indeed dropped and not
stored in the DB?

Thanks,
 Steffen
[slurm-users] Re: Find out submit host of past job?
On Wed, 2024-08-07 at 08:55:21 -0400, Slurm users wrote:

> Warning on that one, it can eat up a ton of database space (depending on
> size of environment, uniqueness of environment between jobs, and number of
> jobs). We had it on and it nearly ran us out of space on our database host.
> That said, the data can be really useful depending on the situation.
>
> -Paul Edmon-
>
> On 8/7/2024 8:51 AM, Juergen Salk via slurm-users wrote:
> > Hi Steffen,
> >
> > not sure if this is what you are looking for, but with
> > `AccountingStoreFlags=job_env` set in slurm.conf, the batch job
> > environment will be stored in the accounting database and can later be
> > retrieved with the `sacct -j <jobid> --env-vars` command.

On Wed, 2024-08-07 at 14:56:30 +0200, Slurm users wrote:

> What you're looking for might be doable simply by setting the
> AccountingStoreFlags parameter in slurm.conf. [1]
>
> Be aware, though, that job_env has sometimes been reported to grow quite
> large.

I see - I can't have my cake and eat it too. Given the size of our users'
typical environment, I'm dropping the idea for now - maybe this will come
up again in the not-so-far future. (Maybe it's worth a feature request?)

Thanks everyone!

- Steffen
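P.S. For the archives, the suggested combination in one place, should we pick the idea up again later (the jobid is a placeholder):

```
# slurm.conf - mind the database growth Paul warned about:
AccountingStoreFlags=job_env

# afterwards, for a finished job:
sacct -j <jobid> --env-vars | grep SLURM_SUBMIT_HOST
```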
[slurm-users] Re: error: Unable to contact slurm controller (connect failure)
Hi Daniel,

> error: Unable to contact slurm controller (connect failure)
>
> I appreciate any insight on what could be the cause.

Can you check that the slurmctld is up and running, and that the said
commands work on the controller machine itself?

If the slurmctld cannot be started as a service, try to run it in verbose
debug mode (-D -vvv) and find out what might be wrong with it. If it runs
in the foreground, check the systemd service again.

Proceed to the compute nodes only when you are sure that the ctld is OK.

(IIRC there was a flag in the systemd service definition that had to be
adjusted after an upgrade - maybe you're hitting the same?)

Best,
 Steffen
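P.S. Spelled out, the debug cycle I was suggesting (assuming a systemd-managed controller):

```
# stop the service and run the controller in the foreground, verbosely:
systemctl stop slurmctld
slurmctld -D -vvv

# once it stays up, check from the controller host itself:
scontrol ping
sinfo
```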
[slurm-users] Re: slurm nodes showing down*
Hi,

On Sun, 2024-12-08 at 21:57:11 +0000, Slurm users wrote:

> I have just rebuilt all my nodes and I see

Did they ever work before with Slurm? (Which version?)

> Only 1 & 2 seem available?
> While 3~6 are not

Either you didn't wait long enough (5 minutes should be sufficient), or the
"down*" nodes don't have a slurmd that talks to the slurmctld. The reasons
for the latter can only be speculated about.

> 3's log:
>
> [root@node3 log]# tail slurmd.log
> [2024-12-08T21:45:51.250] CPU frequency setting not configured for this node
> [2024-12-08T21:45:51.251] slurmd version 20.11.9 started
> [2024-12-08T21:45:51.252] slurmd started on Sun, 08 Dec 2024 21:45:51 +0000
> [2024-12-08T21:45:51.252] CPUs=20 Boards=1 Sockets=20 Cores=1 Threads=1 Memory=48269 TmpDisk=23324 Uptime=30 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

Does this match (or exceed, for Memory and TmpDisk) the node declaration
known by the slurmctld?

> And 7 doesn't want to talk to the controller.
>
> [root@node7 slurm]# sinfo
> slurm_load_partitions: Zero Bytes were transmitted or received

Does it have munge running, with the right key? I've seen this message when
authorization was lost.

> These are all rebuilt, and 1~3 are identical and 4~7 are identical.

Are the node declarations also identical, respectively? Do they show the
same features in slurmd.log?

> [root@vuwunicoslurmd1 slurm]# sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      2  idle* node[1-2]
> debug*       up   infinite      4  down* node[3-6]

What you see here is what the slurmctld sees.

The usual procedure to debug this is to run the daemons that don't
cooperate in debug mode. Stop their services, start them manually one by
one (ctld first), then watch whether they talk to each other, and if they
don't, learn what stops them from doing so - then iterate: edit the config,
"scontrol reconfig", lather, rinse, repeat.

You're the only one who knows your node configuration lines (NodeName=...),
so we can't help any further. Ole's pages perhaps can.

Best,
 S
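P.S. In concrete commands, the iteration described above (node3 stands in for any "down*" node):

```
# on the controller:
systemctl stop slurmctld
slurmctld -D -vvv

# on a "down*" node:
systemctl stop slurmd
slurmd -D -vvv

# compare what the hardware reports with what the config declares:
slurmd -C                 # prints a NodeName=... line matching this node
scontrol show node node3  # what the slurmctld believes about it

# after fixing slurm.conf (identically on all machines):
scontrol reconfigure
```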
[slurm-users] Re: Permission denied for slurmdbd.conf
On Sat, 2024-12-28 at 22:59:45 -0000, Slurm users wrote:

> ls -ls /usr/local/slurm/etc/slurmdbd.conf
> 4 -rw------- 1 slurm slurm 497 Dec 28 16:34 /usr/local/slurm/etc/slurmdbd.conf
>
> sudo -u slurm /usr/local/slurm/sbin/slurmdbd -Dvvv
>
> slurmdbd: error: s_p_parse_file: unable to read "/usr/local/slurm/etc/slurmdbd.conf": Permission denied
> slurmdbd: fatal: Could not open/read/parse slurmdbd.conf file /usr/local/slurm/etc/slurmdbd.conf

What are the permissions of the directory hosting the file (and of the
full tree leading there)?

    ls -ld /usr/local/slurm/etc

Best,
 Steffen
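P.S. A quick way to check every component of the path in one go (namei is part of util-linux):

```
namei -l /usr/local/slurm/etc/slurmdbd.conf

# or, step by step:
ls -ld /usr /usr/local /usr/local/slurm /usr/local/slurm/etc
```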
[slurm-users] Re: formatting node names
On Mon, 2025-01-06 at 12:55:12 -0700, Slurm users wrote:

> Hi all,
> I remember seeing on this list a slurm command to change a slurm-friendly
> list such as
>
>    gpu[01-02],node[03-04,12-22,27-32,36]
>
> into a bash-friendly list such as
>
>    gpu01
>    gpu02
>    node03
>    node04
>    node12
>    etc

I always forget that one as well ("scontrol show hostlist" works in the
opposite direction), but I have a workaround at hand:

    pdsh -w gpu[01-02],node[03-04,12-22,27-32,36] -N -R exec echo %h

You may add "-f 1" if you prefer sorted output. (I tend to pipe the output
through "xargs" most of the time, too.)

Best,
 Steffen
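P.S. For the record, the built-in command I keep forgetting, next to its counterpart (the hostlist below is the one from your mail):

```
# expand a hostlist expression into one name per line:
scontrol show hostnames 'gpu[01-02],node[03-04,12-22,27-32,36]'

# the opposite direction, folding names back into a hostlist:
scontrol show hostlist gpu01,gpu02,node03,node04
```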
[slurm-users] Re: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions
On Sat, 2025-01-04 at 08:11:21 -0000, Slurm users wrote:

> JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>    26       cpu myscript    user1 PD  0:00     4 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
>
> Can anyone help fix this?

Not without a little bit of extra information, e.g. the output of
"sinfo -p cpu" and maybe "scontrol show job=26".

Best,
 Steffen
[slurm-users] Re: Unexpected node got allocation
On Thu, 2025-01-09 at 07:51:40 -0500, Slurm users wrote:

> Hello there and good morning from Baltimore.
>
> I have a small cluster with 100 nodes. When the cluster is completely empty
> of all jobs, the first job gets allocated to node 41. In other clusters,
> the first job gets allocated to node 01. If I specify node 01, the
> allocation works perfectly. I have my partition NodeName set as
> node[01-99], so having node41 used first is a surprise to me. We also have
> many other partitions which start with node41, but the partition being used
> for the allocation starts with node01.
>
> Does anyone know what would cause this?

Just a wild guess, but do you have a topology.conf file that somehow makes
this node look most reasonable to use for a single-node job?

(Topology attempts to assign, or hold back, sections of your network to
maximize interconnect bandwidth for multi-node jobs. Your node41 might be
one - or the first one of a series - whose use would leave bigger chunks
untouched for bigger jobs.)

HTH,
 Steffen
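P.S. Purely to illustrate the guess - a topology/tree setup with invented switch names and node ranges; with such a layout the scheduler may well steer a single-node job towards whichever leaf leaves the larger blocks intact:

```
# slurm.conf
TopologyPlugin=topology/tree

# topology.conf - hypothetical layout, three leaf switches under one spine
SwitchName=leaf1 Nodes=node[01-40]
SwitchName=leaf2 Nodes=node[41-80]
SwitchName=leaf3 Nodes=node[81-99]
SwitchName=spine Switches=leaf[1-3]
```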
[slurm-users] Re: setting up slurmdbd (fail)
On Tue, 2025-03-04 at 01:03:00 +0000, Slurm users wrote:

> I am trying to add slurmdbd to my first attempt at slurmctld.
>
> I have mariadb 10.11 running and permissions set.
>
> MariaDB [(none)]> CREATE DATABASE slurm_acct_db;
> Query OK, 1 row affected (0.000 sec)
>
> MariaDB [(none)]> show databases;
> +--------------------+
> | Database           |
> +--------------------+
> | information_schema |
> | slurm_acct_db      |
> +--------------------+
>
> Following the setup at
> https://slurm.schedmd.com/accounting.html#mysql-configuration
>
> When I try to start slurmdbd it fails.
>
> [root@vuwunicoslurmd3 ~]# systemctl status slurmdbd
> ○ slurmdbd.service - Slurm DBD accounting daemon
>      Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; disabled; preset: disabled)
>      Active: inactive (dead)
> [root@vuwunicoslurmd3 ~]# systemctl enable --now slurmdbd
> Created symlink /etc/systemd/system/multi-user.target.wants/slurmdbd.service → /usr/lib/systemd/system/slurmdbd.service.
> [root@vuwunicoslurmd3 ~]# systemctl status slurmdbd
> ○ slurmdbd.service - Slurm DBD accounting daemon
>      Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; preset: disabled)
>      Active: inactive (dead)
>   Condition: start condition failed at Tue 2025-03-04 00:54:38 UTC; 1s ago
>              └─ ConditionPathExists=/etc/slurm/slurmdbd.conf was not met

TIL about the "--now" option to "systemctl enable"... thanks for this one! ;)
Although I admit I prefer a step-by-step approach (and I'd only enable a
unit once it has been started successfully, to avoid complaints at
reboot)...

You wrote that you configured MySQL but didn't mention the slurmdbd config.
Does the file that is being complained about exist (on that machine)?

> So there seems to be a hole in the guide. Some config is needed?

To be honest, I've been following Ole's detailed setup instructions since
Adam and Eve - not the ones directly from the horse's mouth. Whatever, I'd
first try to track down that ConditionPathExists issue...

Best,
 Steffen
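P.S. In case it helps: the rough shape of a minimal slurmdbd.conf (all values are placeholders - the slurmdbd.conf man page and Ole's wiki have the authoritative details). It has to exist where the service unit expects it, be owned by the Slurm user, and be mode 600:

```
# /etc/slurm/slurmdbd.conf
DbdHost=localhost
SlurmUser=slurm
AuthType=auth/munge
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=change_me
StorageLoc=slurm_acct_db
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurmdbd.pid
```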