[slurm-users] gres definitions

2020-12-14 Thread john abignail
Hi,

I have GRES defined in my partition definition. If I assign some bogus GRES to a
node, the partition stops working, so somehow Slurm and the OS must agree on the
GRES installed. How do I find out all of the named GRES in my system, e.g.
specific CPU types, not just "cpu"?

Thanks,

John
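
A quick way to enumerate the GRES names Slurm knows about — a sketch, assuming
standard Slurm client tools are on the PATH:

scontrol show config | grep -i GresTypes   # GRES types enabled in slurm.conf
sinfo -N -o "%N %G"                        # GRES attached to each node
scontrol show nodes | grep -i gres         # per-node Gres= lines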


[slurm-users] slurmctld daemon error

2020-12-14 Thread Alpha Experiment
Hi,

I am trying to run Slurm on Fedora 33. Upon boot the slurmd daemon is running
correctly; however, the slurmctld daemon always fails.
[admin@localhost ~]$ systemctl status slurmd.service
● slurmd.service - Slurm node daemon
     Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: disabled)
     Active: active (running) since Mon 2020-12-14 16:02:18 PST; 11min ago
   Main PID: 2363 (slurmd)
      Tasks: 2
     Memory: 3.4M
        CPU: 211ms
     CGroup: /system.slice/slurmd.service
             └─2363 /usr/local/sbin/slurmd -D
Dec 14 16:02:18 localhost.localdomain systemd[1]: Started Slurm node daemon.

[admin@localhost ~]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─override.conf
     Active: failed (Result: exit-code) since Mon 2020-12-14 16:02:12 PST; 11min ago
    Process: 1972 ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 1972 (code=exited, status=1/FAILURE)
        CPU: 21ms
Dec 14 16:02:12 localhost.localdomain systemd[1]: Started Slurm controller daemon.
Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Dec 14 16:02:12 localhost.localdomain systemd[1]: slurmctld.service: Failed with result 'exit-code'.

The slurmctld log is as follows:
[2020-12-14T16:02:12.731] slurmctld version 20.11.1 started on cluster cluster
[2020-12-14T16:02:12.739] No memory enforcing mechanism configured.
[2020-12-14T16:02:12.772] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-14T16:02:12.772] error: slurm_set_addr: Unable to resolve "localhost"
[2020-12-14T16:02:12.772] error: slurm_get_port: Address family '0' not supported
[2020-12-14T16:02:12.772] error: _set_slurmd_addr: failure on localhost
[2020-12-14T16:02:12.772] Recovered state of 1 nodes
[2020-12-14T16:02:12.772] Recovered information about 0 jobs
[2020-12-14T16:02:12.772] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2020-12-14T16:02:12.779] Recovered state of 0 reservations
[2020-12-14T16:02:12.779] read_slurm_conf: backup_controller not specified
[2020-12-14T16:02:12.779] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2020-12-14T16:02:12.779] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2020-12-14T16:02:12.779] Running as primary controller
[2020-12-14T16:02:12.780] No parameter for mcs plugin, default values set
[2020-12-14T16:02:12.780] mcs: MCSParameters = (null). ondemand set.
[2020-12-14T16:02:12.780] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2020-12-14T16:02:12.780] error: slurm_set_addr: Unable to resolve "(null)"
[2020-12-14T16:02:12.780] error: slurm_set_port: attempting to set port without address family
[2020-12-14T16:02:12.782] error: Error creating slurm stream socket: Address family not supported by protocol
[2020-12-14T16:02:12.782] fatal: slurm_init_msg_engine_port error Address family not supported by protocol

Strangely, the daemon works fine once it is manually restarted. After running
systemctl restart slurmctld.service

the service status is
[admin@localhost ~]$ systemctl status slurmctld.service
● slurmctld.service - Slurm controller daemon
     Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/slurmctld.service.d
             └─override.conf
     Active: active (running) since Mon 2020-12-14 16:14:24 PST; 3s ago
   Main PID: 2815 (slurmctld)
      Tasks: 7
     Memory: 1.9M
        CPU: 15ms
     CGroup: /system.slice/slurmctld.service
             └─2815 /usr/local/sbin/slurmctld -D
Dec 14 16:14:24 localhost.localdomain systemd[1]: Started Slurm controller daemon.

Could anyone point me towards how to fix this? I expect it's just an issue
with my configuration file, which I've copied below for reference.
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#SlurmctldHost=localhost
ControlMachine=localhost
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/home/slurm/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/home/slurm/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd/
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/home/slurm/spool/
SwitchType=switch/none
TaskPlugin=task/affinity
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster

Re: [slurm-users] slurmctld daemon error

2020-12-14 Thread Luke Yeager
What does your ‘slurmctld.service’ look like? You might want to add something
to the ‘After=’ section if your service is starting too quickly.

For example, we use ‘After=network.target munge.service’ (see here).
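
A minimal drop-in sketch for that, using the override directory already shown
in the status output above. network-online.target is an assumption on my part:
unlike network.target, it waits until the network is actually configured, which
matters when a daemon needs name resolution at startup.

# /etc/systemd/system/slurmctld.service.d/override.conf
[Unit]
Wants=network-online.target
After=network-online.target munge.service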


Re: [slurm-users] slurmctld daemon error

2020-12-14 Thread Avery Grieve
Hey Luke, I'm getting the same issue with my slurmctld daemon not starting on
boot (as well as my slurmd daemon). Both fail with the same messages John got
above (just an exit code).

My slurmctld service file in /etc/systemd/system/ looks like this:

[Unit]
Description=Slurm controller daemon
After=network.target munge.service
ConditionPathExists=/etc/slurm-llnl/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/default/slurmctld
ExecStart=/usr/local/sbin/slurmctld -D $SLURMCTLD_OPTIONS
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Similar to John, my daemon starts if I just run the systemctl start command
following boot.

~Avery Grieve
They/Them/Theirs please!
University of Michigan
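
One thing worth remembering after editing unit files: systemd caches them, so a
reload-and-retest cycle is needed for changes to take effect. A sketch:

sudo systemctl daemon-reload
sudo systemctl restart slurmctld.service
systemctl status slurmctld.service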


Re: [slurm-users] slurmctld daemon error

2020-12-14 Thread Alpha Experiment
Hi Luke and Avery,

I changed the After line in the slurmctld.service file to
After=network.target munge.service slurmd.service

This seemed to do the trick!

Best,
John
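
For anyone hitting the same thing, the resulting boot ordering can be verified
with stock systemd tooling; a sketch:

systemd-analyze critical-chain slurmctld.service
systemctl list-dependencies --after slurmctld.service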


Re: [slurm-users] slurmctld daemon error

2020-12-14 Thread Brian Andrus
Check your hosts file and ensure 'localhost' does not have an IPv6 address
associated with it.


Brian Andrus
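
A quick way to see what the resolver actually returns for it — a sketch,
assuming glibc's standard tools:

getent hosts localhost     # first matching entry
getent ahosts localhost    # every address family returned
grep localhost /etc/hosts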


Re: [slurm-users] slurmctld daemon error

2020-12-14 Thread Alpha Experiment
Hi Brian,

My hosts file looks like this:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

I believe the second line is an IPv6 address. Is it safe to delete that line?

Best,
John
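
A conservative option, rather than deleting the line outright, is to comment it
out and re-test — assuming nothing else on the box relies on the IPv6 loopback
names:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
#::1        localhost localhost.localdomain localhost6 localhost6.localdomain6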


[slurm-users] Scripts run slower in slurm?

2020-12-14 Thread Alpha Experiment
Hi,

I made a short script in Python to test whether Slurm was correctly limiting
the number of CPUs available to each partition. The script is as follows:
import multiprocessing as mp
import time as t

def fibonacci(n):
    """Return the list of Fibonacci numbers up to n."""
    n = int(n)
    def fibon(a, b, n, result):
        c = a + b
        result.append(c)
        if c < n:
            fibon(b, c, n, result)
        return result
    return fibon(0, 1, n, [])

def calcnfib(n):
    res = fibonacci(n)
    return res[-1]  # largest Fibonacci number reached

def benchmark(pool):
    # Time how long the pool takes to map calcnfib over a range of limits.
    # (The range bounds were garbled by the list archive; range(100, 100000, 1000)
    # is a guess that keeps the workload non-trivial.)
    t0 = t.time()
    out = pool.map(calcnfib, range(100, 100000, 1000))
    tf = t.time()
    return str(tf - t0)

pool = mp.Pool(4)
print("4: " + benchmark(pool))

pool = mp.Pool(32)
print("32: " + benchmark(pool))

pool = mp.Pool(64)
print("64: " + benchmark(pool))

pool = mp.Pool(128)
print("128: " + benchmark(pool))

It is called using the following submission script:
#!/bin/bash
#SBATCH --partition=full
#SBATCH --job-name="Large"
source testenv1/bin/activate
python3 multithread_example.py

The slurm out file reads
4: 5.660163640975952
32: 5.762076139450073
64: 5.8220226764678955
128: 5.85421347618103

However, if I run
source testenv1/bin/activate
python3 multithread_example.py

I find faster and more expected behavior
4: 1.5878620147705078
32: 0.34162330627441406
64: 0.24987316131591797
128: 0.2247719764709472

For reference my slurm configuration file is
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#SlurmctldHost=localhost
ControlMachine=localhost

#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/home/slurm/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/home/slurm/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd/
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/home/slurm/spool/
SwitchType=switch/none
TaskPlugin=task/affinity

# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=info
SlurmctldLogFile=/home/slurm/log/slurmctld.log
#SlurmdDebug=info
#SlurmdLogFile=

# COMPUTE NODES
NodeName=localhost CPUs=128 RealMemory=257682 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN
PartitionName=full Nodes=localhost Default=YES MaxTime=INFINITE State=UP
PartitionName=half Nodes=localhost Default=NO MaxTime=INFINITE State=UP MaxNodes=1 MaxCPUsPerNode=64 MaxMemPerNode=128841

Here is my cgroup.conf file as well
CgroupAutomount=yes
ConstrainCores=no
ConstrainRAMSpace=no

If anyone has any suggestions about what might be going wrong, and why the
script takes so much longer when run under Slurm, please let me know!

Best,
John
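
When chasing this kind of slowdown it can help to confirm which CPUs the batch
step is actually bound to, since TaskPlugin=task/affinity pins tasks to the
allocated cores; a sketch:

#!/bin/bash
#SBATCH --partition=full
# Allocated CPU count as Slurm reports it, vs. what the process may actually use
echo "SLURM_CPUS_ON_NODE=$SLURM_CPUS_ON_NODE"
nproc
grep Cpus_allowed_list /proc/self/status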