[slurm-users] Problem with slurmctld communication with slurmdbd

2017-11-29 Thread Bruno Santos
Hi everyone,

I have set up Slurm to use slurm_db and all was working fine. However, I had
to change slurm.conf to play with user priority, and upon restarting,
slurmctld fails with the messages below. It seems that it is somehow
trying to use the MySQL password as a munge socket?
Any idea how to solve it?


> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port
> 6817 with slurmdbd.
> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart
> with --num-threads=10
> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed:
> Failed to access "magic": No such file or directory
> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket
> communication error
> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open:
> failed to send persistent connection init message to localhost:6819
> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending
> PersistInit msg: Protocol authentication error
> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart
> with --num-threads=10
> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed:
> Failed to access "magic": No such file or directory
> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket
> communication error
> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open:
> failed to send persistent connection init message to localhost:6819
> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending
> PersistInit msg: Protocol authentication error
> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart
> with --num-threads=10
> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed:
> Failed to access "magic": No such file or directory
> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket
> communication error
> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open:
> failed to send persistent connection init message to localhost:6819
> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending
> PersistInit msg: Protocol authentication error
> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have
> any association data from your database.  The priority/multifactor plugin
> requires this information to run correctly.  Please check your database
> connection and try again.
> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process
> exited, code=exited, status=1/FAILURE
> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed
> state.
> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result
> 'exit-code'.


Re: [slurm-users] Problem with slurmctld communication with slurmdbd

2017-11-29 Thread Andy Riebs

It looks like you don't have the munged daemon running.
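(A quick generic check, assuming a systemd-based host: confirm the munge service is active and that a credential round-trips locally.

systemctl status munge
munge -n | unmunge

If munged is not running, the "Munge encode failed" errors in the log are expected.)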

On 11/29/2017 08:01 AM, Bruno Santos wrote:

Hi everyone,

I have set-up slurm to use slurm_db and all was working fine. However 
I had to change the slurm.conf to play with user priority and upon 
restarting the slurmctl is fails with the following messages below. It 
seems that somehow is trying to use the mysql password as a munge socket?

Any idea how to solve it?

Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at
port 6817 with slurmdbd.
Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up,
restart with --num-threads=10
Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode
failed: Failed to access "magic": No such file or directory
Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication:
Socket communication error
Nov 29 12:56:32 plantae slurmctld[29613]: error:
slurm_persist_conn_open: failed to send persistent connection init
message to localhost:6819
Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending
PersistInit msg: Protocol authentication error
Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up,
restart with --num-threads=10
Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode
failed: Failed to access "magic": No such file or directory
Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication:
Socket communication error
Nov 29 12:56:34 plantae slurmctld[29613]: error:
slurm_persist_conn_open: failed to send persistent connection init
message to localhost:6819
Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending
PersistInit msg: Protocol authentication error
Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up,
restart with --num-threads=10
Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode
failed: Failed to access "magic": No such file or directory
Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication:
Socket communication error
Nov 29 12:56:36 plantae slurmctld[29613]: error:
slurm_persist_conn_open: failed to send persistent connection init
message to localhost:6819
Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending
PersistInit msg: Protocol authentication error
Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you
don't have any association data from your database.  The
priority/multifactor plugin requires this information to run
correctly.  Please check your database connection and try again.
Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main
process exited, code=exited, status=1/FAILURE
Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit
entered failed state.
Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with
result 'exit-code'.






Re: [slurm-users] Problem with slurmctld communication with slurmdbd

2017-11-29 Thread Bruno Santos
I actually just managed to figure that one out.

The problem was that I had set AccountingStoragePass=magic in the
slurm.conf file; after re-reading the documentation, it seems this is
only needed if a separate munge instance controls logins to
the database, which I don't have.
Commenting that line out seems to have worked; however, I am now getting a
different error:

> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port
> 6817 with slurmdbd.
> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open:
> Something happened with the receiving/processing of the persistent
> connection init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process
> exited, code=exited, status=1/FAILURE
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed
> state.
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result
> 'exit-code'.


My slurm.conf looks like this

> # LOGGING AND ACCOUNTING
> AccountingStorageHost=localhost
> AccountingStorageLoc=slurm_db
> #AccountingStoragePass=magic
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageUser=slurm
> AccountingStoreJobComment=YES
> ClusterName=research
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> SlurmdDebug=3


And the slurmdbd.conf looks like this:

> ArchiveEvents=yes
> ArchiveJobs=yes
> ArchiveResvs=yes
> ArchiveSteps=no
> #ArchiveTXN=no
> #ArchiveUsage=no
> # Authentication info
> AuthType=auth/munge
> AuthInfo=/var/run/munge/munge.socket.2

> #Database info
> # slurmDBD info
> DbdAddr=plantae
> DbdHost=plantae
> # Database info
> StorageType=accounting_storage/mysql
> StorageHost=localhost
> SlurmUser=slurm
> StoragePass=magic
> StorageUser=slurm
> StorageLoc=slurm_db



Thank you very much in advance.

Best,
Bruno


On 29 November 2017 at 13:28, Andy Riebs  wrote:

> It looks like you don't have the munged daemon running.
>
>
> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>
> Hi everyone,
>
> I have set-up slurm to use slurm_db and all was working fine. However I
> had to change the slurm.conf to play with user priority and upon restarting
> the slurmctl is fails with the following messages below. It seems that
> somehow is trying to use the mysql password as a munge socket?
> Any idea how to solve it?
>
>
>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port
>> 6817 with slurmdbd.
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart
>> with --num-threads=10
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed:
>> Failed to access "magic": No such file or directory
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket
>> communication error
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open:
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart
>> with --num-threads=10
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed:
>> Failed to access "magic": No such file or directory
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket
>> communication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open:
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart
>> with --num-threads=10
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed:
>> Failed to access "magic": No such file or directory
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket
>> communication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open:
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't
>> have any association data from your database.  The priority/multifactor
>> plugin requires this information to run correctly.  Please check your
>> database connection and try again.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process
>> exited, code=exited, status=1/FAILURE
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered
>> failed state.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result
>> 'exit-code'.
>
>
>
>
>
>


Re: [slurm-users] Problem with slurmctld communication with slurmdbd

2017-11-29 Thread Barbara Krašovec
I was struggling like crazy with this one a while ago.
Then I saw this in the slurm.conf man page:

AccountingStoragePass
  The password used to gain access to the database to store the accounting
  data. Only used for database type storage plugins, ignored otherwise. In the
  case of Slurm DBD (Database Daemon) with MUNGE authentication this can be
  configured to use a MUNGE daemon specifically configured to provide
  authentication between clusters while the default MUNGE daemon provides
  authentication within a cluster. In that case, AccountingStoragePass should
  specify the named port to be used for communications with the alternate
  MUNGE daemon (e.g. "/var/run/munge/global.socket.2"). The default value is
  NULL. Also see DefaultStoragePass.

So if you are using MUNGE, you leave this out of slurm.conf, because the
default munge socket path is used. You specify the database password only
in slurmdbd.conf.
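A minimal sketch of that split, with placeholder values rather than anything taken from this particular setup: slurm.conf only points the controller at slurmdbd and leaves the password out,

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
#AccountingStoragePass not set when the default MUNGE daemon is used

while the MySQL credentials live only in slurmdbd.conf:

StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=yourdbpassword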

Cheers,
Barbara

> On 29 Nov 2017, at 14:28, Andy Riebs  wrote:
> 
> It looks like you don't have the munged daemon running.
> 
> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>> Hi everyone,
>> 
>> I have set-up slurm to use slurm_db and all was working fine. However I had 
>> to change the slurm.conf to play with user priority and upon restarting the 
>> slurmctl is fails with the following messages below. It seems that somehow 
>> is trying to use the mysql password as a munge socket?
>> Any idea how to solve it?
>> 
>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817 
>> with slurmdbd.
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart 
>> with --num-threads=10
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed 
>> to access "magic": No such file or directory
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket 
>> communication error
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending 
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart 
>> with --num-threads=10
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: Failed 
>> to access "magic": No such file or directory
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket 
>> communication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending 
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart 
>> with --num-threads=10
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: Failed 
>> to access "magic": No such file or directory
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket 
>> communication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending 
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have 
>> any association data from your database.  The priority/multifactor plugin 
>> requires this information to run correctly.  Please check your database 
>> connection and try again.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Main process exited, 
>> code=exited, status=1/FAILURE
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Unit entered failed 
>> state.
>> Nov 29 12:56:36 plantae systemd[1]: slurmctld.service: Failed with result 
>> 'exit-code'.
>> 
>> 
> 





Re: [slurm-users] Problem with slurmctld communication with slurmdbd

2017-11-29 Thread Barbara Krašovec
Hello,

Does munge work?
Try whether decoding works locally:
munge -n | unmunge
Try whether decoding works remotely:
munge -n | ssh  unmunge

It seems as if the munge keys do not match...
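(One generic way to confirm a key mismatch, with the remote hostname only as an example:

md5sum /etc/munge/munge.key
ssh node01 md5sum /etc/munge/munge.key

The checksums should be identical on every host in the cluster.)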

See comments inline..

> On 29 Nov 2017, at 14:40, Bruno Santos  wrote:
> 
> I actually just managed to figure that one out.
> 
> The problem was that I had setup AccountingStoragePass=magic in the 
> slurm.conf file while after re-reading the documentation it seems this is 
> only needed if I have a different munge instance controlling the logins to 
> the database, which I don't.
> So commenting that line out seems to have worked however I am now getting a 
> different error:
> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 
> with slurmdbd.
> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: 
> Something happened with the receiving/processing of the persistent connection 
> init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, 
> code=exited, status=1/FAILURE
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed 
> state.
> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 
> 'exit-code'.
> 
> My slurm.conf looks like this
> # LOGGING AND ACCOUNTING
> AccountingStorageHost=localhost
> AccountingStorageLoc=slurm_db
> #AccountingStoragePass=magic
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageUser=slurm
> AccountingStoreJobComment=YES
> ClusterName=research
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> SlurmdDebug=3

You only need:
AccountingStorageEnforce=associations,limits,qos
AccountingStorageHost=
AccountingStorageType=accounting_storage/slurmdbd

You can remove AccountingStorageLoc and AccountingStorageUser.


> 
> And the slurdbd.conf like this:
> ArchiveEvents=yes
> ArchiveJobs=yes
> ArchiveResvs=yes
> ArchiveSteps=no
> #ArchiveTXN=no
> #ArchiveUsage=no
> # Authentication info
> AuthType=auth/munge
> AuthInfo=/var/run/munge/munge.socket.2
> #Database info
> # slurmDBD info
> DbdAddr=plantae
> DbdHost=plantae
> # Database info
> StorageType=accounting_storage/mysql
> StorageHost=localhost
> SlurmUser=slurm
> StoragePass=magic
> StorageUser=slurm
> StorageLoc=slurm_db
> 
> 
> Thank you very much in advance.
> 
> Best,
> Bruno

Cheers,
Barbara

> 
> 
> On 29 November 2017 at 13:28, Andy Riebs  > wrote:
> It looks like you don't have the munged daemon running.
> 
> 
> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>> Hi everyone,
>> 
>> I have set-up slurm to use slurm_db and all was working fine. However I had 
>> to change the slurm.conf to play with user priority and upon restarting the 
>> slurmctl is fails with the following messages below. It seems that somehow 
>> is trying to use the mysql password as a munge socket?
>> Any idea how to solve it?
>> 
>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port 6817 
>> with slurmdbd.
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up, restart 
>> with --num-threads=10
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed: Failed 
>> to access "magic": No such file or directory
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket 
>> communication error
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:32 plantae slurmctld[29613]: error: slurmdbd: Sending 
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: If munged is up, restart 
>> with --num-threads=10
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: Munge encode failed: Failed 
>> to access "magic": No such file or directory
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: authentication: Socket 
>> communication error
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:34 plantae slurmctld[29613]: error: slurmdbd: Sending 
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: If munged is up, restart 
>> with --num-threads=10
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: Munge encode failed: Failed 
>> to access "magic": No such file or directory
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: authentication: Socket 
>> communication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurm_persist_conn_open: 
>> failed to send persistent connection init message to localhost:6819
>> Nov 29 12:56:36 plantae slurmctld[29613]: error: slurmdbd: Sending 
>> PersistInit msg: Protocol authentication error
>> Nov 29 12:56:36 plantae slurmctld[29613]: fatal: It appears you don't have 
>> any associatio

Re: [slurm-users] Problem with slurmctld communication with slurmdbd

2017-11-29 Thread Bruno Santos
Thank you Barbara,

Unfortunately, it does not seem to be a munge problem. Munge can
successfully authenticate with the nodes.

I have increased the verbosity level and restarted the slurmctld and now I
am getting more information about this:

> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port
>> 6817 with slurmdbd.
>
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open:
>> Something happened with the receiving/processing of the persistent
>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending
>> PersistInit msg: No error
>
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open:
>> Something happened with the receiving/processing of the persistent
>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending
>> PersistInit msg: No error
>
> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have
>> any association data from your database.  The priority/multifactor plugin
>> requires this information to run correctly.  Please check your database
>> connection and try again.
>
>
The problem seems to somehow be related to slurmdbd?
I am a bit lost at this point, to be honest.

Best,
Bruno

On 29 November 2017 at 14:06, Barbara Krašovec 
wrote:

> Hello,
>
> does munge work?
> Try if decode works locally:
> munge -n | unmunge
> Try if decode works remotely:
> munge -n | ssh  unmunge
>
> It seems as munge keys do not match...
>
> See comments inline..
>
> On 29 Nov 2017, at 14:40, Bruno Santos  wrote:
>
> I actually just managed to figure that one out.
>
> The problem was that I had setup AccountingStoragePass=magic in the
> slurm.conf file while after re-reading the documentation it seems this is
> only needed if I have a different munge instance controlling the logins to
> the database, which I don't.
> So commenting that line out seems to have worked however I am now getting
> a different error:
>
>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port
>> 6817 with slurmdbd.
>> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open:
>> Something happened with the receiving/processing of the persistent
>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process
>> exited, code=exited, status=1/FAILURE
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered
>> failed state.
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result
>> 'exit-code'.
>
>
> My slurm.conf looks like this
>
>> # LOGGING AND ACCOUNTING
>> AccountingStorageHost=localhost
>> AccountingStorageLoc=slurm_db
>> #AccountingStoragePass=magic
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageUser=slurm
>> AccountingStoreJobComment=YES
>> ClusterName=research
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=3
>> SlurmdDebug=3
>
>
> You only need:
> AccountingStorageEnforce=associations,limits,qos
> AccountingStorageHost=
> AccountingStorageType=accounting_storage/slurmdbd
>
> You can remove AccountingStorageLoc and AccountingStorageUser.
>
>
>
> And the slurdbd.conf like this:
>
>> ArchiveEvents=yes
>> ArchiveJobs=yes
>> ArchiveResvs=yes
>> ArchiveSteps=no
>> #ArchiveTXN=no
>> #ArchiveUsage=no
>> # Authentication info
>> AuthType=auth/munge
>> AuthInfo=/var/run/munge/munge.socket.2
>
> #Database info
>> # slurmDBD info
>> DbdAddr=plantae
>> DbdHost=plantae
>> # Database info
>> StorageType=accounting_storage/mysql
>> StorageHost=localhost
>> SlurmUser=slurm
>> StoragePass=magic
>> StorageUser=slurm
>> StorageLoc=slurm_db
>
>
>
> Thank you very much in advance.
>
> Best,
> Bruno
>
>
> Cheers,
> Barbara
>
>
>
> On 29 November 2017 at 13:28, Andy Riebs  wrote:
>
>> It looks like you don't have the munged daemon running.
>>
>>
>> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>>
>> Hi everyone,
>>
>> I have set-up slurm to use slurm_db and all was working fine. However I
>> had to change the slurm.conf to play with user priority and upon restarting
>> the slurmctl is fails with the following messages below. It seems that
>> somehow is trying to use the mysql password as a munge socket?
>> Any idea how to solve it?
>>
>>
>>> Nov 29 12:56:30 plantae slurmctld[29613]: Registering slurmctld at port
>>> 6817 with slurmdbd.
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: If munged is up,
>>> restart with --num-threads=10
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: Munge encode failed:
>>> Failed to access "magic": No such file or directory
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error: authentication: Socket
>>> communication error
>>> Nov 29 12:56:32 plantae slurmctld[29613]: error:
>>> slurm_persist_conn_open: failed to send

[slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread david vilanova
Hello,
I have installed the latest 17.11 release and my node is shown as down.
I have a single physical server with 12 cores, so I am not sure the conf below
is correct. Can you help?

In slurm.conf the node is configured as follows:

NodeName=linuxcluster CPUs=1 RealMemory=991 Sockets=12 CoresPerSocket=1
ThreadsPerCore=1 Feature=local
PartitionName=testq Nodes=inuxcluster Default=YES MaxTime=INFINITE State=UP


Output from sinfo:
ubuntu@obione:~$ sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
testq*   up   infinite  1  down* linuxcluster



Thanks,


Re: [slurm-users] Problem with slurmctld communication with slurmdbd

2017-11-29 Thread Barbara Krašovec
Did you upgrade SLURM or is it a fresh install?

Are there any associations set? For instance, did you create the cluster with 
sacctmgr?
sacctmgr add cluster 

Is mariadb/mysql server running, is slurmdbd running? Is it working? Try a 
simple test, such as:
sacctmgr show user -s
If it was an upgrade, did you try to run slurmdbd and slurmctld manually
first:

slurmdbd -Dv

Then controller:

slurmctld -Dv

Which OS is that?
Is there a firewall/selinux/ACLs?
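(For reference, on a fresh install the "no association data" fatal usually means the cluster and at least one account/user association have not been created yet. A generic sketch, using the ClusterName from the posted slurm.conf and placeholder account/user names:

sacctmgr add cluster research
sacctmgr add account sciences cluster=research
sacctmgr add user bruno account=sciences

After that, sacctmgr show associations should return something and slurmctld should get past the priority/multifactor check.)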

Cheers,
Barbara


> On 29 Nov 2017, at 15:19, Bruno Santos  wrote:
> 
> Thank you Barbara,
> 
> Unfortunately, it does not seem to be a munge problem. Munge can successfully 
> authenticate with the nodes.
> 
> I have increased the verbosity level and restarted the slurmctld and now I am 
> getting more information about this:
> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port 6817 
> with slurmdbd.
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: 
> Something happened with the receiving/processing of the persistent connection 
> init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending 
> PersistInit msg: No error
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: 
> Something happened with the receiving/processing of the persistent connection 
> init message to localhost:6819: Initial RPC not DBD_INIT
> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending 
> PersistInit msg: No error
> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have 
> any association data from your database.  The priority/multifactor plugin 
> requires this information to run correctly.  Please check your database 
> connection and try again.
> 
> The problem seems to somehow be related to slurmdbd?
> I am a bit lost at this point, to be honest.
> 
> Best,
> Bruno
> 
> On 29 November 2017 at 14:06, Barbara Krašovec  > wrote:
> Hello,
> 
> does munge work?
> Try if decode works locally:
> munge -n | unmunge
> Try if decode works remotely:
> munge -n | ssh  unmunge
> 
> It seems as munge keys do not match...
> 
> See comments inline..
> 
>> On 29 Nov 2017, at 14:40, Bruno Santos > > wrote:
>> 
>> I actually just managed to figure that one out.
>> 
>> The problem was that I had setup AccountingStoragePass=magic in the 
>> slurm.conf file while after re-reading the documentation it seems this is 
>> only needed if I have a different munge instance controlling the logins to 
>> the database, which I don't.
>> So commenting that line out seems to have worked however I am now getting a 
>> different error:
>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 
>> with slurmdbd.
>> Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: 
>> Something happened with the receiving/processing of the persistent 
>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, 
>> code=exited, status=1/FAILURE
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed 
>> state.
>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 
>> 'exit-code'.
>> 
>> My slurm.conf looks like this
>> # LOGGING AND ACCOUNTING
>> AccountingStorageHost=localhost
>> AccountingStorageLoc=slurm_db
>> #AccountingStoragePass=magic
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageUser=slurm
>> AccountingStoreJobComment=YES
>> ClusterName=research
>> JobCompType=jobcomp/none
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=3
>> SlurmdDebug=3
> 
> You only need:
> AccountingStorageEnforce=associations,limits,qos
> AccountingStorageHost=
> AccountingStorageType=accounting_storage/slurmdbd
> 
> You can remove AccountingStorageLoc and AccountingStorageUser.
> 
> 
>> 
>> And the slurdbd.conf like this:
>> ArchiveEvents=yes
>> ArchiveJobs=yes
>> ArchiveResvs=yes
>> ArchiveSteps=no
>> #ArchiveTXN=no
>> #ArchiveUsage=no
>> # Authentication info
>> AuthType=auth/munge
>> AuthInfo=/var/run/munge/munge.socket.2
>> #Database info
>> # slurmDBD info
>> DbdAddr=plantae
>> DbdHost=plantae
>> # Database info
>> StorageType=accounting_storage/mysql
>> StorageHost=localhost
>> SlurmUser=slurm
>> StoragePass=magic
>> StorageUser=slurm
>> StorageLoc=slurm_db
>> 
>> 
>> Thank you very much in advance.
>> 
>> Best,
>> Bruno
> 
> Cheers,
> Barbara
> 
>> 
>> 
>> On 29 November 2017 at 13:28, Andy Riebs > > wrote:
>> It looks like you don't have the munged daemon running.
>> 
>> 
>> On 11/29/2017 08:01 AM, Bruno Santos wrote:
>>> Hi everyone,
>>> 
>>> I have set-up slurm to use slurm_db and all was working fine. However I had 
>>> to change the slurm.conf to play with user priority and upon restarting the 
>

Re: [slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread Steffen Grunewald
Hi David,

On Wed, 2017-11-29 at 14:45:06 +, david vilanova wrote:
> Hello,
> I have installed latest 7.11 release and my node is shown as down.
> I hava a single physical server with 12 cores so not sure the conf below is
> correct ?? can you help ??
> 
> In slurm.conf the node is configure as follows:
> 
> NodeName=linuxcluster CPUs=1 RealMemory=991 Sockets=12 CoresPerSocket=1
> ThreadsPerCore=1 Feature=local

12 Sockets? Certainly not... 12 Cores per socket, yes.
(IIRC CPUS shouldn't be specified if the detailed topology is given. 
You may try CPUs=12 and drop the details.)
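(A sketch of the corrected line, keeping the RealMemory value from the original post and assuming a single socket:

NodeName=linuxcluster CPUs=12 RealMemory=991 State=UNKNOWN

A node that is already marked down typically also has to be returned to service once the config is fixed, for example with:

scontrol update NodeName=linuxcluster State=RESUME)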

> PartitionName=testq Nodes=inuxcluster Default=YES MaxTime=INFINITE State=UP
   ^^ typo?

Cheers,
 Steffen



Re: [slurm-users] Problem with slurmctld communication with slurmdbd

2017-11-29 Thread Bruno Santos
Hi Barbara,

This is a fresh install. I have installed Slurm from source on Debian
stretch and am now trying to set it up correctly.
MariaDB is running fine, but I am confused about the database configuration.
I followed a tutorial (I can no longer find it) that showed me how to
create the database and give it to the slurm user in MySQL. I haven't really
done anything further than that, as running anything returns the same errors:

root@plantae:~# sacctmgr show user -s
> sacctmgr: error: slurm_persist_conn_open: Something happened with the
> receiving/processing of the persistent connection init message to
> localhost:6819: Initial RPC not DBD_INIT
> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
> sacctmgr: error: slurm_persist_conn_open: Something happened with the
> receiving/processing of the persistent connection init message to
> localhost:6819: Initial RPC not DBD_INIT
> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
> sacctmgr: error: slurm_persist_conn_open: Something happened with the
> receiving/processing of the persistent connection init message to
> localhost:6819: Initial RPC not DBD_INIT
> sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
> sacctmgr: error: slurmdbd: DBD_GET_USERS failure: No error
>  Problem with query.




On 29 November 2017 at 14:46, Barbara Krašovec 
wrote:

> Did you upgrade SLURM or is it a fresh install?
>
> Are there any associations set? For instance, did you create the cluster
> with sacctmgr?
> sacctmgr add cluster 
>
> Is mariadb/mysql server running, is slurmdbd running? Is it working? Try a
> simple test, such as:
>
> sacctmgr show user -s
>
> If it was an upgrade, did you try to run the slurmdbd and slurmctld
> manuallly first:
>
> slurmdbd -Dv
>
> Then controller:
>
> slurmctld -Dv
>
> Which OS is that?
> Is there a firewall/selinux/ACLs?
>
> Cheers,
> Barbara
>
>
> On 29 Nov 2017, at 15:19, Bruno Santos  wrote:
>
> Thank you Barbara,
>
> Unfortunately, it does not seem to be a munge problem. Munge can
> successfully authenticate with the nodes.
>
> I have increased the verbosity level and restarted the slurmctld and now I
> am getting more information about this:
>
>> Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port
>>> 6817 with slurmdbd.
>>
>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open:
>>> Something happened with the receiving/processing of the persistent
>>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>>
>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending
>>> PersistInit msg: No error
>>
>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open:
>>> Something happened with the receiving/processing of the persistent
>>> connection init message to localhost:6819: Initial RPC not DBD_INIT
>>
>> Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending
>>> PersistInit msg: No error
>>
>> Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't
>>> have any association data from your database.  The priority/multifactor
>>> plugin requires this information to run correctly.  Please check your
>>> database connection and try again.
>>
>>
> The problem seems to somehow be related to slurmdbd?
> I am a bit lost at this point, to be honest.
>
> Best,
> Bruno
>
> On 29 November 2017 at 14:06, Barbara Krašovec 
> wrote:
>
>> Hello,
>>
>> does munge work?
>> Try if decode works locally:
>> munge -n | unmunge
>> Try if decode works remotely:
>> munge -n | ssh  unmunge
>>
>> It seems as munge keys do not match...
>>
>> See comments inline..
>>
>> On 29 Nov 2017, at 14:40, Bruno Santos  wrote:
>>
>> I actually just managed to figure that one out.
>>
>> The problem was that I had setup AccountingStoragePass=magic in the
>> slurm.conf file while after re-reading the documentation it seems this is
>> only needed if I have a different munge instance controlling the logins to
>> the database, which I don't.
>> So commenting that line out seems to have worked however I am now getting
>> a different error:
>>
>>> Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port
>>> 6817 with slurmdbd.
>>> Nov 29 13:19:20 plantae slurmctld[29984]: error:
>>> slurm_persist_conn_open: Something happened with the receiving/processing
>>> of the persistent connection init message to localhost:6819: Initial RPC
>>> not DBD_INIT
>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process
>>> exited, code=exited, status=1/FAILURE
>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered
>>> failed state.
>>> Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with
>>> result 'exit-code'.
>>
>>
>> My slurm.conf looks like this
>>
>>> # LOGGING AND ACCOUNTING
>>> AccountingStorageHost=localhost
>>> AccountingStorageLoc=slurm_db
>>> #AccountingStoragePass=magic
>>> #AccountingStoragePort=
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStorageUser=slurm
>

[slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Christian Anthon

Hi,

I have a problem with a newly set up slurm-17.02.7-1.el6.x86_64 where jobs
seem to be stuck in ReqNodeNotAvail:


  6982 panic  Morgens    ferro PD   0:00 1 
(ReqNodeNotAvail, UnavailableNodes:)
  6981 panic SPEC    ferro PD   0:00 1 
(ReqNodeNotAvail, UnavailableNodes:)


The nodes are fully allocated in terms of memory, but not all CPU
resources are consumed.


PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
_default  up     infinite      19    mix clone[05-11,25-29,31-32,36-37,39-40,45]
_default  up     infinite      11  alloc alone[02-08,10-13]
fastlane  up     infinite      19    mix clone[05-11,25-29,31-32,36-37,39-40,45]
fastlane  up     infinite      11  alloc alone[02-08,10-13]
panic     up     infinite      19    mix clone[05-11,25-29,31-32,36-37,39-40,45]
panic     up     infinite      12  alloc alone[02-08,10-13,15]
free*     up     infinite      19    mix clone[05-11,25-29,31-32,36-37,39-40,45]
free*     up     infinite      11  alloc alone[02-08,10-13]

Possibly relevant lines in slurm.conf (full slurm.conf attached)

SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/none
FastSchedule=1

Any advice?

Cheers, Christian.
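(With SelectTypeParameters=CR_CPU_Memory and DefMemPerCPU set, jobs cannot start on a node whose memory is already fully allocated, even if that node still has idle CPUs, which matches the sinfo output above. A generic way to confirm that, with the node name and job id only as examples:

scontrol show node clone05 | grep -E 'CPUAlloc|RealMemory|AllocMem'
scontrol show job 6982 | grep -iE 'mem|tres'

The first command compares the node's allocated memory against its RealMemory; the second shows what the pending job is requesting.)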

# Maintained by PUPPET, local edits will be lost

#General
ClusterName=rth
ControlMachine=rnai01
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
Proctracktype=proctrack/linuxproc
ReturnToService=1
MaxJobCount=1

#prolog/epilog
Prolog=/etc/slurm/scripts/slurm.prolog
Epilog=/etc/slurm/scripts/slurm.epilog
TaskProlog=/etc/slurm/scripts/slurm.task.prolog
TaskEpilog=/etc/slurm/scripts/slurm.task.epilog
SrunProlog=/etc/slurm/scripts/slurm.srun.prolog
SrunEpilog=/etc/slurm/scripts/slurm.srun.epilog

#TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0

#SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/none
FastSchedule=1

#Job priority
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityFavorSmall=NO
PriorityMaxAge=14-0
PriorityWeightAge=1000
PriorityWeightFairshare=1
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor

#LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd.log
JobCompType=jobcomp/none

#ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
AccountingStorageEnforce=limits,qos
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=rnai01

#Job defaults
DefMemPerCPU=1024

#Privacy
PrivateData=accounts,jobs,reservations,usage,users
UsePAM=1

#COMPUTE NODES
TmpFS=/tmp
NodeName=alone[02-08,10-13,15] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 
RealMemory=64386 TmpDisk=426554 State=UNKNOWN
NodeName=clone[05-11,25-29,31-32,36-37,39-40,45] Sockets=2 CoresPerSocket=4 
ThreadsPerCore=2 RealMemory=24019 TmpDisk=446453 State=UNKNOWN

#Partitions
PartitionName=_default 
Nodes=alone[02-08,10-13],clone[05-11,25-29,31-32,36-37,39-40,45] 
PriorityJobFactor=2 AllowAccounts=rth AllowGroups=rth ExclusiveUser=YES 
Default=NO DefaultTime=24:00:00 MaxTime=INFINITE State=UP
PartitionName=fastlane 
Nodes=alone[02-08,10-13],clone[05-11,25-29,31-32,36-37,39-40,45] 
PriorityJobFactor=10 AllowAccounts=rth AllowGroups=rth ExclusiveUser=YES 
Default=NO DefaultTime=24:00:00 MaxTime=INFINITE State=UP
PartitionName=panic
Nodes=alone[02-08,10-13,15],clone[05-11,25-29,31-32,36-37,39-40,45] 
PriorityJobFactor=100 AllowAccounts=rth AllowGroups=rth ExclusiveUser=YES 
Default=NO DefaultTime=24:00:00 MaxTime=INFINITE State=UP
PartitionName=free 
Nodes=alone[02-08,10-13],clone[05-11,25-29,31-32,36-37,39-40,45] 
PriorityJobFactor=1 AllowAccounts=ALL AllowGroups=ALL ExclusiveUser=YES 
Default=YES DefaultTime=24:00:00 MaxTime=INFINITE State=UP


Re: [slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread david vilanova
Hi,
I have updated the slurm.conf as follows:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=linuxcluster CPUs=2
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP

I still get the testq node in down status. Any idea?

Below is the log from the db and controller:
==> /var/log/slurm/slurmctrl.log <==
[2017-11-29T16:28:30.446] slurmctld version 17.11.0 started on cluster
linuxcluster
[2017-11-29T16:28:30.850] error: SelectType specified more than once,
latest value used
[2017-11-29T16:28:30.851] layouts: no layout to initialize
[2017-11-29T16:28:30.855] layouts: loading entities/relations information
[2017-11-29T16:28:30.855] Recovered state of 1 nodes
[2017-11-29T16:28:30.855] Down nodes: linuxcluster
[2017-11-29T16:28:30.855] Recovered information about 0 jobs
[2017-11-29T16:28:30.855] cons_res: select_p_node_init
[2017-11-29T16:28:30.855] cons_res: preparing for 1 partitions
[2017-11-29T16:28:30.856] Recovered state of 0 reservations
[2017-11-29T16:28:30.856] _preserve_plugins: backup_controller not specified
[2017-11-29T16:28:30.856] cons_res: select_p_reconfigure
[2017-11-29T16:28:30.856] cons_res: select_p_node_init
[2017-11-29T16:28:30.856] cons_res: preparing for 1 partitions
[2017-11-29T16:28:30.856] Running as primary controller
[2017-11-29T16:28:30.856] Registering slurmctld at port 6817 with slurmdbd.
[2017-11-29T16:28:31.098] No parameter for mcs plugin, default values set
[2017-11-29T16:28:31.098] mcs: MCSParameters = (null). ondemand set.
[2017-11-29T16:29:31.169]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

David



On Wed, 29 Nov 2017 at 15:59, Steffen Grunewald <
steffen.grunew...@aei.mpg.de> wrote:

> Hi David,
>
> On Wed, 2017-11-29 at 14:45:06 +, david vilanova wrote:
> > Hello,
> > I have installed latest 7.11 release and my node is shown as down.
> > I hava a single physical server with 12 cores so not sure the conf below
> is
> > correct ?? can you help ??
> >
> > In slurm.conf the node is configure as follows:
> >
> > NodeName=linuxcluster CPUs=1 RealMemory=991 Sockets=12 CoresPerSocket=1
> > ThreadsPerCore=1 Feature=local
>
> 12 Sockets? Certainly not... 12 Cores per socket, yes.
> (IIRC CPUS shouldn't be specified if the detailed topology is given.
> You may try CPUs=12 and drop the details.)
>
> > PartitionName=testq Nodes=inuxcluster Default=YES MaxTime=INFINITE
> State=UP
>^^ typo?
>
> Cheers,
>  Steffen
>
>


Re: [slurm-users] Problem with slurmctld communication with slurmdbd

2017-11-29 Thread Philip Kovacs
Step back from slurm and confirm that MariaDB is up and responsive.

# mysql -uroot -p
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 8
Server version: 10.2.9-MariaDB MariaDB Server

Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select table_schema, table_name from information_schema.tables;
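(If the daemon's database user turns out to be the problem, the usual step is to make sure the account named in slurmdbd.conf has full rights on the accounting schema. A generic MariaDB example matching the StorageUser/StorageLoc/StoragePass values quoted earlier in the thread:

GRANT ALL ON slurm_db.* TO 'slurm'@'localhost' IDENTIFIED BY 'magic';
FLUSH PRIVILEGES;)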
 

On Wednesday, November 29, 2017 10:17 AM, Bruno Santos 
 wrote:
 

 Hi Barbara,
This is a fresh install. I have installed slurm from source on Debian stretch 
and now trying to set it up correctly. MariaDB is running for but I am confused 
about the database configuration. I followed a tutorial (I can no longer find 
it) that showed me how to create the database and give it to the slurm user on 
mysql. Haven't really done anything further than that as running anything 
return the same errors:

root@plantae:~# sacctmgr show user -s
sacctmgr: error: slurm_persist_conn_open: Something happened with the 
receiving/processing of the persistent connection init message to 
localhost:6819: Initial RPC not DBD_INIT
sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
sacctmgr: error: slurm_persist_conn_open: Something happened with the 
receiving/processing of the persistent connection init message to 
localhost:6819: Initial RPC not DBD_INIT
sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
sacctmgr: error: slurm_persist_conn_open: Something happened with the 
receiving/processing of the persistent connection init message to 
localhost:6819: Initial RPC not DBD_INIT
sacctmgr: error: slurmdbd: Sending PersistInit msg: No error
sacctmgr: error: slurmdbd: DBD_GET_USERS failure: No error
 Problem with query.
 

On 29 November 2017 at 14:46, Barbara Krašovec  wrote:

Did you upgrade SLURM or is it a fresh install?

Are there any associations set? For instance, did you create the cluster with sacctmgr?
sacctmgr add cluster 

Is mariadb/mysql server running, is slurmdbd running? Is it working? Try a simple test, such as:
sacctmgr show user -s

If it was an upgrade, did you try to run the slurmdbd and slurmctld manually first:

slurmdbd -Dv

Then controller:

slurmctld -Dv

Which OS is that?
Is there a firewall/selinux/ACLs?

Cheers,
Barbara


On 29 Nov 2017, at 15:19, Bruno Santos  wrote:
Thank you Barbara, 
Unfortunately, it does not seem to be a munge problem. Munge can successfully 
authenticate with the nodes. 
I have increased the verbosity level and restarted the slurmctld and now I am 
getting more information about this:

Nov 29 14:08:16 plantae slurmctld[30340]: Registering slurmctld at port 6817 
with slurmdbd.

Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: 
Something happened with the receiving/processing of the persistent connection 
init message to localhost:6819: Initial RPC not DBD_INIT

Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit 
msg: No error

Nov 29 14:08:16 plantae slurmctld[30340]: error: slurm_persist_conn_open: 
Something happened with the receiving/processing of the persistent connection 
init message to localhost:6819: Initial RPC not DBD_INIT

Nov 29 14:08:16 plantae slurmctld[30340]: error: slurmdbd: Sending PersistInit 
msg: No error

Nov 29 14:08:16 plantae slurmctld[30340]: fatal: It appears you don't have any 
association data from your database.  The priority/multifactor plugin requires 
this information to run correctly.  Please check your database connection and 
try again.


The problem seems to somehow be related to slurmdbd? I am a bit lost at this
point, to be honest.

Best,
Bruno
On 29 November 2017 at 14:06, Barbara Krašovec  wrote:

Hello,
does munge work?
Try if decode works locally:
munge -n | unmunge
Try if decode works remotely:
munge -n | ssh  unmunge
It seems as munge keys do not match...
See comments inline..


On 29 Nov 2017, at 14:40, Bruno Santos  wrote:
I actually just managed to figure that one out. 
The problem was that I had setup AccountingStoragePass=magic in the slurm.conf 
file while after re-reading the documentation it seems this is only needed if I 
have a different munge instance controlling the logins to the database, which I 
don't. So commenting that line out seems to have worked however I am now 
getting a different error: 
Nov 29 13:19:20 plantae slurmctld[29984]: Registering slurmctld at port 6817 
with slurmdbd.
Nov 29 13:19:20 plantae slurmctld[29984]: error: slurm_persist_conn_open: 
Something happened with the receiving/processing of the persistent connection 
init message to localhost:6819: Initial RPC not DBD_INIT
Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Main process exited, 
code=exited, status=1/FAILURE
Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Unit entered failed 
state.
Nov 29 13:19:20 plantae systemd[1]: slurmctld.service: Failed with result 
'exit-code'.

My slurm.conf 

Re: [slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Merlin Hartley
Can you give us the output of 
# control show job 6982

Could be an issue with requesting too many CPUs or something…


Merlin
--
Merlin Hartley
Computer Officer
MRC Mitochondrial Biology Unit
Cambridge, CB2 0XY
United Kingdom

> On 29 Nov 2017, at 15:21, Christian Anthon  wrote:
> 
> Hi,
> 
> I have a problem with a newly setup slurm-17.02.7-1.el6.x86_64 that jobs 
> seems to be stuck in ReqNodeNotAvail:
> 
>   6982 panic  Morgensferro PD   0:00 1 
> (ReqNodeNotAvail, UnavailableNodes:)
>   6981 panic SPECferro PD   0:00 1 
> (ReqNodeNotAvail, UnavailableNodes:)
> 
> The nodes are fully allocated in terms of memory, but not all cpu resources 
> are consumed
> 
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> _default up   infinite 19mix 
> clone[05-11,25-29,31-32,36-37,39-40,45]
> _default up   infinite 11  alloc alone[02-08,10-13]
> fastlane up   infinite 19mix 
> clone[05-11,25-29,31-32,36-37,39-40,45]
> fastlane up   infinite 11  alloc alone[02-08,10-13]
> panicup   infinite 19mix 
> clone[05-11,25-29,31-32,36-37,39-40,45]
> panicup   infinite 12  alloc alone[02-08,10-13,15]
> free*up   infinite 19mix 
> clone[05-11,25-29,31-32,36-37,39-40,45]
> free*up   infinite 11  alloc alone[02-08,10-13]
> 
> Possibly relevant lines in slurm.conf (full slurm.conf attached)
> 
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_CPU_Memory
> TaskPlugin=task/none
> FastSchedule=1
> 
> Any advice?
> 
> Cheers, Christian.
> 
> 



Re: [slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Merlin Hartley
damn autocorrect - I meant:

# scontrol show job 6982



--
Merlin Hartley
Computer Officer
MRC Mitochondrial Biology Unit
Cambridge, CB2 0XY
United Kingdom

> On 29 Nov 2017, at 16:08, Merlin Hartley  
> wrote:
> 
> Can you give us the output of 
> # control show job 6982
> 
> Could be an issue with requesting too many CPUs or something…
> 
> 
> Merlin
> --
> Merlin Hartley
> Computer Officer
> MRC Mitochondrial Biology Unit
> Cambridge, CB2 0XY
> United Kingdom
> 
>> On 29 Nov 2017, at 15:21, Christian Anthon > > wrote:
>> 
>> Hi,
>> 
>> I have a problem with a newly setup slurm-17.02.7-1.el6.x86_64 that jobs 
>> seems to be stuck in ReqNodeNotAvail:
>> 
>>   6982 panic  Morgensferro PD   0:00 1 
>> (ReqNodeNotAvail, UnavailableNodes:)
>>   6981 panic SPECferro PD   0:00 1 
>> (ReqNodeNotAvail, UnavailableNodes:)
>> 
>> The nodes are fully allocated in terms of memory, but not all cpu resources 
>> are consumed
>> 
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> _default up   infinite 19mix 
>> clone[05-11,25-29,31-32,36-37,39-40,45]
>> _default up   infinite 11  alloc alone[02-08,10-13]
>> fastlane up   infinite 19mix 
>> clone[05-11,25-29,31-32,36-37,39-40,45]
>> fastlane up   infinite 11  alloc alone[02-08,10-13]
>> panicup   infinite 19mix 
>> clone[05-11,25-29,31-32,36-37,39-40,45]
>> panicup   infinite 12  alloc alone[02-08,10-13,15]
>> free*up   infinite 19mix 
>> clone[05-11,25-29,31-32,36-37,39-40,45]
>> free*up   infinite 11  alloc alone[02-08,10-13]
>> 
>> Possibly relevant lines in slurm.conf (full slurm.conf attached)
>> 
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>> TaskPlugin=task/none
>> FastSchedule=1
>> 
>> Any advice?
>> 
>> Cheers, Christian.
>> 
>> 
> 



Re: [slurm-users] Problem with slurmctld communication with slurmdbd

2017-11-29 Thread Bruno Santos
I managed to make some more progress on this. The problem seems to be that
the service was somehow still linked to an older version of slurmdbd I had
installed with apt. I have now hopefully fully cleaned out the old version, but
when I try to start the service it is getting killed somehow. Any
suggestions?

[2017-11-29T16:15:16.778] debug3: Trying to load plugin
> /usr/local/lib/slurm/auth_munge.so
> [2017-11-29T16:15:16.778] debug:  Munge authentication plugin loaded
> [2017-11-29T16:15:16.778] debug3: Success.
> [2017-11-29T16:15:16.778] debug3: Trying to load plugin
> /usr/local/lib/slurm/accounting_storage_mysql.so
> [2017-11-29T16:15:16.780] debug2: mysql_connect() called for db slurm_db
> [2017-11-29T16:15:16.786] adding column federation after flags in table
> cluster_table
> [2017-11-29T16:15:16.786] adding column fed_id after federation in table
> cluster_table
> [2017-11-29T16:15:16.786] adding column fed_state after fed_id in table
> cluster_table
> [2017-11-29T16:15:16.786] adding column fed_weight after fed_state in
> table cluster_table
> [2017-11-29T16:15:16.786] debug:  Table cluster_table has changed.
> Updating...
> [2017-11-29T16:15:17.259] debug:  Table txn_table has changed.  Updating...
> [2017-11-29T16:15:17.781] debug:  Table tres_table has changed.
> Updating...
> [2017-11-29T16:15:18.325] debug:  Table acct_coord_table has changed.
> Updating...
> [2017-11-29T16:15:18.783] debug:  Table acct_table has changed.
> Updating...
> [2017-11-29T16:15:19.252] debug:  Table res_table has changed.  Updating...
> [2017-11-29T16:15:20.267] debug:  Table clus_res_table has changed.
> Updating...
> [2017-11-29T16:15:20.762] debug:  Table qos_table has changed.  Updating...
> [2017-11-29T16:15:21.272] debug:  Table user_table has changed.
> Updating...
> [2017-11-29T16:15:22.079] Accounting storage MYSQL plugin loaded
> [2017-11-29T16:15:22.080] debug3: Success.
> [2017-11-29T16:15:22.083] debug2: ArchiveDir= /tmp
> [2017-11-29T16:15:22.083] debug2: ArchiveScript = (null)
> [2017-11-29T16:15:22.083] debug2: AuthInfo  = (null)
> [2017-11-29T16:15:22.083] debug2: AuthType  = auth/munge
> [2017-11-29T16:15:22.083] debug2: CommitDelay   = 0
> [2017-11-29T16:15:22.083] debug2: DbdAddr   = 10.1.10.37
> [2017-11-29T16:15:22.083] debug2: DbdBackupHost = (null)
> [2017-11-29T16:15:22.083] debug2: DbdHost   = plantae
> [2017-11-29T16:15:22.083] debug2: DbdPort   = 6819
> [2017-11-29T16:15:22.083] debug2: DebugFlags= (null)
> [2017-11-29T16:15:22.083] debug2: DebugLevel= 7
> [2017-11-29T16:15:22.083] debug2: DefaultQOS= (null)
> [2017-11-29T16:15:22.083] debug2: LogFile   =
> /slurm/log/slurmdbd.log
> [2017-11-29T16:15:22.083] debug2: MessageTimeout= 10
> [2017-11-29T16:15:22.083] debug2: PidFile   =
> /slurm/run/slurmdbd.pid
> [2017-11-29T16:15:22.083] debug2: PluginDir = /usr/local/lib/slurm
> [2017-11-29T16:15:22.083] debug2: PrivateData   = none
> [2017-11-29T16:15:22.083] debug2: PurgeEventAfter   = NONE

[2017-11-29T16:15:22.083] debug2: PurgeJobAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeResvAfter= NONE
> [2017-11-29T16:15:22.083] debug2: PurgeStepAfter= NONE
> [2017-11-29T16:15:22.083] debug2: PurgeSuspendAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeTXNAfter = NONE
> [2017-11-29T16:15:22.083] debug2: PurgeUsageAfter = NONE
> [2017-11-29T16:15:22.083] debug2: SlurmUser = slurm(64030)
> [2017-11-29T16:15:22.083] debug2: StorageBackupHost = (null)
> [2017-11-29T16:15:22.083] debug2: StorageHost   = localhost
> [2017-11-29T16:15:22.083] debug2: StorageLoc= slurm_db
> [2017-11-29T16:15:22.083] debug2: StoragePort   = 3306
> [2017-11-29T16:15:22.083] debug2: StorageType   =
> accounting_storage/mysql
> [2017-11-29T16:15:22.083] debug2: StorageUser   = slurm
> [2017-11-29T16:15:22.083] debug2: TCPTimeout= 2
> [2017-11-29T16:15:22.083] debug2: TrackWCKey= 0
> [2017-11-29T16:15:22.083] debug2: TrackSlurmctldDown= 0
> [2017-11-29T16:15:22.083] debug2: acct_storage_p_get_connection: request
> new connection 1
> [2017-11-29T16:15:22.086] slurmdbd version 17.02.9 started
> [2017-11-29T16:15:22.086] debug2: running rollup at Wed Nov 29 16:15:22
> 2017
> [2017-11-29T16:15:22.086] debug2: Everything rolled up
> [2017-11-29T16:16:46.798] Terminate signal (SIGINT or SIGTERM) received
> [2017-11-29T16:16:46.798] debug:  rpc_mgr shutting down
> [2017-11-29T16:16:46.799] debug3: starting mysql cleaning up
> [2017-11-29T16:16:46.799] debug3: finished mysql cleaning up
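(When two installs are in play like this, a few generic commands help confirm which slurmdbd binary and unit file systemd is actually starting; the unit name and paths are assumed to be the defaults:

systemctl cat slurmdbd
which slurmdbd
slurmdbd -V

If the ExecStart path still points at the old apt-installed binary, that would explain the version mismatch.)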




On 29 November 2017 at 15:13, Bruno Santos  wrote:

> Hi Barbara,
>
> This is a fresh install. I have installed slurm from source on Debian
> stretch and now trying to set it up correctly.
> MariaDB is running for but I am confused about the database configuration.
> I followed a tutorial (I can no longer find it) that showed me how to
> create the database and give

Re: [slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread Benjamin Redling



On 11/29/17 4:32 PM, david vilanova wrote:

Hi,
I have updated the slurm.conf as follows:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=linuxcluster CPUs=2
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP

Still getting the testq node in down state. Any idea?

Below log from db and controller:
==> /var/log/slurm/slurmctrl.log <==
[2017-11-29T16:28:30.446] slurmctld version 17.11.0 started on cluster 
linuxcluster
[2017-11-29T16:28:30.850] error: SelectType specified more than once, latest value used


... if it says so, it is probably right.

Have you checked your slurm.conf?
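
A quick way to spot a duplicate entry (assuming the config lives at the usual
/etc/slurm/slurm.conf; adjust the path to wherever your build installed it):

grep -n -i '^SelectType' /etc/slurm/slurm.conf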

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323



Re: [slurm-users] slurm conf with single machine with multi cores.

2017-11-29 Thread Le Biot, Pierre-Marie
Hello David,

So linuxcluster is the Head node and also a Compute node?

Is slurmd running?

What does /var/log/slurm/slurmd.log say?
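
For example (unit name and paths assume a systemd install with the default log
location above; the last command is only for once the underlying cause is fixed):

systemctl status slurmd
tail -n 50 /var/log/slurm/slurmd.log
scontrol show node linuxcluster          # the Reason= field explains the DOWN state
scontrol update NodeName=linuxcluster State=RESUME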

Regards,
Pierre-Marie Le Biot


From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
david vilanova
Sent: Wednesday, November 29, 2017 4:33 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] slurm conf with single machine with multi cores.

Hi,
I have updated the slurm.conf as follows:

SelectType=select/cons_res
SelectTypeParameters=CR_CPU
NodeName=linuxcluster CPUs=2
PartitionName=testq Nodes=linuxcluster Default=YES MaxTime=INFINITE State=UP

Still getting the testq node in down state. Any idea?

Below log from db and controller:
==> /var/log/slurm/slurmctrl.log <==
[2017-11-29T16:28:30.446] slurmctld version 17.11.0 started on cluster 
linuxcluster
[2017-11-29T16:28:30.850] error: SelectType specified more than once, latest 
value used
[2017-11-29T16:28:30.851] layouts: no layout to initialize
[2017-11-29T16:28:30.855] layouts: loading entities/relations information
[2017-11-29T16:28:30.855] Recovered state of 1 nodes
[2017-11-29T16:28:30.855] Down nodes: linuxcluster
[2017-11-29T16:28:30.855] Recovered information about 0 jobs
[2017-11-29T16:28:30.855] cons_res: select_p_node_init
[2017-11-29T16:28:30.855] cons_res: preparing for 1 partitions
[2017-11-29T16:28:30.856] Recovered state of 0 reservations
[2017-11-29T16:28:30.856] _preserve_plugins: backup_controller not specified
[2017-11-29T16:28:30.856] cons_res: select_p_reconfigure
[2017-11-29T16:28:30.856] cons_res: select_p_node_init
[2017-11-29T16:28:30.856] cons_res: preparing for 1 partitions
[2017-11-29T16:28:30.856] Running as primary controller
[2017-11-29T16:28:30.856] Registering slurmctld at port 6817 with slurmdbd.
[2017-11-29T16:28:31.098] No parameter for mcs plugin, default values set
[2017-11-29T16:28:31.098] mcs: MCSParameters = (null). ondemand set.
[2017-11-29T16:29:31.169] 
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

David



On Wed, 29 Nov 2017 at 15:59, Steffen Grunewald
<steffen.grunew...@aei.mpg.de> wrote:
Hi David,

On Wed, 2017-11-29 at 14:45:06 +, david vilanova wrote:
> Hello,
> I have installed latest 7.11 release and my node is shown as down.
> I hava a single physical server with 12 cores so not sure the conf below is
> correct ?? can you help ??
>
> In slurm.conf the node is configure as follows:
>
> NodeName=linuxcluster CPUs=1 RealMemory=991 Sockets=12 CoresPerSocket=1
> ThreadsPerCore=1 Feature=local

12 Sockets? Certainly not... 12 Cores per socket, yes.
(IIRC CPUS shouldn't be specified if the detailed topology is given.
You may try CPUs=12 and drop the details.)
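
Roughly, the two equivalent ways to write that node line would be (an untested
sketch keeping your RealMemory value; State=UNKNOWN is just the usual default):

NodeName=linuxcluster CPUs=12 RealMemory=991 Feature=local State=UNKNOWN
NodeName=linuxcluster Sockets=1 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=991 Feature=local State=UNKNOWN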

> PartitionName=testq Nodes=inuxcluster Default=YES MaxTime=INFINITE State=UP
   ^^ typo?

Cheers,
 Steffen


Re: [slurm-users] fail when trying to set up selection=con_res

2017-11-29 Thread Ethan Van Matre
We do have hyperthreading enabled. Here are some log extracts from various 
attempts to get it working.


[2017-11-28T15:52:30.466] error: we don't have select plugin type 101

[2017-11-28T15:52:30.466] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-28T15:52:30.466] error: Malformed RPC of type REQUEST_ABORT_JOB(6013) 
received

[2017-11-28T15:52:30.466] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received

[2017-11-28T15:52:30.476] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T15:58:55.683] error: we don't have select plugin type 101

[2017-11-28T15:58:55.683] error: select_g_select_jobinfo_unpack: unpack error


[2017-11-28T16:02:21.490] error: we don't have select plugin type 101

[2017-11-28T16:02:21.490] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-28T16:02:21.490] error: Malformed RPC of type 
REQUEST_TERMINATE_JOB(6011) received

[2017-11-28T16:02:21.490] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received

[2017-11-28T16:02:21.491] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.491] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.492] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.492] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.493] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.496] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.496] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.498] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.498] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.498] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.498] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.498] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:02:21.500] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-28T16:03:21.535] error: we don't have select plugin type 101



At one point using linear (I think) I was able to get 4 jobs to run at once on 
this node. We have 40 CPUs.


[2017-11-28T16:37:36.023] _run_prolog: run job script took usec=4

[2017-11-28T16:37:36.023] _run_prolog: prolog with lock for job 6637 ran for 0 
seconds

[2017-11-28T16:37:36.023] Launching batch job 6637 for UID 1000

[2017-11-28T16:37:36.024] _run_prolog: run job script took usec=4

[2017-11-28T16:37:36.024] _run_prolog: prolog with lock for job 6638 ran for 0 
seconds

[2017-11-28T16:37:36.024] _run_prolog: run job script took usec=5

[2017-11-28T16:37:36.024] _run_prolog: prolog with lock for job 6639 ran for 0 
seconds

[2017-11-28T16:37:36.025] _run_prolog: run job script took usec=4

[2017-11-28T16:37:36.025] _run_prolog: prolog with lock for job 6640 ran for 0 
seconds

[2017-11-28T16:37:36.030] Launching batch job 6640 for UID 1000

[2017-11-28T16:37:36.037] Launching batch job 6638 for UID 1000

[2017-11-28T16:37:36.044] Launching batch job 6639 for UID 1000

[2017-11-28T16:38:18.011] [6639] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 
status 59648

[2017-11-28T16:38:18.011] [6638] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 
status 59648

[2017-11-28T16:38:18.011] [6640] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 
status 59648

[2017-11-28T16:38:18.012] [6637] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 
status 59648

[2017-11-28T16:38:18.015] [6640] done with job

[2017-11-28T16:38:18.015] [6639] done with job

[2017-11-28T16:38:18.015] [6638] done with job

[2017-11-28T16:38:18.015] [6637] done with job





Ethan VanMatre

Informatics Research Analyst
Institute on Development and Disability
Oregon Health & Science University
CSLU - GH

[slurm-users] '--x11' or no '--x11' when using srun when both methods work for X11 graphical applications

2017-11-29 Thread Kevin Manalo
Hello SLURM users,

I was reviewing the X11 documentation

https://slurm.schedmd.com/faq.html#terminal
https://slurm.schedmd.com/faq.html#x11

15. Can tasks be launched with a remote terminal?
In Slurm version 1.3 or higher, use srun's --pty option. Until then, you can 
accomplish this by starting an appropriate program or script. In the simplest 
case (X11 over TCP with the DISPLAY environment already set), executing srun 
xterm may suffice.

At our site, indeed this is sufficient.  We are using the  
https://github.com/hautreux/slurm-spank-x11 plugin currently.
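
For context, the two invocations being compared look like this (xterm is just a
stand-in for any X11 client; with the spank plugin the --x11 flag is provided by
the plugin rather than by srun itself):

srun --pty xterm        # relies on DISPLAY already being usable from the node
srun --x11 xterm        # asks Slurm (or the spank plugin) to forward X11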

I see that  in the 17.11.0 announcement

  -- X11 support is now fully integrated with the main Slurm code. Remove any
     X11 plugin configured in your plugstack.conf file to avoid errors being
     logged about conflicting options.

Questions


  1.  If 'srun --x11' is not needed for X11 forwarding (the simplest case works), do 
we encourage users to use it?  I'm more in need of understanding how it works 
because some users use it, some do not, and more education on this would be 
great.
  2.  Is the SLURM spank x11 plugin now unnecessary once we build 17.11.0 with 
the updated configuration?

Thanks,
Kevin



Re: [slurm-users] fail when trying to set up selection=con_res

2017-11-29 Thread Ethan Van Matre
Here is some more data:

Changed slurm.conf to have


SelectType=select/cons_res

SelectTypeParameters=CR_CPU

Then restarted

 sudo systemctl restart slurmctld.service

The log on the host said:


[2017-11-29T12:23:56.384] error: we don't have select plugin type 101

[2017-11-29T12:23:56.384] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-29T12:23:56.384] error: Malformed RPC of type REQUEST_ABORT_JOB(6013) 
received

[2017-11-29T12:23:56.384] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received


Then did a sudo scontrol reconfigure and the log said:


[2017-11-29T12:23:56.394] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-29T12:24:34.889] Message aggregation disabled

[2017-11-29T12:24:34.890] Resource spec: Reserved system memory limit not 
configured for this node

Sview had running jobs cleared out of its context (they are still running), but I 
kind of expected that.

I then submitted 6 jobs to the partition that do nothing but sleep and the log 
says:


[2017-11-29T12:25:39.424] error: we don't have select plugin type 101

[2017-11-29T12:25:39.424] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-29T12:25:39.424] error: Malformed RPC of type 
REQUEST_BATCH_JOB_LAUNCH(4005) received

[2017-11-29T12:25:39.424] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received

[2017-11-29T12:25:39.424] error: we don't have select plugin type 101

[2017-11-29T12:25:39.424] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-29T12:25:39.424] error: Malformed RPC of type 
REQUEST_BATCH_JOB_LAUNCH(4005) received

[2017-11-29T12:25:39.424] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received

[2017-11-29T12:25:39.424] error: we don't have select plugin type 101

[2017-11-29T12:25:39.424] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-29T12:25:39.424] error: Malformed RPC of type 
REQUEST_BATCH_JOB_LAUNCH(4005) received

[2017-11-29T12:25:39.424] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received

[2017-11-29T12:25:39.424] error: we don't have select plugin type 101

[2017-11-29T12:25:39.424] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-29T12:25:39.424] error: Malformed RPC of type 
REQUEST_BATCH_JOB_LAUNCH(4005) received

[2017-11-29T12:25:39.424] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received

[2017-11-29T12:25:39.425] error: we don't have select plugin type 101

[2017-11-29T12:25:39.425] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-29T12:25:39.425] error: Malformed RPC of type 
REQUEST_BATCH_JOB_LAUNCH(4005) received

[2017-11-29T12:25:39.425] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received

[2017-11-29T12:25:39.425] error: we don't have select plugin type 101

[2017-11-29T12:25:39.425] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-29T12:25:39.425] error: Malformed RPC of type 
REQUEST_BATCH_JOB_LAUNCH(4005) received

[2017-11-29T12:25:39.425] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received

[2017-11-29T12:25:39.434] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-29T12:25:39.434] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-29T12:25:39.434] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-29T12:25:39.434] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-29T12:25:39.435] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-29T12:25:39.435] error: service_connection: slurm_receive_msg: Header 
lengths are longer than data received

[2017-11-29T12:25:39.436] error: we don't have select plugin type 101

[2017-11-29T12:25:39.436] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-29T12:25:39.436] error: Malformed RPC of type 
REQUEST_TERMINATE_JOB(6011) received

[2017-11-29T12:25:39.436] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received

[2017-11-29T12:25:39.436] error: we don't have select plugin type 101

[2017-11-29T12:25:39.436] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-29T12:25:39.436] error: Malformed RPC of type 
REQUEST_TERMINATE_JOB(6011) received

[2017-11-29T12:25:39.436] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received

[2017-11-29T12:25:39.436] error: we don't have select plugin type 101

[2017-11-29T12:25:39.436] error: select_g_select_jobinfo_unpack: unpack error

[2017-11-29T12:25:39.436] error: Malformed RPC of type 
REQUEST_TERMINATE_JOB(6011) received

[2017-11-29T12:25:39.436] error: slurm_receive_msg_and_forward: Header lengths 
are longer than data received


Re: [slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Christian Anthon
Thanks,

I believe the user must have resubmitted the job, hence the updated id.

Cheers, Christian

JobId=6986 JobName=Morgens
   UserId=ferro(2166) GroupId=ferro(22166) MCS_label=N/A
   Priority=1031 Nice=0 Account=rth QOS=normal
   JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:
Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2017-11-29T21:02:38 EligibleTime=2017-11-29T21:02:38
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=panic AllocNode:Sid=rnai01:5765
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=32000,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=16 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)



> Can you give us the output of
> # scontrol show job 6982
>
> Could be an issue with requesting too many CPUs or something…
>
>
> Merlin
> --
> Merlin Hartley
> Computer Officer
> MRC Mitochondrial Biology Unit
> Cambridge, CB2 0XY
> United Kingdom
>
>> On 29 Nov 2017, at 15:21, Christian Anthon  wrote:
>>
>> Hi,
>>
>> I have a problem with a newly setup slurm-17.02.7-1.el6.x86_64 that jobs
>> seems to be stuck in ReqNodeNotAvail:
>>
>>   6982     panic   Morgens    ferro PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:)
>>   6981     panic      SPEC    ferro PD       0:00      1 (ReqNodeNotAvail, UnavailableNodes:)
>>
>> The nodes are fully allocated in terms of memory, but not all cpu
>> resources are consumed
>>
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> _default     up   infinite     19    mix clone[05-11,25-29,31-32,36-37,39-40,45]
>> _default     up   infinite     11  alloc alone[02-08,10-13]
>> fastlane     up   infinite     19    mix clone[05-11,25-29,31-32,36-37,39-40,45]
>> fastlane     up   infinite     11  alloc alone[02-08,10-13]
>> panic        up   infinite     19    mix clone[05-11,25-29,31-32,36-37,39-40,45]
>> panic        up   infinite     12  alloc alone[02-08,10-13,15]
>> free*        up   infinite     19    mix clone[05-11,25-29,31-32,36-37,39-40,45]
>> free*        up   infinite     11  alloc alone[02-08,10-13]
>>
>> Possibly relevant lines in slurm.conf (full slurm.conf attached)
>>
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_CPU_Memory
>> TaskPlugin=task/none
>> FastSchedule=1
>>
>> Any advice?
>>
>> Cheers, Christian.
>>
>> 
>
>




Re: [slurm-users] '--x11' or no '--x11' when using srun when both methods work for X11 graphical applications

2017-11-29 Thread Matthieu Hautreux
Hi Kevin,

Based on my understanding and a discussion with the SLURM dev team on that
subject, here are some information about the new support of X11 in
slurm-17.11 :

- slurm's native support of X11 forwarding is based on libssh2
- slurm's native support of X11 can be disabled at configure/compilation
time using the --disable-x11 configure pragma
- slurm-spank-x11 can be used only if you disable slurm's native support of
X11 at configure/compilation time

The simplest case with --pty available since 1.3 only works if you do not
really want to secure your X11 forwarding (using xhost +...).

libssh2 currently offers only a subset of the capabilities of openssh (fewer
encryption methods, fewer authentication methods, ...). If you need options only
available in openssh, you should use slurm-spank-x11 instead of slurm's
native support of X11. That is what we are doing with 17.11 as we need
GSSAPI support for kerberos authentication, a support that is not provided
by libssh2 right now.

Take a look at https://www.libssh2.org/ to figure out if the provided
features are sufficient for you. If that is the case, I guess that using
slurm's native support of X11 will be easier than having to compile,
install and configure slurm-spank-x11.
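
As a hedged sketch, the two build-time choices mentioned above would look like
this (the prefix is only an example path):

./configure --prefix=/opt/slurm                  # keep the native X11 support
./configure --prefix=/opt/slurm --disable-x11    # disable it so slurm-spank-x11 can be used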

Looking at the code in current slurm-17.11
(src/slurmd/slurmstepd/x11_forwarding.c),
the logic behind slurm's native support of X11 differs from slurm-spank-x11
when it comes to establishing the ssh connections required to secure the X11
forwarding. If my understanding is correct, the native logic is to redirect a
local port on the compute node to the X11 port on the submission/login node
using libssh2 direct-tcpip channels (equivalent to -L %port:%host:%port in
openssh) and set that local port as the X11_DISPLAY to be used locally by the
applications. Hostbased authentication and pubkey authentication are the
two authentication methods that are tried. So you need to have either
hostbased authentication configured on your cluster or pubkey/private keys
available and configured for all of your users in their home directories,
readable on the compute nodes after seteuid/setegid.
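
If you go the pubkey route, a minimal per-user setup might look like the sketch
below (standard openssh file names; whether the native code picks up exactly
these files is an assumption on my side, so treat it as an illustration only):

ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys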


HTH.

Matthieu

On 29 Nov 2017 21:20, "Kevin Manalo"  wrote:

> Hello SLURM users,
>
>
>
> I was reviewing the X11 documentation
>
>
>
> https://slurm.schedmd.com/faq.html#terminal
>
> https://slurm.schedmd.com/faq.html#x11
>
>
>
> 15. Can tasks be launched with a remote terminal?
> In Slurm version 1.3 or higher, use srun's --pty option. Until then, you
> can accomplish this by starting an appropriate program or script. In the
> simplest case (X11 over TCP with the DISPLAY environment already set),
> executing srun xterm may suffice.
>
>
>
> At our site, indeed this is sufficient.  We are using the
> https://github.com/hautreux/slurm-spank-x11 plugin currently.
>
>
>
> I see that  in the 17.11.0 announcement
>
>
>
>   -- X11 support is now fully integrated with the main Slurm code. Remove any
>      X11 plugin configured in your plugstack.conf file to avoid errors being
>      logged about conflicting options.
>
>
>
> Questions
>
>
>
>    1. If 'srun --x11' is not needed for X11 forwarding (the simplest case
>works), do we encourage users to use it?  I’m more in need of understanding
>how it works because some users use it, some do not, and more education on
>this would be great.
>2. Is the SLURM spank x11 plugin now unnecessary once we build 17.11.0
>with the updated configuration?
>
>
>
> Thanks,
>
> Kevin
>
>
>


[slurm-users] Show job command after completion with sacct

2017-11-29 Thread Jacob Chappell
All,

Using "scontrol show jobid X" I can see info about running jobs, including
the command used to launch the job, the user's working directory, values of
stdout, stdin, stderr, etc. With Slurm accounting configured, sacct seems
to show *some* of this information about jobs that have completed. However,
I don't seem to see the "Command" or "WorkDir" field in sacct. Is it
possible to record this information also for jobs that have completed?
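
For reference, the fields a given sacct build can actually report are listed
with --helpformat (the job id below is only a placeholder):

sacct --helpformat
sacct -j 12345 --format=JobID,JobName,Submit,Start,End,State,ExitCode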

Thanks,
__
*Jacob D. Chappell*
*Research Computing Associate*
Research Computing | Research Computing Infrastructure
Information Technology Services | University of Kentucky
301 Rose Street | 102 James F. Hardymon Building
Lexington, KY 40506-0495
jacob.chapp...@uky.edu

Visit us: www.uky.edu/ITS
How are we doing? Send Feedback to itsabout...@uky.edu
ITS . . . it’s about technology. ITS . . . it’s about innovation.  ITS . .
. it’s about you!


[slurm-users] Slurm version 17.11.0 is now available [PMIx with UCX]

2017-11-29 Thread Artem Polyakov
Dear friends and colleagues

On behalf of Mellanox HPC R&D I would like to emphasize a feature that we
introduced in Slurm 17.11 that has been shown [1] to significantly improve
the speed and scalability of Slurm jobstart.

Starting from this release the PMIx plugin supports:
(a) Direct point-to-point connections (Direct-connect) for Out-Of-Band
(OOB) communications. Prior to 17.11 it was using the Slurm RPC mechanism, which
is very convenient but has some performance-related issues. According to
our measurements this significantly improves Slurm/PMIx performance in the
direct-modex case [1]. By default this mode is turned on and is using
TCP-based implementation.
(b) If Slurm is configured with UCX (http://www.openucx.org/) communication
framework, PMIx plugin will use UCX-based implementation of the
Direct-connect.
(c) "Early-wireup" option to pre-connect Slurm step daemons before an
application starts using OOB channel.

The codebase was extensively tested by us internally but we would like broader
testing and look forward to hearing from you about your experience.

This implementation demonstrated good results on a small scale [1]. We are
currently working on obtaining larger-scale results and invite any
interested parties to collaborate. Please contact me through artemp at
mellanox.com if you are interested.

For testing purposes you can use our recently released jobstart project
that we are using internally for development:
https://github.com/artpol84/jobstart. It provides a convenient way to deploy,
as a regular user, a test Slurm instance inside an allocation from the legacy
Slurm managing the cluster. Another good thing about this project is that it
"bash-documents" the way we configure the HPC software stack and can be used
as a reference.

Some technical details about those features:
1. To build with PMIx and UCX libraries you will need to explicitly
configure with both PMIx and UCX:
$ ./configure --with-pmix= --with-ucx=

2. You can select whether Direct-connect is enabled or not using
`SLURM_PMIX_DIRECT_CONN={true|false}` environment variable (envar) on a
per-jobstep basis. By default TCP-based Direct-connect is on if Slurm
wasn't configured with UCX.

3. If UCX support was turned on during configuration, UCX is used by
default for Direct-connect. You can control whether or not UCX is used
through `SLURM_PMIX_DIRECT_CONN_UCX={true|false}` envar. If UCX wasn't
enabled this envar is ignored.

4. To enable UCX from the very first OOB communication we added the
Early-wireup option that pre-connects the UCX-based communication tree in
parallel with the local portion of MPI/OSHMEM application initialization.
By default this feature is turned off and can be controlled using
`SLURM_PMIX_DIRECT_CONN_EARLY={true | false}`. As we gain confidence in
this feature we plan to turn it on by default.

5. You may also want to specify the UCX network device (e.g.
UCX_NET_DEVICES=mlx5_0:1) and the transport (UCX_TLS=dc). For now it is
recommended to use DC as a transport for the jobstart. Full RC support will
be implemented soon. Currently you have to set the global envar (like
UCX_TLS) but in the next release we will introduce prefixed envars (like
UCX_SLURM_TLS and UCX_SLURM_NET_DEVICES) for a finer grained control over
communication resource usage.
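
Putting items 2-5 together, a hedged example of launching a step with the
UCX-backed Direct-connect and early wire-up enabled (the device name and the
application are placeholders):

export SLURM_PMIX_DIRECT_CONN=true
export SLURM_PMIX_DIRECT_CONN_UCX=true
export SLURM_PMIX_DIRECT_CONN_EARLY=true
export UCX_NET_DEVICES=mlx5_0:1
export UCX_TLS=dc
srun --mpi=pmix -N 2 -n 4 ./my_mpi_app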

In the presentation [1] you will also find 2 backup slides explaining how
you can enable point-to-point and collectives micro-benchmarks integrated
into the PMIx plugin to get some basic reference numbers for the performance
on your system.
The jobstart project also contains a simple OSHMEM hello-world application
that measures oshmem_init time.

[1] Slides presented at the Slurm booth at SC17:
https://slurm.schedmd.com/SC17/Mellanox_Slurm_pmix_UCX_backend_v4.pdf.


Best regards, Artem Y. Polyakov
Sr. Engineer SW, Mellanox Technologies Inc.


Re: [slurm-users] Show job command after completion with sacct

2017-11-29 Thread Chris Samuel

On 30/11/17 8:57 am, Jacob Chappell wrote:

Using "scontrol show jobid X" I can see info about running jobs, 
including the command used to launch the job, the user's working 
directory, values of stdout, stdin, stderr, etc.


Note that the announcement for 17.11.0 mentions that the job script
will no longer appear in "scontrol show job"; it's been moved to
scontrol write batch_file $FILE (don't know if that means you can
write it to stdout any more).
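
A hedged sketch of that (12345 and the output filename are placeholders, and the
subcommand spelling here follows the scontrol man page, batch_script):

scontrol write batch_script 12345 job12345.sh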

With Slurm accounting 
configured, sacct seems to show *some* of this information about jobs 
that have completed. However, I don't seem to see the "Command" or 
"WorkDir" field in sacct. Is it possible to record this information also 
for jobs that have completed?


I don't believe that is recorded in the database; you'd need
to request that as a feature from SchedMD.

Best of luck!
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-11-29 Thread Andy Riebs
We've just installed 17.11.0 on our 100+ node x86_64 cluster running 
CentOS 7.4 this afternoon, and periodically see a single node (perhaps 
the first node in an allocation?) get drained with the message "batch 
job complete failure".


On one node in question, slurmd.log reports

   pam_unix(slurm:session): open_session - error recovering username
   pam_loginuid(slurm:session): unexpected response from failed
   conversation function 


On another node drained for the same reason,

   error: pam_open_session: Cannot make/remove an entry for the
   specified session
   error: error in pam_setup
   error: job_manager exiting abnormally, rc = 4020
   sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0

slurmctld has logged

   error: slurmd error running JobId=33 on node(s)=node048: Slurmd
   could not execve job

   drain_nodes: node Summer0c048 state set to DRAIN

It's been a long day (for other reasons), so I'll go dig into this 
tomorrow. But if anyone can shine some light on where I should start 
looking, I shall be most obliged!


Andy

--
Andy Riebs
andy.ri...@hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!



Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Chris Samuel
On Thursday, 30 November 2017 3:26:25 AM AEDT Bruno Santos wrote:

> Managed to do some more progress on this. The problem seems to be related to
> somehow the service still linking to an older version of slurmdbd I had
> installed with apt. I have now hopefully fully cleaned the old version but
> when I try to start the service it is getting killed somehow. Any
> suggestions?

Are you starting it with systemctl?  If so it might be taking too long for 
systemd's liking to upgrade the tables and it might kill it.

You might need to start it by hand first, let it upgrade the tables, and then 
you can stop it and then start it from systemctl.

cheers,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC




Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Chris Samuel
On Thursday, 30 November 2017 5:28:26 PM AEDT Chris Samuel wrote:

> Are you starting it with systemctl?  If so it might be taking too long for
> systemd's liking to upgrade the tables and it might kill it.

Ignore that - I skimmed your logs too quickly!

[2017-11-29T16:15:22.086] slurmdbd version 17.02.9 started
[2017-11-29T16:15:22.086] debug2: running rollup at Wed Nov 29 16:15:22 2017
[2017-11-29T16:15:22.086] debug2: Everything rolled up
[2017-11-29T16:16:46.798] Terminate signal (SIGINT or SIGTERM) received

So it started correctly, then ran for about 85 seconds before something shut it 
down.  It didn't crash (from what I can see).

You can run it by hand as:

/path/to/slurmdbd -Dvvv

to see what it logs in a more verbose way, whilst running it in the 
foreground.

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC




Re: [slurm-users] jobs stuck in ReqNodeNotAvail,

2017-11-29 Thread Chris Samuel
On Thursday, 30 November 2017 2:21:36 AM AEDT Christian Anthon wrote:

> The nodes are fully allocated in terms of memory, but not all cpu
> resources are consumed

I suspect that's your problem: the job wants 16 cores and 32GB of RAM free on a 
single node.  If there's no unallocated memory left, it's not going to be able 
to run.
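
One way to see how much memory is already allocated on a node (clone05 is just
one of the nodes from the sinfo output earlier; field names can differ slightly
between Slurm versions):

scontrol show node clone05 | egrep 'CPUAlloc|RealMemory|AllocMem|FreeMem'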

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC