[slurm-users] Raise the priority of a certain kind of jobs

2020-11-12 Thread SJTU
Hello,

We want to raise the priority of a certain kind of Slurm job. We considered 
doing it in a Prolog script, but the Prolog seems to run only at job start time, 
so it may not be useful for jobs that are already queued. Is there any way to do this?

Thank you; we look forward to your reply.


Best,

Jianwen

Re: [slurm-users] Raise the priority of a certain kind of jobs

2020-11-12 Thread Ole Holm Nielsen

On 11/12/20 10:58 AM, SJTU wrote:

Hello,

We want to raise the priority of a certain kind of Slurm job. We 
considered doing it in a Prolog script, but the Prolog seems to run only at 
job start time, so it may not be useful for jobs that are already queued. Is 
there any way to do this?


You can add a negative "nice value" to the job, for example:

scontrol update jobid=10208 nice=-1

See the scontrol manual page http://slurm.schedmd.com/scontrol.html
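
If many queued jobs need the same adjustment, a small shell loop over squeue 
output can apply it in bulk. A minimal sketch, assuming the jobs can be 
identified by job name (the name "special" and the nice value are placeholders):

   # Bump the priority of all pending jobs with a given job name.
   # A negative nice value raises priority; values below 0 need operator/admin rights.
   for jobid in $(squeue -h -t PENDING --name=special -o '%i'); do
       scontrol update jobid="$jobid" nice=-100
   done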

/Ole



Re: [slurm-users] Raise the priority of a certain kind of jobs

2020-11-12 Thread Marcus Boden
Hi,

you could write a job_submit plugin:
https://slurm.schedmd.com/job_submit_plugins.html

The site factor was added to the priority calculation for exactly this reason.

Best,
Marcus

On 11/12/20 10:58 AM, SJTU wrote:
> Hello,
> 
> We want to raise the priority of a certain kind of Slurm job. We considered 
> doing it in a Prolog script, but the Prolog seems to run only at job start time, 
> so it may not be useful for jobs that are already queued. Is there any way to do this?
> 
> Thank you; we look forward to your reply.
> 
> 
> Best,
> 
> Jianwen
> 

-- 
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience, HPC-Team
Tel.:   +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de
-
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Am Faßberg 11, 37077 Göttingen, URL: https://www.gwdg.de

Support: Tel.: +49 551 201-1523, URL: https://www.gwdg.de/support
Sekretariat: Tel.: +49 551 201-1510, Fax: -2150, E-Mail: g...@gwdg.de

Geschäftsführer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Christian Griesinger
Sitz der Gesellschaft: Göttingen
Registergericht: Göttingen, Handelsregister-Nr. B 598

Zertifiziert nach ISO 9001
-





Re: [slurm-users] Raise the priority of a certain kind of jobs

2020-11-12 Thread Zacarias Benta
You can create a QOS with a higher priority, or you can create a specific 
partition with a higher priority.
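
A minimal sketch of the QOS route, assuming accounting (slurmdbd) is in use; the 
QOS name "highprio", the priority value, the user name, and the script name are 
placeholders only:

   # Create a QOS with a larger priority factor and let a user submit with it.
   sacctmgr add qos highprio
   sacctmgr modify qos highprio set Priority=1000
   sacctmgr modify user jianwen set qos+=highprio
   # Submit the "certain kind" of job with that QOS.
   sbatch --qos=highprio job.sh

The QOS priority only takes effect if PriorityWeightQOS is non-zero in slurm.conf.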


On 12/11/2020 09:58, SJTU wrote:

Hello,

We want to raise the priority of a certain kind of Slurm job. We 
considered doing it in a Prolog script, but the Prolog seems to run only at 
job start time, so it may not be useful for jobs that are already queued. Is 
there any way to do this?


Thank you; we look forward to your reply.


Best,

Jianwen

--

*Cumprimentos / Best Regards,*

Zacarias Benta
INCD @ LIP - Universidade do Minho






Re: [slurm-users] failed to send msg type 6002: No route to host

2020-11-12 Thread Patrick Bégou
Hi Slurm admins and developers,

Does no one have an idea about this problem?

While investigating further this morning, I discovered that it works from the
management node (a small VM running slurmctld) even though I have no home
directory on it (I use su from root to switch to an unprivileged user
environment). It still doesn't work from the login node, even with the
firewall completely disabled :-(

Patrick

On 10/11/2020 at 11:54, Patrick Bégou wrote:
>
> Hi,
>
> I'm new to slurm (as admin) and I need some help. Testing my initial
> setup with:
>
> [begou@tenibre ~]$ *salloc -n 1 sh*
> salloc: Granted job allocation 11
> sh-4.4$ *squeue*
>  JOBID PARTITION NAME USER ST   TIME 
> NODES NODELIST(REASON)
>     *11 *  all   sh    begou  R  
> 0:16  1 tenibre-0-0
> sh-4.4$*srun /usr/bin/hostname*
> srun: error: timeout waiting for task launch, started 0 of 1 tasks
> srun: Job step 11.0 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to
> finish.
> srun: error: Timed out waiting for job step to complete
>
> I check the connections:
>
> *tenibre is the login node* (no daemon running)
>
> nc -v tenibre-0-0 6818
> nc -v management1 6817
>
> *management1 is the management node* (slurmctld running)
>
> nc -v tenibre-0-0 6818
>
> *tenibre-0-0 is the first compute node* (slurmd running)
>
> nc -v management1 6817
>
> All tests return "/Ncat: Connected.../"
>
> The command "id begou" works on all nodes and I can reach my home
> directory on the login node and on the compute node.
>
> On the compute node slurmd.log shows:
>
> [2020-11-10T11:21:38.050]*launch task* *11.0 *request from
> UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
> [2020-11-10T11:21:38.050] debug:  Checking credential with 508
> bytes of sig data
> [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
> [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11
> ran for 0 seconds
> [2020-11-10T11:21:38.053] debug:  AcctGatherEnergy NONE plugin loaded
> [2020-11-10T11:21:38.053] debug:  AcctGatherProfile NONE plugin loaded
> [2020-11-10T11:21:38.053] debug:  AcctGatherInterconnect NONE
> plugin loaded
> [2020-11-10T11:21:38.053] debug:  AcctGatherFilesystem NONE plugin
> loaded
> [2020-11-10T11:21:38.053] debug:  switch NONE plugin loaded
> [2020-11-10T11:21:38.054] [11.0] debug:  Job accounting gather
> NOT_INVOKED plugin loaded
> [2020-11-10T11:21:38.054] [11.0] debug:  Message thread started
> pid = 12099
> [2020-11-10T11:21:38.054] debug:  task_p_slurmd_reserve_resources:
> 11 0
> [2020-11-10T11:21:38.068] [11.0] debug:  task NONE plugin loaded
> [2020-11-10T11:21:38.068] [11.0] debug:  Checkpoint plugin loaded:
> checkpoint/none
> [2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin
> loaded
> [2020-11-10T11:21:38.068] [11.0] debug:  job_container none plugin
> loaded
> [2020-11-10T11:21:38.068] [11.0] debug:  mpi type = pmi2
> [2020-11-10T11:21:38.068] [11.0] debug:  xcgroup_instantiate:
> cgroup '/sys/fs/cgroup/freezer/slurm' already exists
> [2020-11-10T11:21:38.068] [11.0] debug:  spank: opening plugin
> stack /etc/slurm/plugstack.conf
> [2020-11-10T11:21:38.068] [11.0] debug:  mpi type = (null)
> [2020-11-10T11:21:38.068] [11.0] debug:  using mpi/pmi2
> [2020-11-10T11:21:38.068] [11.0] debug:  _setup_stepd_job_info:
> SLURM_STEP_RESV_PORTS not found in env
> [2020-11-10T11:21:38.068] [11.0] debug:  mpi/pmi2: setup sockets
> [2020-11-10T11:21:38.069] [11.0] debug:  mpi/pmi2: started agent
> thread
> [2020-11-10T11:21:38.069] [11.0]*error: connect io: No route to host*
> [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route
> to host
> [2020-11-10T11:21:38.069] [11.0] debug: 
> step_terminate_monitor_stop signaling condition
> [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting
> abnormally, rc = 4021
> [2020-11-10T11:21:38.069] [11.0] debug:  Sending launch resp rc=4021
> [2020-11-10T11:21:38.069] [11.0] debug:  _send_srun_resp_msg: 0/5
> *failed to send msg type 6002: No route to host*
> [2020-11-10T11:21:38.169] [11.0] debug:  _send_srun_resp_msg: 1/5
> failed to send msg type 6002: No route to host
> [2020-11-10T11:21:38.370] [11.0] debug:  _send_srun_resp_msg: 2/5
> failed to send msg type 6002: No route to host
> [2020-11-10T11:21:38.770] [11.0] debug:  _send_srun_resp_msg: 3/5
> failed to send msg type 6002: No route to host
> [2020-11-10T11:21:39.570] [11.0] debug:  _send_srun_resp_msg: 4/5
> failed to send msg type 6002: No route to host
> [2020-11-10T11:21:40.370] [11.0] debug:  _send_srun_resp_msg: 5/5
> failed to send msg type 6002: No route to host
> [2020-11-10T11:21:40.372] [11.0] debug:  Messag

Re: [slurm-users] failed to send msg type 6002: No route to host

2020-11-12 Thread Marcus Wagner

Hi Patrick,

for me at least, this is running as expected.

I'm not sure why you use "sh" as the command for salloc; I have never seen that before. If you 
do not provide a command, the user's default shell will be started, provided 
"SallocDefaultCommand" is not set in slurm.conf.


So, what does the following do?
$> salloc -n 1
$> srun hostname

And what does this do?
$> salloc -n 1 srun hostname


Best
Marcus


P.S.:

Increasing the debugging level might also help, e.g.

$> srun -v hostname
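
If the client-side output is not enough, one more step that may help (an 
assumption on my part, and it needs root on the compute node) is to run slurmd 
in the foreground with extra verbosity while reproducing the failure:

$> systemctl stop slurmd     # on the compute node, as root
$> slurmd -D -vvv            # run slurmd in the foreground with verbose logging
$> # ...rerun the failing srun from the login node and watch the output...
$> systemctl start slurmd    # restore the service afterwards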

On 10.11.2020 at 11:54, Patrick Bégou wrote:

Hi,

I'm new to slurm (as admin) and I need some help. Testing my initial setup with:

[begou@tenibre ~]$ *salloc -n 1 sh*
salloc: Granted job allocation 11
sh-4.4$ *squeue*
  JOBID PARTITION NAME USER ST   TIME NODES 
NODELIST(REASON)
*11 *  all   sh    begou  R 0:16  1 tenibre-0-0
sh-4.4$*srun /usr/bin/hostname*
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 11.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

I check the connections:

*tenibre is the login node* (no daemon running)

nc -v tenibre-0-0 6818
nc -v management1 6817

*management1 is the management node* (slurmctld running)

nc -v tenibre-0-0 6818

*tenibre-0-0 is the first compute node* (slurmd running)

nc -v management1 6817

All tests return "/Ncat: Connected.../"

The command "id begou" works on all nodes and I can reach my home directory on 
the login node and on the compute node.

On the compute node slurmd.log shows:

[2020-11-10T11:21:38.050]*launch task* *11.0 *request from UID:23455 
GID:1036 HOST:172.30.1.254 PORT:42220
[2020-11-10T11:21:38.050] debug:  Checking credential with 508 bytes of sig 
data
[2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
[2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11 ran for 
0 seconds
[2020-11-10T11:21:38.053] debug:  AcctGatherEnergy NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  AcctGatherProfile NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  AcctGatherInterconnect NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  AcctGatherFilesystem NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  switch NONE plugin loaded
[2020-11-10T11:21:38.054] [11.0] debug:  Job accounting gather NOT_INVOKED 
plugin loaded
[2020-11-10T11:21:38.054] [11.0] debug:  Message thread started pid = 12099
[2020-11-10T11:21:38.054] debug:  task_p_slurmd_reserve_resources: 11 0
[2020-11-10T11:21:38.068] [11.0] debug:  task NONE plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug:  Checkpoint plugin loaded: 
checkpoint/none
[2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug:  job_container none plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug:  mpi type = pmi2
[2020-11-10T11:21:38.068] [11.0] debug:  xcgroup_instantiate: cgroup 
'/sys/fs/cgroup/freezer/slurm' already exists
[2020-11-10T11:21:38.068] [11.0] debug:  spank: opening plugin stack 
/etc/slurm/plugstack.conf
[2020-11-10T11:21:38.068] [11.0] debug:  mpi type = (null)
[2020-11-10T11:21:38.068] [11.0] debug:  using mpi/pmi2
[2020-11-10T11:21:38.068] [11.0] debug:  _setup_stepd_job_info: 
SLURM_STEP_RESV_PORTS not found in env
[2020-11-10T11:21:38.068] [11.0] debug:  mpi/pmi2: setup sockets
[2020-11-10T11:21:38.069] [11.0] debug:  mpi/pmi2: started agent thread
[2020-11-10T11:21:38.069] [11.0]*error: connect io: No route to host*
[2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route to host
[2020-11-10T11:21:38.069] [11.0] debug: step_terminate_monitor_stop 
signaling condition
[2020-11-10T11:21:38.069] [11.0] error: job_manager exiting abnormally, rc 
= 4021
[2020-11-10T11:21:38.069] [11.0] debug:  Sending launch resp rc=4021
[2020-11-10T11:21:38.069] [11.0] debug:  _send_srun_resp_msg: 0/5 *failed 
to send msg type 6002: No route to host*
[2020-11-10T11:21:38.169] [11.0] debug:  _send_srun_resp_msg: 1/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:38.370] [11.0] debug:  _send_srun_resp_msg: 2/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:38.770] [11.0] debug:  _send_srun_resp_msg: 3/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:39.570] [11.0] debug:  _send_srun_resp_msg: 4/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:40.370] [11.0] debug:  _send_srun_resp_msg: 5/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:40.372] [11.0] debug:  Message thread exited
[2020-11-10T11:21:40.372] [11.0] debug:  mpi/pmi2: agent thread exit
[2020-11-10T11:21:40.372] [11.0] *done with job*


But I do not understand what this "No route to host" means.


Thanks for your h

Re: [slurm-users] failed to send msg type 6002: No route to host

2020-11-12 Thread Sean Maxwell
Hi Patrick,

I have seen a similar error while configuring native X-forwarding in Slurm.
It was caused by Slurm sending an IP to the compute node (as part of a
message) that was not routable back to the controller host. In my case it
was because the controller host was multihomed, and I had misconfigured
ControlMachine= in slurm.conf to a hostname associated with the wrong
network interface. If your controller host has multiple network interfaces,
you might want to check that all IPs associated with the controller have
routes back from the compute node.
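
A quick way to check this (just a sketch; the host names and the address are 
taken from the logs earlier in this thread, and the network layout is an 
assumption):

# On the compute node: what do the controller and login node resolve to,
# and is there a route back to the address srun announced in slurmd.log?
getent hosts management1 tenibre
ip route get 172.30.1.254
# On any node: what does Slurm itself think the controller host/address is?
scontrol show config | grep -i -E 'ControlMachine|ControlAddr|SlurmctldHost'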

-Sean

On Thu, Nov 12, 2020 at 7:40 AM Patrick Bégou <
patrick.be...@legi.grenoble-inp.fr> wrote:

> Hi Slurm admins and developers,
>
> Does no one have an idea about this problem?
>
> While investigating further this morning, I discovered that it works from the
> management node (a small VM running slurmctld) even though I have no home
> directory on it (I use su from root to switch to an unprivileged user
> environment). It still doesn't work from the login node, even with the
> firewall completely disabled :-(
>
> Patrick
>
> On 10/11/2020 at 11:54, Patrick Bégou wrote:
>
> Hi,
>
> I'm new to slurm (as admin) and I need some help. Testing my initial setup
> with:
>
> [begou@tenibre ~]$ *salloc -n 1 sh*
> salloc: Granted job allocation 11
> sh-4.4$ *squeue*
>  JOBID PARTITION NAME USER ST   TIME  NODES
> NODELIST(REASON)
> *11 *  all   shbegou  R   0:16  1
> tenibre-0-0
> sh-4.4$* srun /usr/bin/hostname*
> srun: error: timeout waiting for task launch, started 0 of 1 tasks
> srun: Job step 11.0 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> I check the connections:
>
> *tenibre is the login node* (no daemon running)
>
> nc -v tenibre-0-0 6818
> nc -v management1 6817
>
> *management1 is the management node* (slurmctld running)
>
> nc -v tenibre-0-0 6818
>
> *tenibre-0-0 is the first compute node* (slurmd running)
>
> nc -v management1 6817
>
> All tests return "*Ncat: Connected...*"
>
> The command "id begou" works on all nodes and I can reach my home
> directory on the login node and on the compute node.
>
> On the compute node slurmd.log shows:
>
> [2020-11-10T11:21:38.050]* launch task* *11.0 *request from UID:23455
> GID:1036 HOST:172.30.1.254 PORT:42220
> [2020-11-10T11:21:38.050] debug:  Checking credential with 508 bytes of
> sig data
> [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
> [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11 ran for
> 0 seconds
> [2020-11-10T11:21:38.053] debug:  AcctGatherEnergy NONE plugin loaded
> [2020-11-10T11:21:38.053] debug:  AcctGatherProfile NONE plugin loaded
> [2020-11-10T11:21:38.053] debug:  AcctGatherInterconnect NONE plugin loaded
> [2020-11-10T11:21:38.053] debug:  AcctGatherFilesystem NONE plugin loaded
> [2020-11-10T11:21:38.053] debug:  switch NONE plugin loaded
> [2020-11-10T11:21:38.054] [11.0] debug:  Job accounting gather NOT_INVOKED
> plugin loaded
> [2020-11-10T11:21:38.054] [11.0] debug:  Message thread started pid = 12099
> [2020-11-10T11:21:38.054] debug:  task_p_slurmd_reserve_resources: 11 0
> [2020-11-10T11:21:38.068] [11.0] debug:  task NONE plugin loaded
> [2020-11-10T11:21:38.068] [11.0] debug:  Checkpoint plugin loaded:
> checkpoint/none
> [2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin loaded
> [2020-11-10T11:21:38.068] [11.0] debug:  job_container none plugin loaded
> [2020-11-10T11:21:38.068] [11.0] debug:  mpi type = pmi2
> [2020-11-10T11:21:38.068] [11.0] debug:  xcgroup_instantiate: cgroup
> '/sys/fs/cgroup/freezer/slurm' already exists
> [2020-11-10T11:21:38.068] [11.0] debug:  spank: opening plugin stack
> /etc/slurm/plugstack.conf
> [2020-11-10T11:21:38.068] [11.0] debug:  mpi type = (null)
> [2020-11-10T11:21:38.068] [11.0] debug:  using mpi/pmi2
> [2020-11-10T11:21:38.068] [11.0] debug:  _setup_stepd_job_info:
> SLURM_STEP_RESV_PORTS not found in env
> [2020-11-10T11:21:38.068] [11.0] debug:  mpi/pmi2: setup sockets
> [2020-11-10T11:21:38.069] [11.0] debug:  mpi/pmi2: started agent thread
> [2020-11-10T11:21:38.069] [11.0]* error: connect io: No route to host*
> [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route to host
> [2020-11-10T11:21:38.069] [11.0] debug:  step_terminate_monitor_stop
> signaling condition
> [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting abnormally, rc
> = 4021
> [2020-11-10T11:21:38.069] [11.0] debug:  Sending launch resp rc=4021
> [2020-11-10T11:21:38.069] [11.0] debug:  _send_srun_resp_msg: 0/5 *failed
> to send msg type 6002: No route to host*
> [2020-11-10T11:21:38.169] [11.0] debug:  _send_srun_resp_msg: 1/5 failed
> to send msg type 6002: No route to host
> [2020-11-10T11:21:38.370] [11.0] debug:  _send_srun_resp_msg: 2/5 failed
> to send msg type 6002: No route to host
> [2020-11-10T11:21:38.770] [11.0] debug:

Re: [slurm-users] failed to send msg type 6002: No route to host

2020-11-12 Thread Pocina, Goran
I think this message can also happen if the slurm.conf on your login node is 
missing the entry for the slurmd node. The 2020 releases have a way to automate 
synchronization of the configuration ("configless" mode).
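
A sketch of that check (the paths and host names come from this thread and the 
usual defaults, so treat them as assumptions): make sure every node sees the 
same slurm.conf, or let the controller distribute it with configless mode (20.02+):

# Compare slurm.conf across the login, management and compute nodes.
for h in tenibre management1 tenibre-0-0; do ssh "$h" md5sum /etc/slurm/slurm.conf; done
# Configless alternative: on the controller set
#   SlurmctldParameters=enable_configless
# in slurm.conf, then start slurmd on the other nodes with
#   slurmd --conf-server management1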

From: slurm-users  On Behalf Of Patrick 
Bégou
Sent: Thursday, November 12, 2020 7:38 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] failed to send msg type 6002: No route to host


Hi Slurm admins and developers,

Does no one have an idea about this problem?

While investigating further this morning, I discovered that it works from the 
management node (a small VM running slurmctld) even though I have no home 
directory on it (I use su from root to switch to an unprivileged user 
environment). It still doesn't work from the login node, even with the firewall 
completely disabled :-(

Patrick

On 10/11/2020 at 11:54, Patrick Bégou wrote:

Hi,

I'm new to slurm (as admin) and I need some help. Testing my initial setup with:
[begou@tenibre ~]$ salloc -n 1 sh
salloc: Granted job allocation 11
sh-4.4$ squeue
 JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)
11   all   shbegou  R   0:16  1 tenibre-0-0
sh-4.4$ srun /usr/bin/hostname
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 11.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

I check the connections:

tenibre is the login node (no daemon running)
nc -v tenibre-0-0 6818
nc -v management1 6817
management1 is the management node (slurmctld running)
nc -v tenibre-0-0 6818
tenibre-0-0 is the first compute node (slurmd running)

nc -v management1 6817

All tests return "Ncat: Connected..."

The command "id begou" works on all nodes and I can reach my home directory on 
the login node and on the compute node.

On the compute node slurmd.log shows:
[2020-11-10T11:21:38.050] launch task 11.0 request from UID:23455 GID:1036 
HOST:172.30.1.254 PORT:42220
[2020-11-10T11:21:38.050] debug:  Checking credential with 508 bytes of sig data
[2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
[2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11 ran for 0 
seconds
[2020-11-10T11:21:38.053] debug:  AcctGatherEnergy NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  AcctGatherProfile NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  AcctGatherInterconnect NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  AcctGatherFilesystem NONE plugin loaded
[2020-11-10T11:21:38.053] debug:  switch NONE plugin loaded
[2020-11-10T11:21:38.054] [11.0] debug:  Job accounting gather NOT_INVOKED 
plugin loaded
[2020-11-10T11:21:38.054] [11.0] debug:  Message thread started pid = 12099
[2020-11-10T11:21:38.054] debug:  task_p_slurmd_reserve_resources: 11 0
[2020-11-10T11:21:38.068] [11.0] debug:  task NONE plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug:  Checkpoint plugin loaded: 
checkpoint/none
[2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug:  job_container none plugin loaded
[2020-11-10T11:21:38.068] [11.0] debug:  mpi type = pmi2
[2020-11-10T11:21:38.068] [11.0] debug:  xcgroup_instantiate: cgroup 
'/sys/fs/cgroup/freezer/slurm' already exists
[2020-11-10T11:21:38.068] [11.0] debug:  spank: opening plugin stack 
/etc/slurm/plugstack.conf
[2020-11-10T11:21:38.068] [11.0] debug:  mpi type = (null)
[2020-11-10T11:21:38.068] [11.0] debug:  using mpi/pmi2
[2020-11-10T11:21:38.068] [11.0] debug:  _setup_stepd_job_info: 
SLURM_STEP_RESV_PORTS not found in env
[2020-11-10T11:21:38.068] [11.0] debug:  mpi/pmi2: setup sockets
[2020-11-10T11:21:38.069] [11.0] debug:  mpi/pmi2: started agent thread
[2020-11-10T11:21:38.069] [11.0] error: connect io: No route to host
[2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route to host
[2020-11-10T11:21:38.069] [11.0] debug:  step_terminate_monitor_stop signaling 
condition
[2020-11-10T11:21:38.069] [11.0] error: job_manager exiting abnormally, rc = 
4021
[2020-11-10T11:21:38.069] [11.0] debug:  Sending launch resp rc=4021
[2020-11-10T11:21:38.069] [11.0] debug:  _send_srun_resp_msg: 0/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:38.169] [11.0] debug:  _send_srun_resp_msg: 1/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:38.370] [11.0] debug:  _send_srun_resp_msg: 2/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:38.770] [11.0] debug:  _send_srun_resp_msg: 3/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:39.570] [11.0] debug:  _send_srun_resp_msg: 4/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:40.370] [11.0] debug:  _send_srun_resp_msg: 5/5 failed to 
send msg type 6002: No route to host
[2020-11-10T11:21:40.372] [11.0] debug:  Message thread exited
[2020-11-10T11:21:40.372] [11.0] debug:  mpi/pmi2: agent thre

Re: [slurm-users] failed to send msg type 6002: No route to host

2020-11-12 Thread Patrick Bégou
Hi Marcus

thanks for getting in touch. I'm new to Slurm deployment and I don't
remember where I found that command for checking the Slurm setup.
SallocDefaultCommand is not defined in my slurm.conf file.

What is strange to me is that it works on the node hosting slurmctld,
and on the compute node too.

On the compute node, connected as root and then using "su - begou":

[root@tenibre-0-0 ~]# *su - begou*
Last login: Tue Nov 10 20:49:45 CET 2020 on pts/0
[begou@tenibre-0-0 ~]$ *sinfo*
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
equipment_typeC    up   infinite  1   idle tenibre-0-0
all*   up   infinite  1   idle tenibre-0-0
[begou@tenibre-0-0 ~]$ *squeue*
 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)
[begou@tenibre-0-0 ~]$ *salloc -n 1 srun hostname *
salloc: Granted job allocation 45
tenibre-0-0
salloc: Relinquishing job allocation 45
[begou@tenibre-0-0 ~]$

On the management node, connected as root and then using "su - begou"
(with no home directory available):

[root@management1 ~]# *su - begou*
Creating home directory for begou.
Last login: Thu Nov 12 12:43:47 CET 2020 on pts/1
su: warning: cannot change directory to /HA/sources/begou: No such
file or directory
[begou@management1 root]$ *sinfo*
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
equipment_typeC    up   infinite  1   idle tenibre-0-0
all*   up   infinite  1   idle tenibre-0-0
[begou@management1 root]$ *squeue*
 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)
[begou@management1 root]$ *salloc -n 1 srun hostname *
salloc: Granted job allocation 46
slurmstepd: error: couldn't chdir to `/root': Permission denied:
going to /tmp instead
tenibre-0-0
salloc: Relinquishing job allocation 46
[begou@management1 root]$

But it does not work on the login node, where I need it.


On 12/11/2020 at 14:05, Marcus Wagner wrote:
>
> for me at least, this is running as expected.
>
> I'm not sure, why you use "sh" as the command for salloc, I never saw
> that before. If you do not provide a command, the users default shell
> will be started if the "SallocDefaultCommand" is not set within
> slurm.conf
> So, what does
> $> salloc -n 1
> $> srun hostname
This command hangs.
>
> and what does
> $> salloc -n 1 srun hostname
>
This command hangs too, from the login node.
>
> Best
> Marcus
>
>
> P.S.:
>
> increase debugging might also help, e.g.
>
> $> srun -v hostname
>
Yes, I tried this but wasn't able to find any pertinent information. This is
what I get:


[begou@tenibre ~]$ *salloc -n 1 "srun  -v hostname"*
salloc: Granted job allocation 43
salloc: error: _fork_command: Unable to find command "srun  -v
hostname"
salloc: Relinquishing job allocation 43
[begou@tenibre ~]$ salloc -n 1 srun  -v hostname
salloc: Granted job allocation 44
srun: defined options
srun:  
srun: (null)  : tenibre-0-0
srun: jobid   : 44
srun: job-name    : srun
srun: nodes   : 1
srun: ntasks  : 1
srun: verbose : 5
srun:  
srun: end of defined options
srun: debug:  propagating RLIMIT_CPU=18446744073709551615
srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
srun: debug:  propagating RLIMIT_DATA=18446744073709551615
srun: debug:  propagating RLIMIT_STACK=8388608
srun: debug:  propagating RLIMIT_CORE=18446744073709551615
srun: debug:  propagating RLIMIT_RSS=18446744073709551615
srun: debug:  propagating RLIMIT_NPROC=512946
srun: debug:  propagating RLIMIT_NOFILE=1024
srun: debug:  propagating RLIMIT_MEMLOCK=65536
srun: debug:  propagating RLIMIT_AS=18446744073709551615
srun: debug:  propagating SLURM_PRIO_PROCESS=0
srun: debug:  propagating UMASK=0022
srun: debug2: srun PMI messages to port=44969
srun: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
srun: debug:  Munge authentication plugin loaded
srun: debug3: Success.
srun: jobid 44: nodes(1):`tenibre-0-0', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug:  requesting job 44, user 23455, nodes 1 including ((null))
srun: debug:  cpus 1, tasks 1, name hostname, relative 65534
srun: CpuBindType=(null type)
srun: debug:  Entering slurm_step_launch
srun: debug:  mpi type = (null)
srun: debug:  Using mpi/none
srun: debug:  Entering _msg_thr_create()
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 10
srun: debug3: eio_message_socket_readable: shutdown 0 fd 6
srun: debug:  initialized stdio listening socket, port 34531
srun: debug:  Started IO serve

[slurm-users] Slurm versions 20.02.6 and 19.05.8 are now available (CVE-2020-27745 and CVE-2020-27746)

2020-11-12 Thread Tim Wickberg
Slurm versions 20.11.0rc2, 20.02.6 and 19.05.8 are now available, and 
include a series of recent bug fixes, as well as fixes for two security 
issues.


Note: the 19.05 release series is nearing the end of its support 
lifecycle as we prepare to release 20.11 later this month. The 19.05.8 
download link is on the 'Older Versions' page.


SchedMD customers were informed on October 29th and provided patches on 
request; this process is documented in our security policy [1].


CVE-2020-27745:
A review of Slurm's RPC handling code uncovered a potential buffer 
overflow in one utility function. The only affected use is in Slurm's 
PMIx MPI plugin, and a job would only be vulnerable if --mpi=pmix was 
requested, or if the site has set MpiDefault=pmix in slurm.conf.
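
A quick way to check whether a site falls into the second case (a sketch using 
standard scontrol output; per-job use of --mpi=pmix must still be checked 
separately):

# Is pmix the cluster-wide default MPI plugin?
scontrol show config | grep -i MpiDefault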


CVE-2020-27746:
Slurm's use of the 'xauth' command to manage X11 magic cookies can lead 
to an inadvertent disclosure of a user's cookie when setting up X11 
forwarding on a node. An attacker monitoring /proc on the node could 
race the setup and steal the magic cookie, which may let them connect to 
that user's X11 session. A job would only be impacted if --x11 was 
requested at submission time. This was reported by Jonas Stare (NSC).


Downloads are available at https://www.schedmd.com/downloads.php .

Release notes follow below.

- Tim

[1] https://www.schedmd.com/security.php

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support


* Changes in Slurm 20.11.0rc2
==
 -- MySQL - Remove potential race condition when sending updates to a cluster
and commit_delay used.
 -- Fixed regression in rc1 where sinfo et al would not show a node in a resv
state.
 -- select/linear will now allocate up to a node's RealMemory when configured with
    SelectTypeParameters=CR_Memory and --mem=0 specified. Previous behavior was
    no memory accounted and no memory limits applied to the job.
 -- Remove unneeded lock check from running the slurmctld prolog for a job.
 -- Fix duplicate key error on clean starts after slurmctld is killed.
 -- Avoid double free of step_record_t in the slurmctld when node is removed
from config.
 -- Zero out step_record_t's magic when freed.
 -- Fix sacctmgr clearing QosLevel when trailing comma is used.
 -- slurmrestd - fix a fatal() error when connecting over IPv6.
 -- slurmrestd - add API to interface with slurmdbd.
 -- mpi/cray_shasta - fix PMI port parsing for non-contiguous port ranges.
 -- squeue and sinfo -O no longer repeat the last suffix specified.
 -- cons_tres - fix regression regarding gpus with --cpus-per-task.
 -- Avoid non-async-signal-safe function calls in X11 forwarding which can
    lead to the extern step terminating unexpectedly.
 -- Don't send job completion email for revoked federation jobs.
 -- Fix device or resource busy errors on cgroup cleanup on older kernels.
 -- Avoid binding to IPv6 wildcard address in slurmd if IPv6 is not explicitly
enabled.
 -- Make ntasks_per_gres work with cpus_per_task.
 -- Various alterations in reference to ntasks_per_tres.
 -- slurmrestd - multiple changes to make Slurm's OpenAPI spec compatible with
https://openapi-generator.tech/.
 -- nss_slurm - avoid loading slurm.conf to avoid issues on configless systems,
or systems with config files loaded on shared storage.
 -- scrontab - add cli_filter hooks.
 -- job_submit/lua - expose a "cron_job" flag to identify jobs submitted
through scrontab.
 -- PMIx - fix potential buffer overflows from use of unpackmem().
CVE-2020-27745.
 -- X11 forwarding - fix potential leak of the magic cookie when sent as an
argument to the xauth command. CVE-2020-27746.



* Changes in Slurm 20.02.6
==
 -- Fix sbcast --fanout option.
 -- Tighten up keyword matching for --dependency.
 -- Fix "squeue -S P" not sorting by partition name.
 -- Fix segfault in slurmctld if group resolution fails during job credential
creation.
 -- sacctmgr - Honor PreserveCaseUser when creating users with load command.
 -- Avoid attempting to schedule jobs on magnetic reservations when they aren't
allowed.
 -- Always make sure we clear the magnetic flag from a job.
 -- In backfill avoid NULL pointer dereference.
 -- Fix Segfault at end of slurmctld if you have a magnetic reservation and
you shutdown the slurmctld.
 -- Silence security warning when Slurm is testing a job against a
    magnetic reservation.
 -- Have sacct exit correctly when a user/group id isn't valid.
 -- Remove extra \n from invalid user/group id error message.
 -- Detect when extern steps trigger OOM events and mark extern step correctly.
 -- pam_slurm_adopt - permit root access to the node before reading the config
file, which will give root a chance to fix the config if missing or broken.
 -- Reset DefMemPerCPU, MaxMemPerCPU, and TaskPluginParam (among other minor
flags) on reconfigure.
 -- Fix incorrect memory handling of mail_user when updating mail_type=none.
 -- Hand

[slurm-users] can't lengthen my jobs log

2020-11-12 Thread john abignail
Hi,

My jobs database empties after about 1 day. "sacct -a" returns no results.
I've tried to lengthen that, but have been unsuccessful. I've tried adding
the following to slurmdbd.conf and restarting slurmdbd:
ArchiveJobs=yes
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
No job archives appear (in the default /tmp dir) either. What I'd like to
do is have the slurm database retain information on jobs for at least a few
weeks, writing out data beyond that threshold to files, but mainly I just
want to keep job data in the database for longer.

Regards,
John


Re: [slurm-users] can't lengthen my jobs log

2020-11-12 Thread Sebastian T Smith
Hi John,

Have you tried specifying a start time?  The default is 00:00:00 of the current 
day (depending on other options).  Example:

sacct -S 2020-11-01T00:00:00
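
A slightly fuller sketch (the two-week window and the field list are only 
examples):

# Show all users' jobs from the last 14 days with a few useful fields.
sacct -a -S $(date -d '14 days ago' +%F) -E now \
      -o JobID,JobName%20,User,Partition,State,Elapsed,End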

Our accounting database retains all job data from the epoch of our system.

Best,

Sebastian

--

Sebastian Smith
High-Performance Computing Engineer
Office of Information Technology
1664 North Virginia Street
MS 0291

work-phone: 775-682-5050
email: stsm...@unr.edu
website: http://rc.unr.edu


From: slurm-users  on behalf of john 
abignail 
Sent: Thursday, November 12, 2020 12:57 PM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] can't lengthen my jobs log

Hi,

My jobs database empties after about 1 day. "sacct -a" returns no results. I've 
tried to lengthen that, but have been unsuccessful. I've tried adding the 
following to slurmdbd.conf and restarting slurmdbd:
ArchiveJobs=yes
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
No job archives appear (in the default /tmp dir) either. What I'd like to do is 
have the slurm database retain information on jobs for at least a few weeks, 
writing out data beyond that threshold to files, but mainly I just want to keep 
job data in the database for longer.

Regards,
John


Re: [slurm-users] can't lengthen my jobs log

2020-11-12 Thread Erik Bryer
That worked pretty well: I got far more data than I ever have before. It only 
goes back about 18 days, though, and I'm not sure why. The slurmdbd.conf at that 
time contained no directives on retention, which is supposed to mean records are 
kept indefinitely. On another test cluster it shows records going back 2 days, 
which is about when I started fiddling with the settings. Could that have wiped 
the previous records, if they existed, or did my changes start the saving of 
older data? Still, this is progress.

Erik

From: slurm-users  on behalf of 
Sebastian T Smith 
Sent: Thursday, November 12, 2020 2:32 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] can't lengthen my jobs log

Hi John,

Have you tried specifying a start time?  The default is 00:00:00 of the current 
day (depending on other options).  Example:

sacct -S 2020-11-01T00:00:00

Our accounting database retains all job data from the epoch of our system.

Best,

Sebastian

--

Sebastian Smith
High-Performance Computing Engineer
Office of Information Technology
1664 North Virginia Street
MS 0291

work-phone: 775-682-5050
email: stsm...@unr.edu
website: http://rc.unr.edu


From: slurm-users  on behalf of john 
abignail 
Sent: Thursday, November 12, 2020 12:57 PM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] can't lengthen my jobs log

Hi,

My jobs database empties after about 1 day. "sacct -a" returns no results. I've 
tried to lengthen that, but have been unsuccessful. I've tried adding the 
following to slurmdbd.conf and restarting slurmdbd:
ArchiveJobs=yes
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
No job archives appear (in the default /tmp dir) either. What I'd like to do is 
have the slurm database retain information on jobs for at least a few weeks, 
writing out data beyond that threshold to files, but mainly I just want to keep 
job data in the database for longer.

Regards,
John