[slurm-users] The 8 second default for sbatch's --get-user-env: is it "the only default"

2022-03-15 Thread Kevin Buckley

We have a group of users who occasionally report seeing jobs start without,
for example, $HOME being set.

Looking at the slurmd logs (info level) on our 20.11.8 node shows the first
instance of an afflicted JobID appearing as

[2022-03-11T00:19:35.117] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 

but then notes Slurm becoming aware that it couldn't get a user environment at

[2022-03-11T00:21:36.254] error: Failed to load current user environment variables
[2022-03-11T00:21:36.254] error: _get_user_env: Unable to get user's local environment, running only with passed environment
[2022-03-11T00:21:36.254] Launching batch job  for UID 

so that's 2 minutes.

I'm not aware of "us", i.e. us on the system's side, nor the users in question,
overriding what the sbatch man page says is the

  --get-user-env[=timeout][mode]

timeout default of 8 seconds, anywhere.

Is it possible that, if the sbatch option is invoked at all, there's a "fallback"
timeout value that gets "inherited" into what then appears to be the
option-specific timeout? Although, even then, the only 120 seconds we have in
the config is:

SlurmctldTimeout= 120 sec

and I'm thinking that it's the job on the node, so under the control of the
slurmd (for which the timeout is 300 sec), and not the slurmctld, that's
waiting for the user env?
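
For what it's worth, the quickest sanity check I can think of, just to see
the timeout-related values the daemons are actually running with, is

  scontrol show config | grep -i timeout

nothing clever, just dumping the running config, in case something is
lurking in there that I haven't thought of.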

I'd like to suggest that the afflicted members of our user community try using a

--get-user-env=timeout

with a "larger" figure, just to be on the safe side, but my "8 seconds" vs
"2 minutes" observation has got me wondering where, in time, a "safe side"
might need to start, or whether I am missing something else entirely.
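
What I'd have them try is something like the following, where the 120 is a
purely illustrative figure and job.sh is just a placeholder for their batch
script:

  # in the batch script
  #SBATCH --get-user-env=120L

  # or, equivalently, on the command line
  sbatch --get-user-env=120L job.sh

with the trailing "L" being the man page's "long" mode which, as I read it,
builds the environment as if from a login shell.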

As usual, any clues/pointers welcome,
Kevin

--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre



Re: [slurm-users] job requesting licenses would not be scheduled as expected

2022-03-15 Thread Brian Andrus

Depending on other variables, it is fine.

The 7 license job cannot run because there are only 5 available, so that 
one has to wait.

Since there are 5 available, the 1 license job can run, so it does.

That is the simple view. Other variables such as job time could affect that.
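
To put numbers on it (the -L/--licenses flag is the standard way to request
licenses; the script name here is only a placeholder):

  sbatch -n 1 -L fluent:35 job.sh   # runs; 40 - 35 = 5 licenses left
  sbatch -n 1 -L fluent:7  job.sh   # pends; needs 7, only 5 free
  sbatch -n 1 -L fluent:1  job.sh   # runs; needs 1, 5 free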

Brian Andrus

On 3/14/2022 11:52 PM, 刘文晓 wrote:

Hi there:
A job requesting licenses is not scheduled as expected.
In my local environment (Slurm 19.05), I have 2 computing nodes (2 CPUs
per node) and 40 fluent licenses.

Both of these configurations give the same result:
  Licenses=fluent:40
  SchedulerType=sched/backfill
  PriorityType=priority/multifactor
or
  Licenses=fluent:40
  SchedulerType=sched/builtin
  PriorityType=priority/basic

The test steps are below:
1. submit a job requesting 35 licenses and 1 CPU; it runs;
2. submit a job requesting 7 licenses and 1 CPU; it is pending (Licenses),
with a higher priority than step 3's job;
3. submit a job requesting 1 license and 1 CPU; it runs.

In my view, this is not right. The 7-license job, with its higher priority,
should run before the 1-license job.

Is my understanding right?

thanks






Re: [slurm-users] job requesting licenses would not be scheduled as expected

2022-03-15 Thread Williams, Gareth (IM&T, Black Mountain)

This is what I would expect. It doesn't matter that the 7-license job has
higher priority, as it is ineligible to run due to the lack of licenses. The
scheduler moves on and prioritises (and starts) the next eligible job.

I will suggest a workaround.  You could periodically run a script that looks 
for jobs requesting licenses (running and queued) and have the script make 
decisions and take action. I think the only easy action that is relevant in 
this case is to hold or release jobs. You might want to submit jobs with a hold 
(or do that via a submit plugin) to avoid newly submitted jobs sneaking past 
held jobs before your script takes action.
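
Very roughly, and completely untested, the sort of thing I have in mind
(the license name and total are taken from your setup, the squeue field
names come from its --Format option, and the hold-reason string is my
assumption about what your squeue reports):

  #!/bin/bash
  # Release user-held jobs, highest priority first, while enough
  # fluent licenses remain free for each one.
  LIC=fluent
  TOTAL=40

  # Licenses consumed by currently running jobs.
  used=$(squeue -h -t RUNNING -O "licenses:30" \
         | grep -o "${LIC}:[0-9]*" | awk -F: '{s += $2} END {print s+0}')
  free=$(( TOTAL - used ))

  # Walk pending jobs in descending priority order and release each
  # user-held one that still fits within the free license count.
  squeue -h -t PENDING -O "jobid:12,reason:20,licenses:30" -S "-p" |
  while read -r jobid reason lics; do
      [ "$reason" = "JobHeldUser" ] || continue   # only touch held jobs
      want=$(echo "$lics" | grep -o "${LIC}:[0-9]*" | cut -d: -f2)
      want=${want:-0}
      if [ "$want" -le "$free" ]; then
          scontrol release "$jobid"
          free=$(( free - want ))
      fi
  done

Run from cron (or a systemd timer) every minute or so, paired with
submitting the license jobs held, as above.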

By the way, if only cores and nodes were being considered, the highest-priority
job would block out the resources it needs so that it can run as soon as
possible, and lower-priority jobs could backfill around it. It would be
complicated to make backfill consider licenses too, and I doubt it has been or
will be done - but I've not actually checked the code or documentation.

Gareth

From: slurm-users  On Behalf Of 刘文晓
Sent: Tuesday, 15 March 2022 5:52 PM
To: slurm-us...@schedmd.com
Subject: [slurm-users] job requesting licenses would not be scheduled as expected

Hi there:
A job requesting licenses is not scheduled as expected.
In my local environment (Slurm 19.05), I have 2 computing nodes (2 CPUs per node)
and 40 fluent licenses.
Both of these configurations give the same result:
  Licenses=fluent:40
  SchedulerType=sched/backfill
  PriorityType=priority/multifactor
or
  Licenses=fluent:40
  SchedulerType=sched/builtin
  PriorityType=priority/basic

The test steps are below:
1. submit a job requesting 35 licenses and 1 CPU; it runs;
2. submit a job requesting 7 licenses and 1 CPU; it is pending (Licenses),
with a higher priority than step 3's job;
3. submit a job requesting 1 license and 1 CPU; it runs.

In my view, this is not right. The 7-license job, with its higher priority,
should run before the 1-license job.
Is my understanding right?

thanks