[slurm-users] The 8 second default for sbatch's --get-user-env: is it "the only default"?
We have a group of users who occasionally report seeing jobs start without, for example, $HOME being set.

Looking at the slurmd logs (info level) on our 20.11.8 node, the first instance of an afflicted JobID appears as:

  [2022-03-11T00:19:35.117] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request:

but Slurm then notes that it couldn't get a user environment at:

  [2022-03-11T00:21:36.254] error: Failed to load current user environment variables
  [2022-03-11T00:21:36.254] error: _get_user_env: Unable to get user's local environment, running only with passed environment
  [2022-03-11T00:21:36.254] Launching batch job for UID

so that's 2 minutes.

I'm not aware of "us", i.e. us on the system's side, nor the users in question, overriding anywhere what the sbatch man page says is the --get-user-env[=timeout][mode] timeout default of 8 seconds.

Is it possible that, if the sbatch option is invoked at all, there's a "fallback" timeout value that gets "inherited" into what then appears to be the option-specific timeout? Even then, the only 120 seconds we have in the config is:

  SlurmctldTimeout = 120 sec

and I'm thinking that it's the job on the node, so under the control of the slurmd (for which the timeout is 300 sec), and not the slurmctld, that's waiting for the user environment?

I'd like to suggest that the afflicted members of our user community try using --get-user-env=timeout with a "larger" figure (something like the sketch below), just to be on the safe side, but my "8 seconds" vs "2 minutes" observation has got me wondering where, in time, a "safe side" might need to start, or whether I am missing something else entirely.

As usual, any clues/pointers welcome,

Kevin
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
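For concreteness, a minimal sketch of that suggestion, assuming a placeholder batch script job.sh and the timeout/mode syntax described in the sbatch man page (an optional timeout in seconds, default 8, optionally followed by S or L to control how the user's shell is invoked for the environment capture):

  # Allow up to 120 seconds for the environment capture, login-style ("L") shell.
  sbatch --get-user-env=120L job.sh

  # Or raise only the timeout and keep the default mode.
  sbatch --get-user-env=120 job.sh

Whether 120 seconds is actually "safe" on a given system is exactly the open question above; the figure here is only illustrative.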
Re: [slurm-users] job requesting licenses would not be scheduled as expected
Depending on other variables, it is fine. The 7-license job cannot run because only 5 licenses are available, so that one has to wait. Since 5 are available, the 1-license job can run, so it does. That is the simple view (sketched in command-line terms after the quoted message below); other variables such as job time limits could affect it.

Brian Andrus

On 3/14/2022 11:52 PM, 刘文晓 wrote:

Hi there,

Jobs requesting licenses are not being scheduled as expected.

In my local environment (Slurm 19.05) I have 2 compute nodes (2 CPUs per node) and 40 fluent licenses. Both of the following configurations give the same result:

  Licenses=fluent:40
  SchedulerType=sched/backfill
  PriorityType=priority/multifactor

or

  Licenses=fluent:40
  SchedulerType=sched/builtin
  PriorityType=priority/basic

The test steps are below:

1. Submit a job requesting 35 licenses and 1 CPU; it runs.
2. Submit a job requesting 7 licenses and 1 CPU; it pends (Licenses) with a higher priority than step 3's job.
3. Submit a job requesting 1 license and 1 CPU; it runs.

In my view this is not right: the 7-license job with the higher priority should run before the 1-license job. Is my understanding correct?

Thanks
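To put that arithmetic in command-line terms, a hedged sketch of the reproduction, where fluent.sh is only a placeholder job script and -L/--licenses is sbatch's license-request option:

  sbatch -n 1 -L fluent:35 fluent.sh   # runs: 40 total - 35 = 5 licenses free
  sbatch -n 1 -L fluent:7  fluent.sh   # pends (Licenses): 7 > 5 free
  sbatch -n 1 -L fluent:1  fluent.sh   # runs: 1 <= 5 free

  # If your Slurm version supports it, this reports the configured/used/free counts:
  scontrol show licenses

Only when the 35-license job finishes and returns its licenses does the 7-license job become eligible again.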
Re: [slurm-users] job requesting licenses would not be scheduled as expected
This is what I would expect. It doesn't matter that the 7-license job has higher priority, as it is ineligible to run due to the lack of licenses. The scheduler moves on and prioritises (and starts) the next eligible job.

I will suggest a workaround. You could periodically run a script that looks for jobs requesting licenses (running and queued) and have the script make decisions and take action; see the sketch at the end of this message. I think the only easy action that is relevant in this case is to hold or release jobs. You might want to submit jobs with a hold (or do that via a submit plugin) to avoid newly submitted jobs sneaking past held jobs before your script takes action.

By the way, where only cores and nodes are considered, the highest-priority job would block out resources so it can run as soon as possible, and lower-priority jobs could backfill around it. It would be complicated to make backfill consider licenses too, and I doubt it has been or will be done - but I've not actually checked the code or documentation.

Gareth

From: slurm-users On Behalf Of 刘文晓
Sent: Tuesday, 15 March 2022 5:52 PM
To: slurm-us...@schedmd.com
Subject: [slurm-users] job requesting licenses would not be scheduled as expected

Hi there,

Jobs requesting licenses are not being scheduled as expected.

In my local environment (Slurm 19.05) I have 2 compute nodes (2 CPUs per node) and 40 fluent licenses. Both of the following configurations give the same result:

  Licenses=fluent:40
  SchedulerType=sched/backfill
  PriorityType=priority/multifactor

or

  Licenses=fluent:40
  SchedulerType=sched/builtin
  PriorityType=priority/basic

The test steps are below:

1. Submit a job requesting 35 licenses and 1 CPU; it runs.
2. Submit a job requesting 7 licenses and 1 CPU; it pends (Licenses) with a higher priority than step 3's job.
3. Submit a job requesting 1 license and 1 CPU; it runs.

In my view this is not right: the 7-license job with the higher priority should run before the 1-license job. Is my understanding correct?

Thanks
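The workaround script mentioned above might look roughly like the following. This is a sketch only, not a tested implementation: it assumes a single local license called "fluent" with 40 seats (as in the original post), that license-using jobs are submitted held (e.g. sbatch --hold), and that the squeue --Format field names and the "JobHeldUser" pending reason match your Slurm version.

  #!/bin/bash
  # Periodically release held jobs whose fluent license request still fits
  # into the licenses not already consumed by running jobs.

  TOTAL=40

  # Sum the fluent licenses held by currently running jobs.
  used=$(squeue --noheader --states=RUNNING --Format="licenses:30" \
           | grep -o 'fluent:[0-9]*' | cut -d: -f2 | paste -sd+ - | bc)
  used=${used:-0}
  free=$(( TOTAL - used ))

  # Walk the user-held pending jobs in JobID order and release those that fit.
  squeue --noheader --states=PENDING --sort=i \
         --Format="jobid:12,reason:14,licenses:30" \
    | awk '$2 == "JobHeldUser"' \
    | while read -r jobid reason lic; do
          need=$(grep -o 'fluent:[0-9]*' <<< "$lic" | cut -d: -f2)
          need=${need:-0}
          if (( need <= free )); then
              scontrol release "$jobid"
              free=$(( free - need ))
          fi
      done

Run from cron, this approximates the hold/release policy described above; whether to release strictly in priority order, skip jobs without licenses, or reserve capacity for the biggest waiting request are policy choices a real script would have to make.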