Hi Alexander:

This is a great case for using Node Health Check (https://github.com/mej/nhc).  
We use it so that each node periodically runs an admin-selected set of tests 
(e.g. "is /work readable?").  NHC automatically drains a node that fails any of 
them, records the failure in the node's Reason attribute, and can optionally 
resume the node after a later successful test run.  We use NHC in this way to 
protect jobs from starting when there's filesystem trouble.  Jobs retain their 
priority and pend properly until the nodes report that the filesystem is 
available again.
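
In case it helps, here is a minimal sketch of what that could look like.  The 
check lines and the slurm.conf hooks are assumptions about a typical setup 
(mount points, NHC install path, and interval will differ per site), not our 
exact config:

    # nhc.conf -- fail the node if /work or /scratch is not mounted read-write
    * || check_fs_mount_rw -f /work
    * || check_fs_mount_rw -f /scratch

    # slurm.conf -- have slurmd run NHC periodically (path and interval assumed)
    HealthCheckProgram=/usr/sbin/nhc
    HealthCheckInterval=300
    HealthCheckNodeState=ANY

When a check fails, NHC drains the node with the failure text as the Reason; 
once the check passes again it can online the node it previously offlined, if 
you configure it that way rather than leaving nodes drained for manual review.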

As another option, I think you could use Slurm 'licenses' to control dispatch 
to nodes depending on filesystem availability.  For example, give the cluster 
99000 of a license called e.g. 'scratch_lic', and use job_submit.lua (or the 
users themselves) to make scratch-requesting jobs request a scratch_lic as 
well.  You won't need to care how many scratch_lic are available as long as you 
start with a number much larger than your maximum possible concurrent job 
count.  All you need to do is set the count to either zero or that large 
number, with 'scontrol', whenever you want to disable or enable the launching 
of the filesystem-using subset of jobs.  That could be automated if you had a 
reliable test you could run outside of Slurm, which ran 'scontrol' as needed.  
You wouldn't have to change any node parameters, and no submissions would be 
rejected based on filesystem availability, since licenses can't affect job 
submission, only dispatch.  A rough sketch of the pieces is below.
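
The following is untested; the license name and count come from the example 
above, and the feature-string test in the Lua is a simplification (real feature 
expressions can contain operators, so a plain substring match is only for 
illustration):

    # slurm.conf -- define a large pool of a local license
    Licenses=scratch_lic:99000

    -- job_submit.lua -- add the license to jobs whose feature string
    -- mentions "scratch" (crude substring match, for illustration only)
    function slurm_job_submit(job_desc, part_list, submit_uid)
        if job_desc.features ~= nil and string.find(job_desc.features, "scratch") then
            if job_desc.licenses == nil or job_desc.licenses == "" then
                job_desc.licenses = "scratch_lic"
            else
                job_desc.licenses = job_desc.licenses .. ",scratch_lic"
            end
        end
        return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end

To pause dispatch of those jobs you would drop the scratch_lic count to zero 
and later restore it; depending on your Slurm version and whether the license 
is local or tracked in the accounting database, that means either editing the 
Licenses line and reconfiguring, or adjusting the resource count there.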

I'm sure there are other possible solutions.  I haven't thought about this 
further, since I've been happily using NHC for a long time.

==
Paul Brunk, system administrator
Georgia Advanced Resource Computing Center
Enterprise IT Svcs, the University of Georgia


On 2/1/22, 4:59 AM, "slurm-users" <slurm-users-boun...@lists.schedmd.com> wrote:
I hope someone out there has some experience with "ActiveFeatures" and
"AvailableFeatures" in the node configuration and can give some advice.

We have configured 4 nodes with certain features, e.g.

"NodeName=thin1 Arch=x86_64 CoresPerSocket=24
    CPUAlloc=0 CPUTot=96 CPULoad=44.98
    AvailableFeatures=work,scratch
    ActiveFeatures=work,scratch

..."

The features correspond to mounted filesystems. Now we are going to take
one filesystem (work) away for maintenance, so we wanted to remove the
feature from the nodes. We tried, e.g.,

# scontrol update node=thin1 ActiveFeatures="scratch"

resulting in

"NodeName=thin1 Arch=x86_64 CoresPerSocket=24
    CPUAlloc=0 CPUTot=96 CPULoad=44.98
    AvailableFeatures=work,scratch
    ActiveFeatures=scratch

..."

The problem now is that no jobs requesting the feature "work" can be
SUBMITTED; the error we get is

"sbatch: error: Batch job submission failed: Requested node
configuration is not available"


Does this make sense? We want our users to be able to submit jobs requesting
features that are generally available, because maintenance periods usually
don't last long and, given our rather long queuing times, users want to
submit jobs now for when the feature becomes available again.
I understand that jobs might be rejected when the feature is not
available at all, but why when it is merely not active?! Furthermore, 4-node
jobs also get rejected at submission when the feature is active on only
3 nodes. Is this a bug? Wouldn't it make more sense for the job to just
sit in the queue waiting for the features/resources to be activated again?

Maybe someone has an idea how to handle this problem?

Thanks,

Alexander