Re: [slurm-users] Change ExcNodeList on a running job

Ole Holm Nielsen Fri, 05 Jun 2020 00:44:50 -0700

Hi Geoffrey,

I'm just curious as to what causes a user to decide that a given node has an issue? If a node is healthy in all respects, why would a user decide not to use the node?

We can certainly perform all sorts of node health checks from Slurm by configuring the use of LBNL Node Health Check [1]. Items such as disk full, network interface down, memory removed, etc. can be checked for. Slurm will offline any node that fails an NHC check, and no jobs will be started on that node until the condition has been cleared.


I have some suggestions about NHC usage in my Slurm Wiki:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check

Best regards,
Ole

[1] https://github.com/mej/nhc


On 04-06-2020 23:37, Ransom, Geoffrey M. wrote:

Not quite.
The user’s job script in question is checking the error status of the program it ran while it is running. If a program fails, the running job wants to exclude the machine it is currently running on and requeue itself in case it died due to a local machine issue that the scheduler has not flagged as a problem.
The current goal is to have a running job step in an array job add the current host to its exclude list and requeue itself when it detects a problem. I can’t seem to modify the exclude list while a job is running, but once the task is requeued and back in the queue it is no longer running, so it can’t modify its own exclude list.
I.e., put something like the following into an sbatch script so each task can run it against itself.
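if ! $runprogram $args ; then
   # Append the current host to the existing exclude list
   NewExcNodeList="${ExcNodeList},${HOSTNAME}"
   # This update is the step that is rejected while the job is running
   scontrol update JobId=${SLURM_JOB_ID} ExcNodeList=${NewExcNodeList}
   scontrol requeue ${SLURM_JOB_ID}
   # Wait for the requeue to end this script
   sleep 10
fi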
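
The exclude-list construction in the script above can be sketched as a small helper, assuming bash. The function name append_exclude is hypothetical and not part of Slurm. It only handles a plain comma-separated list of node names (not bracketed ranges such as host[1-5]), and it does nothing about the open question of whether scontrol accepts the update on a running job.

```shell
# Hypothetical helper (not a Slurm command): append a host to a
# comma-separated exclude list, handling an empty list and duplicates.
# Assumes plain node names; bracketed ranges like host[1-5] are not parsed.
append_exclude() {
    local current="$1" host="$2"
    if [ -z "$current" ]; then
        # Empty list: avoid producing a leading comma
        printf '%s\n' "$host"
        return
    fi
    case ",$current," in
        *",$host,"*) printf '%s\n' "$current" ;;        # already excluded
        *)           printf '%s\n' "$current,$host" ;;  # append new host
    esac
}
```

In the script above it could be used as NewExcNodeList=$(append_exclude "$ExcNodeList" "$HOSTNAME"), which avoids a leading comma when ExcNodeList starts out empty.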
*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf Of* Rodrigo Santibáñez
*Sent:* Thursday, June 4, 2020 4:16 PM
*To:* Slurm User Community List <slurm-users@lists.schedmd.com>
*Subject:* [EXT] Re: [slurm-users] Change ExcNodeList on a running job
Hello,
Jobs can be requeued if something goes wrong, and the node with the failure excluded by the controller.
*--requeue*
Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the *--no-requeue* option. The /JobRequeue/ configuration parameter controls the default behavior on the cluster.
Also, jobs can be run selecting a specific node or excluding nodes:

*-w*, *--nodelist*=</node name list/>
Request a specific list of hosts. The job will contain /all/ of these hosts and possibly additional hosts as needed to satisfy resource requirements. The list may be specified as a comma-separated list of hosts, a range of hosts (host[1-5,7,...] for example), or a filename. The host list will be assumed to be a filename if it contains a "/" character. If you specify a minimum node or processor count larger than can be satisfied by the supplied host list, additional resources will be allocated on other nodes as needed. Duplicate node names in the list will be ignored. The order of the node names in the list is not important; the node names will be sorted by Slurm.
*-x*, *--exclude*=</node name list/>

Explicitly exclude certain nodes from the resources granted to the job.

Does this help?
On Thu, 4 Jun 2020 at 16:03, Ransom, Geoffrey M. (<geoffrey.ran...@jhuapl.edu>) wrote:
    Hello

        We are moving from Univa(sge) to slurm and one of our users has
    jobs that if they detect a failure on the current machine they add
    that machine to their exclude list and requeue themselves. The user
    wants to emulate that behavior in slurm.

    It seems like “scontrol update job ${SLURM_JOB_ID} ExcNodeList
    $NEWExcNodeList” won’t work on a running job, but it does work on a
    job pending in the queue. This means the job can’t do this step and
    requeue itself to avoid running on the same host as before.

    Our user wants his jobs to be able to exclude the current node and
    requeue itself.

    Is there some way to accomplish this in slurm?

    Is there a requeue counter of some sort so a job can see if it has
    requeued itself more than X times and give up?
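
For the requeue-counter question above, a minimal sketch, assuming bash and assuming the job environment provides SLURM_RESTART_COUNT (documented in the sbatch man page as the number of times a job has been restarted or requeued; worth verifying on the installed Slurm version). The function name may_requeue and the limit of 3 are hypothetical.

```shell
# Hypothetical helper: succeed (exit status 0) only while the job has been
# restarted fewer than "max" times. SLURM_RESTART_COUNT is treated as 0
# when unset, which is assumed to be the case on a job's first run.
may_requeue() {
    local max="$1"
    local count="${SLURM_RESTART_COUNT:-0}"
    [ "$count" -lt "$max" ]
}
```

A job script could then guard its requeue with something like: if may_requeue 3; then scontrol requeue "$SLURM_JOB_ID"; else exit 1; fi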
