I'm just curious as to what causes a user to decide that a given node has 
an issue? 
     If a node is healthy in all respects, why would a user decide not to use 
the node?

Not enough free TMPDIR space, a GPU that starts throwing memory errors, or a 
machine with a temporary issue that the Slurm health checks are not tracking at 
the time, so the node can black-hole jobs.

But honestly, this is less about dealing with actual technical problems and 
more about keeping users happy as we help port their existing Univa jobs to 
Slurm. We have a user whose run script adds the local node to its exclude list 
and requeues itself up to 5 times if it decides that the program it launched is 
failing because of a machine issue. I could emulate this behavior easily if a 
running job could update its own ExcNodeList and then requeue itself. I can 
have a job requeue itself (with a sleep after the scontrol command, since the 
requeue is not instant), but Slurm does not seem to let me update ExcNodeList 
on a running job.
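For concreteness, here is a rough sketch (not the user's actual script) of the 
wrapper logic I am trying to emulate. It assumes a site-provided 
node_is_healthy check, which is hypothetical; scontrol, SLURM_RESTART_COUNT, 
SLURMD_NODENAME, and SLURM_JOB_ID are standard Slurm pieces. Whether the 
ExcNodeList update is accepted while the job is running is exactly the part 
that does not seem to work:

```shell
#!/bin/bash
# Sketch of a Univa-style retry wrapper for a Slurm batch job.
# node_is_healthy is an assumed site-local health check (nonzero = bad node).

retry_if_node_bad() {
    "$@"                        # launch the real program
    local status=$?
    if [ $status -ne 0 ] && ! node_is_healthy; then
        # Give up after 5 requeues, mirroring the original script's limit.
        # SLURM_RESTART_COUNT is incremented by Slurm on each requeue.
        if [ "${SLURM_RESTART_COUNT:-0}" -lt 5 ]; then
            # Desired but apparently rejected on a running job:
            scontrol update JobId="$SLURM_JOB_ID" \
                     ExcNodeList="$SLURMD_NODENAME"
            scontrol requeue "$SLURM_JOB_ID"
            sleep 60            # requeue is not instant; keep the script alive
        fi
    fi
    return $status
}

# In the batch script you would call something like:
#   retry_if_node_bad ./my_program arg1 arg2
```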

Thanks for your suggestions.
