Hi,

I would like to set up a queuing system for multiple users with limited 
resources. I'll have only 1 node and 48 CPUs to work with, so I am using 
select/cons_res as the select type.
I have to use preemption because there are many jobs running in 3 different 
partitions with different priorities. The gang,suspend preemption works fine; 
however, it limits the number of suspended jobs to a few, because suspended 
jobs stay in memory and the average job memory consumption is pretty high.

I read about BLCR and DMTCP checkpointing and got the impression that they 
have a huge overhead and may not be quite ready yet.

The jobs we run here have application-level checkpointing (Abaqus, Ansys, 
etc.). I am wondering whether there is a way to incorporate application-level 
checkpoint/restart into Slurm's preemption features using bash scripts, 
namely:

·        A low-priority job is checkpointed, canceled and requeued.

·        Once resources are available, it is restarted and allowed to run.

·        After completion, all resulting files are merged.

Suppose these bash scripts already exist (I know how to write them).

The question is how to incorporate them into the Slurm scheduling mechanism.
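
To make the three steps above concrete, here is a minimal sketch of what such a job script could look like. It rests on assumptions that are not confirmed anywhere in this post: that the partition uses a requeue- or cancel-style PreemptMode with a non-zero GraceTime, so that the batch shell receives SIGTERM some time before SIGKILL, and that the checkpoint command blocks until the checkpoint is written. SOLVER_CMD, RESTART_CMD, CHECKPOINT_CMD, MERGE_CMD and REQUEUE_CMD are hypothetical placeholders for the site's own scripts. If Slurm already requeues the preempted job by itself, the explicit `scontrol requeue` call is probably redundant.

```shell
#!/bin/bash
#SBATCH --requeue            # allow this job to be requeued
#SBATCH --open-mode=append   # append to the same log after a restart

# Hypothetical placeholders, not from the original post: point these at the
# site's own solver, checkpoint, restart and merge scripts.
: "${SOLVER_CMD:=sleep 600}"                       # fresh start
: "${RESTART_CMD:=sleep 600}"                      # resume from last checkpoint
: "${CHECKPOINT_CMD:=touch checkpoint.requested}"  # ask the solver to checkpoint
: "${MERGE_CMD:=true}"                             # merge the result files
: "${REQUEUE_CMD:=scontrol requeue ${SLURM_JOB_ID:-}}"

solver_pid=""

# Slurm sets SLURM_RESTART_COUNT once a job has been requeued and restarted.
pick_command() {
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
        echo "$RESTART_CMD"
    else
        echo "$SOLVER_CMD"
    fi
}

# Runs when the batch shell receives SIGTERM ahead of preemption.
on_preempt() {
    $CHECKPOINT_CMD                      # assumed to block until written
    if [ -n "$solver_pid" ]; then
        kill -TERM "$solver_pid" 2>/dev/null
        wait "$solver_pid" 2>/dev/null
    fi
    $REQUEUE_CMD                         # put the job back in the queue
    exit 0
}

main() {
    trap on_preempt TERM
    # Start the solver in the background: the shell only runs a trap handler
    # once the foreground command returns, but `wait` is interruptible.
    # The command string is left unquoted on purpose so it splits into words.
    $(pick_command) &
    solver_pid=$!
    # Merge only if the solver finished normally.
    wait "$solver_pid" && $MERGE_CMD
}

# Only run inside a Slurm job, so the file can be sourced and tested elsewhere.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    main
fi
```

The script would be submitted with sbatch as usual. Whether the signal actually reaches the batch shell, and how long the grace period needs to be for a large checkpoint, depends on the Slurm version and configuration and would have to be verified on the real system.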



Oytun Peksel
Eng
Simulation & Digital Twins
Semcon Sweden AB





