Hi, I would like to set up a queuing system for multiple users with limited resources. I have only 1 node with 48 CPUs to work with, so I am using select/cons_res as the select type. I have to use preemption because many jobs run across 3 partitions with different priorities. Preemption with gang,suspend works fine, but it limits the number of jobs I can keep suspended to a few, because suspended jobs stay resident in memory and the average job's memory consumption is pretty high.
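For reference, the setup described above corresponds roughly to the slurm.conf excerpt below. It is a sketch, not a copy of the actual file: the node name, the partition names and the PriorityTier values are placeholders, and the SelectTypeParameters value is an assumption.

```
# slurm.conf excerpt (sketch; node and partition names are hypothetical)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory   # assumption: cores and memory both tracked
PreemptType=preempt/partition_prio    # preempt based on partition priority
PreemptMode=SUSPEND,GANG              # the gang,suspend mode mentioned above

NodeName=node01 CPUs=48 State=UNKNOWN

# Three partitions on the same node; a higher PriorityTier preempts a lower one
PartitionName=low    Nodes=node01 PriorityTier=1 Default=YES
PartitionName=medium Nodes=node01 PriorityTier=2
PartitionName=high   Nodes=node01 PriorityTier=3
```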
I read about BLCR and DMTCP checkpointing and got the impression that they have a huge overhead and may not be quite ready yet. The jobs we run here have application-level checkpointing (Abaqus, ANSYS, etc.). I am wondering whether there is a way to incorporate application-level checkpoint/restart into Slurm's preemption features using bash scripts. Namely:

· A low-priority job would be checkpointed, cancelled and requeued.
· Once resources are available, it would be restarted from the checkpoint and allowed to run.
· After completion, all resulting files would be merged.

Suppose these bash scripts already exist (I know how to write them). The question is how to incorporate them into the Slurm scheduling mechanism.

Oytun Peksel
Eng Simulation & Digital Twins
Semcon Sweden AB