On 2/19/23 10:26 am, Scott Atchley wrote:

Hi Chris,

Hi Scott!

It looks like it tries to checkpoint application state without checkpointing the application or its libraries (including MPI). I am curious whether the checkpoint sizes are similar to, or significantly larger than, the application's typical outputs/checkpoints. If they are much larger, they will take longer to write and will stress capacity more.

Hmm, I'm not sure (my involvement is relatively peripheral), but I think we want to see this used with apps that have no existing C/R mechanism. If you ping me directly I can point you to people who will know more than I do about this.

We are looking at SCR for Frontier with the idea that users can store checkpoints on the node-local drives with replication to a buddy node. SCR will manage migrating non-defensive checkpoints to Lustre.

Interesting. Does it really need local storage, or can it be used with diskless systems via tricks with loopback filesystems, etc.?

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
