On 2/19/23 10:26 am, Scott Atchley wrote:
> Hi Chris,

Hi Scott!

> It looks like it tries to checkpoint application state without checkpointing the application or its libraries (including MPI). I am curious whether the checkpoint sizes are similar to, or significantly larger than, the application's typical outputs/checkpoints. If they are much larger, the time to write will be higher and they will stress capacity more.

Hmm, I'm not sure (my involvement is relatively peripheral), but I think we want to see this used with apps that have no existing C/R mechanism. If you ping me directly I can point you to people who know more about this than I do.

> We are looking at SCR for Frontier with the idea that users can store checkpoints on the node-local drives with replication to a buddy node. SCR will manage migrating non-defensive checkpoints to Lustre.

Interesting. Does it really need local storage, or can it be used on diskless systems via tricks with loopback filesystems, etc.?

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA