Hi,

On Dec 4, 2009, at 3:34, Chris Samuel wrote:

> How does it deal with pinned DMA memory on NICs?


What we did in Platform (Scali) MPI was to drain the HPC interconnect and then close it down. The problem was thereby reduced to checkpointing N processes (e.g. using BLCR). Both continuing after a checkpoint and restarting from one re-open the HPC fabric, which could be on another physical medium: you could take the checkpoint on IB and restart using GbE.
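
To make the ordering concrete, here is a rough Python sketch of the drain / close / checkpoint / re-open sequence. It is only an illustration: the Fabric class, checkpoint_cycle() and save_process_state are hypothetical names invented for this sketch, not the Platform (Scali) MPI or BLCR interfaces, and the "checkpoint" is just a callable standing in for whatever tool saves the process image.

```python
# Hypothetical sketch: none of these names are the Platform (Scali) MPI
# or BLCR API. They only illustrate the order of the steps.
from dataclasses import dataclass, field


@dataclass
class Fabric:
    """Toy stand-in for one process's view of the HPC interconnect."""
    medium: str                                    # e.g. "IB" or "GbE"
    is_open: bool = True
    in_flight: list = field(default_factory=list)  # messages not yet delivered

    def drain(self):
        """Deliver every outstanding message and return what was delivered."""
        delivered = list(self.in_flight)
        self.in_flight.clear()
        return delivered

    def close(self):
        """Shut the fabric down; refuse if it has not been drained first."""
        if self.in_flight:
            raise RuntimeError("fabric must be drained before closing")
        self.is_open = False


def checkpoint_cycle(fabric, save_process_state, restart_medium=None):
    """Drain, close, checkpoint, then re-open (optionally on another medium)."""
    delivered = fabric.drain()
    fabric.close()
    # With the fabric closed this is an ordinary process checkpoint,
    # the step a tool such as BLCR would perform.
    image = save_process_state()
    # Continue and restart both re-open the fabric; the medium may differ.
    new_fabric = Fabric(medium=restart_medium or fabric.medium)
    return image, delivered, new_fabric


ib = Fabric(medium="IB", in_flight=["msg-1", "msg-2"])
image, delivered, reopened = checkpoint_cycle(
    ib, save_process_state=lambda: {"rank": 0}, restart_medium="GbE"
)
print(reopened.medium, ib.is_open, delivered)
```

The point of the ordering is that the checkpoint step never sees an open fabric, so the saved image carries no interconnect state and can be re-attached to whatever medium is available afterwards.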

Combined with interconnect-agnostic support, this feature allows you, in the case of a failing IB HCA (or a failing switch port or cable), to restart from the last checkpoint with M-1 nodes communicating with the other M-2 IB-capable nodes over IB, and the last node communicating with those M-1 nodes over GbE.
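
As a small illustration of that mixed configuration, the per-pair choice could look like the Python sketch below. The rule "IB only if both endpoints have a working IB path, otherwise GbE" is my simplification for the example, and the node names and function names are made up; it is not how the product actually selects a transport.

```python
# Hypothetical sketch; the node names and the selection rule are assumptions.
def pick_transport(ib_ok_a, ib_ok_b):
    """Use IB only when both endpoints have a working IB path, else GbE."""
    return "IB" if ib_ok_a and ib_ok_b else "GbE"


def transport_table(ib_ok):
    """Map every unordered node pair to the transport it would use."""
    nodes = sorted(ib_ok)
    return {
        (a, b): pick_transport(ib_ok[a], ib_ok[b])
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
    }


# M = 4 nodes; node3 has lost its HCA (or its switch port or cable).
ib_ok = {"node0": True, "node1": True, "node2": True, "node3": False}
table = transport_table(ib_ok)
for pair, transport in table.items():
    print(pair, transport)
```

With M = 4, the three healthy nodes keep IB among themselves, and only the three pairs involving node3 fall back to GbE.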

Traditional checkpointing requires, in the general case, a snapshot of the file system (and a restore of the correct snapshot at restart), whereas checkpoint-and-kill (for migration or preemptive batch scheduling) does not require integration with file systems.


Håkon




_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
