Re: [Beowulf] New member, upgrading our existing Beowulf cluster

Håkon Bugge Thu, 03 Dec 2009 23:34:38 -0800

Hi,


On Dec 4, 2009, at 3:34, Chris Samuel wrote:


How does it deal with pinned DMA memory on NICs ?

What we did in Platform (Scali) MPI was to drain the HPC interconnect and then close it down. The problem was then reduced to checkpointing N processes (e.g. using BLCR). Continuing from a checkpoint and restarting from it would both re-open the HPC fabric (which could be on another physical medium, though). You could take the checkpoint on IB and restart using GbE.
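
To make the ordering concrete, here is a minimal sketch of that sequence. It is only an illustration under stated assumptions: Fabric, take_checkpoint, reopen and save_process are hypothetical names, not the Platform (Scali) MPI or BLCR interfaces, and the per-process checkpoint is just a callback.

```python
class Fabric:
    """Hypothetical stand-in for an HPC interconnect (not a real MPI or BLCR API)."""

    def __init__(self, medium):
        self.medium = medium      # e.g. "ib" or "gbe"
        self.is_open = True
        self.in_flight = []       # messages sent but not yet delivered

    def drain(self):
        """Deliver everything still in flight so no state is left in the NIC."""
        delivered = list(self.in_flight)
        self.in_flight.clear()
        return delivered

    def close(self):
        if self.in_flight:
            raise RuntimeError("fabric must be drained before it is closed")
        self.is_open = False


def take_checkpoint(fabric, processes, save_process):
    """Drain, close, then checkpoint each of the N processes on its own."""
    fabric.drain()
    fabric.close()
    # save_process is an assumed callback standing in for a per-process
    # checkpointer such as BLCR.
    return [save_process(p) for p in processes]


def reopen(medium):
    """Continuing or restarting both re-open a fabric, possibly on another medium."""
    return Fabric(medium)
```

Once the fabric is drained and closed, nothing in the saved images depends on the interconnect, so reopen("gbe") after a checkpoint taken with Fabric("ib") mirrors the IB-to-GbE restart described above.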

Combined with agnostic interconnect support, this feature allows you, in the case of a failing IB HCA (or a failing switch port or cable), to restart from the last checkpoint, running M-1 nodes communicating with the other M-2 IB-capable nodes using IB, and the last node communicating with the M-1 nodes using GbE.
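
The per-pair choice of medium after such a restart can be sketched roughly as follows. This is an assumption about how one might model it, not how Platform MPI implements it; pick_transport, transport_map and the node names are made up for the example.

```python
def pick_transport(a_has_ib, b_has_ib):
    """Use IB only when both endpoints still have a working IB path."""
    return "ib" if a_has_ib and b_has_ib else "gbe"


def transport_map(ib_ok):
    """Map every unordered node pair to the medium it would use.

    ib_ok is an assumed dict of node name -> bool (True if the node's
    HCA, switch port and cable are all healthy).
    """
    nodes = sorted(ib_ok)
    return {
        (a, b): pick_transport(ib_ok[a], ib_ok[b])
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
    }


# Three nodes (M = 3), with the HCA on n2 having failed.
links = transport_map({"n0": True, "n1": True, "n2": False})
```

Here the pair (n0, n1) stays on IB, while every pair involving n2 falls back to GbE.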

Traditional checkpointing requires a snapshot of the file system in the general case (and restoring the correct snapshot at restart), whereas checkpoint-and-kill (for migration or preemptive batch scheduling) does not require integration with file systems.



Håkon




