John's post is really funny! But I would only endorse Gavin's recommendation, because it solves the problem statistically (and correctly).

Justin

On Wed, Oct 26, 2016 at 12:07 AM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> On 26/10/16 14:45, John Hanks wrote:
>
> > I'd suggest making NFS mounts hard, so processes can recover from an NFS
> > server reboot.
>
> ...plus set the NFS fsid for each export server side so they come back
> reproducibly each time...
>
> PS: I endorse what John said (now I've finished laughing), I'd suggest
> making sure you've at least got ECC memory though and RAID as those are
> the two parts that can go bad. When we had clusters with disks in
> compute nodes those were the most frequent failures, now we run diskless
> nodes it's memory DIMMs. :-)
>
> All the best,
> Chris
> --
> Christopher Samuel Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf