Hi all, I am forwarding this to the list on behalf of Jörg, as his account no longer seems to be allowed to post here.
-- Reuti

> #############
> Dear all,
>
> as I cannot post directly to the list although I am subscribed to it, I have
> asked a friend of mine to post this for me.
> I am currently having severe problems with one of the clusters I am
> maintaining. Around 50% of the nodes crash when we run cp2k on them.
> Although they are IB nodes, the test jobs crash the node even without the
> IB card installed, so I can rule out an IB related problem. Memtest was ok:
> I ran 9 cycles without any problems. Unfortunately I cannot swap the memory
> as I have no spare modules at all, and hence I have to rely on Memtest here.
> The nodes which are causing the problems show other symptoms as well: 3 of
> them failed to boot again after a normal shutdown procedure (the fans come
> on and die after a short period, and I don't even get to the POST stage).
> So they are offline as well. Two of the remaining nodes were exceedingly hot
> after a reboot. When I took them out the fans were spinning, and now they
> appear to be ok. These are AMD Opteron 2220 dual core processors with 2 CPUs
> per node. The motherboard is a H8DMR-82 with the BIOS version 080014
> (release date 07/13/2007). It appears that almost always the same nodes are
> crashing with this error message:
>
> Hardware Error
> CPU0 Machine Check Exception 4 Bank 2 b200200000000863
> TSC 108dd369444
> Processor 2:40f13 Time 1311847912 Socket 0 APIC 0
> MC2-Status: Uncorrected error, report: yes MisV: invalid
> CPU context corrupt: yes UECC Error
> Bus Unit Error: prefetch/ECC error in data read from NB: local node originated (SRC)
> Transaction type: prefetch (mem access), no timeout, cache level L3/generic.
> Participating Processors: local node originated (SRC)
>
> Judging from this I would guess there is a memory related problem.
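For anyone who wants to check the decoded text against the raw status word b200200000000863 by hand, here is a minimal Python sketch. It assumes the architectural x86 MCi_STATUS flag bits (63 down to 57), the AMD K8-era CECC/UECC bits (46/45), and the bus error code layout 0000 1PPT RRRR IILL; the function name `decode_mci_status` and the lookup tables are made up for illustration and are not part of mcelog or the kernel.

```python
# Sketch: decode an x86 machine-check status word (MCi_STATUS).
# Assumed layout: architectural flags in bits 63..57, AMD K8-era
# CECC/UECC in bits 46/45, error code in the low 16 bits.

FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
    46: "CECC",   # correctable ECC error (AMD)
    45: "UECC",   # uncorrectable ECC error (AMD)
}

# Field encodings of the bus error code 0000 1PPT RRRR IILL.
PP = ["SRC", "RES", "OBS", "GEN"]
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
II = ["MEM", "reserved", "IO", "GEN"]
LL = ["L0", "L1", "L2", "LG"]


def decode_mci_status(status):
    """Return the set flags and, for bus errors, the decoded sub-fields."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}
    if code & 0xF800 == 0x0800:  # top five bits 00001 mark a bus error
        rrrr = (code >> 4) & 0xF
        result.update(
            kind="bus",
            participation=PP[(code >> 9) & 0x3],
            timeout=bool((code >> 8) & 0x1),
            request=RRRR[rrrr] if rrrr < len(RRRR) else "reserved",
            access=II[(code >> 2) & 0x3],
            level=LL[code & 0x3],
        )
    return result


print(decode_mci_status(0xB200200000000863))
```

Under these assumptions the word from the log decodes to an enabled, uncorrected UECC error with processor context corrupt, a locally originated (SRC) prefetch request to memory, no timeout, at the generic cache level, which agrees with the text printed above.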
> Given there are a number of people on the list here who have probably seen
> similar hardware before: do I simply have a bad batch of hardware which is
> known to cause problems, or do I have a different issue here? What I am
> after is some kind of idea of where to look next. It is not the compiled
> program, as taking out the disc and placing it in a different node (same
> motherboard, same Opteron but slightly different flags) does not cause any
> problems at all.
> Given the large number of nodes which are causing problems, before I propose
> to write off these nodes I would like to make sure it is not a subtle issue
> which something like a BIOS upgrade could cure.
>
> Many thanks for your help and all the best from London
>
> Jörg
>
> ##############
>
> --
> *************************************************************
> Jörg Saßmannshausen
> University College London
> Department of Chemistry
> Gordon Street
> London
> WC1H 0AJ
>
> email: j.sassmannshau...@ucl.ac.uk
> web: http://sassy.formativ.net
>
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf