Prentice Bisbal wrote:
> The server is a Dell PowerEdge R815 with 4 8-Core AMD processors and 128
> GB of RAM.

If the erroneous memory locations are moving around in memory without correlation to the DIMMs, then the next most likely culprits are a marginal power supply, CPU, or motherboard, in pretty much that order. (OK, it is kind of a toss-up between CPU and motherboard, but since you have 32 cores in the system I put the CPU first.)

If you have access to an oscilloscope, look closely at the voltages on the two machines. No need to cut in anywhere, just measure +5 V and +12 V on an unused disk or fan connector. If the machine prone to memory errors is significantly noisier than the one that is not, that could be the problem. I have seen this exactly once: every power supply tester said the supply was good, and a multimeter had it pegged at the right voltages, but there was a ton of high-frequency noise coming out of the power supply.

If you can disable CPUs through the BIOS on that machine, running for a while under each CPU alone might narrow the issue down to 1 of the 4. You wouldn't be done then though, because it could be the socket and not the CPU itself. Still, if you can get it down to 1 CPU, then you could swap that CPU with another and see if the issue moves with it.

You probably already did this, but be sure both machines have the same BIOS release.

Regards,

David Mathog
mat...@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf