Mark Hahn wrote:
The 2GB DIMMs emit the same heat as the 1GB DIMMs. So if you have a 1000-node cluster and you use the larger (slightly more expensive) 2GB DIMMs instead of twice as many 1GB DIMMs, you will emit somewhat less heat. I haven't done the

assuming the same number of chips per dimm.  if your 1G are single-sided,
and 2G are double, you save nothing. it's also interesting that for a given
generation chip, the higher-clocked dimms are significantly hotter
(say, 200 vs 300 mA max draw for 1G pc2/667 vs /800).

I haven't seen too many single-sided DIMMs these days for registered ECC RAM in x4/x8 flavors. Maybe my horizons aren't broad enough :) You are correct, though; I had been assuming the same number of chips. Though as I understand things ...


also, I notice that x16 chips dissipate a lot more than x4 or x8, even
though the chips have the same number of on-chip banks. I guess this says that the main power issue is driving wide parallel buses at speed...

... the drivers are the power-hungry elements.

That, and fewer parts means a lower absolute number of failures, but that is another issue.

a very interesting one. I wonder how many people have scrubbing turned on in their cluster, and how many use mcelog to monitor the ECC rate.

We do on clusters we ship/build. I specifically run tests to flush out memory errors. Sadly, memtest86 only catches the "obvious" errors; in most cases it will find those fairly quickly. I run several heavy-duty (electronic structure) codes that pound on memory and CPU. Using those, we have found many MCE errors that memtest86 misses. Most of the MCE errors are single-bit ECC errors, more often due to timing and access patterns than to a simple sequential walk through memory (which is what memtest86 does). Nothing stresses memory like real applications.

Moreover, it is pretty easy to deduce which chip is problematic (assuming it is RAM) based upon the address. It isn't always RAM; mcelog has shown us some northbridge/southbridge-type errors as well.
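On the scrubbing question: on Linux, the EDAC driver exposes the hardware scrub rate and cumulative corrected-error counts through sysfs, which complements the per-event records mcelog decodes. A minimal sketch, assuming an EDAC memory-controller driver is loaded (paths can differ by kernel version and chipset):

```shell
#!/bin/sh
# Hedged sketch: inspect ECC scrubbing and corrected-error counts via EDAC sysfs.
# Assumes an EDAC driver is loaded for your memory controller; if it isn't,
# the files below simply won't exist.
scrub=/sys/devices/system/edac/mc/mc0/sdram_scrub_rate
if [ -r "$scrub" ]; then
    echo "scrub rate (bytes/sec): $(cat "$scrub")"
else
    echo "no EDAC scrub-rate file (driver not loaded?)"
fi
# Per-csrow corrected-error counters, if exposed. A steadily climbing
# ce_count on one csrow points at the same DIMM pair every time.
for f in /sys/devices/system/edac/mc/mc*/csrow*/ce_count; do
    [ -r "$f" ] && echo "$f: $(cat "$f")"
done
```

Watching these counters over time tells you whether a corrected error was a one-off cosmic-ray event or a DIMM that is steadily degrading.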

From this

MCE 0
CPU 0 4 northbridge TSC 2ce665a9f4c0
ADDR 117600
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = e214
       bit32 = err cpu0
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d40a4001e2080813 MCGSTATUS 0

you see address 0x117600. With a quick bit of Octave, you can convert that address to a DIMM pair (you are inserting them in pairs, right?) if you have bank interleaving on, and node interleaving off. THe latter messes up this calculation.

octave:1> gigabyte=1024*1024*1024
gigabyte = 1073741824
octave:2> 0x117600/gigabyte
ans = 0.0010657

which suggests it is in the 0-1 DIMM pair (gigabyte-sized DIMMs). You can replace one and try again. I err on the side of replacing both (the banking impacts the calculation as well).
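The same arithmetic works in plain shell, which is handy on a compute node that doesn't have Octave installed. A hedged sketch under the same assumptions as the Octave calculation above: 1 GB DIMMs installed in pairs (so each pair spans 2 GiB of physical address space), bank interleaving on, node interleaving off:

```shell
#!/bin/sh
# Hedged sketch: map a corrected-ECC physical address to a DIMM-pair index.
# Assumes 1 GB DIMMs in pairs (2 GiB per pair), bank interleave on,
# node interleave off. The ADDR value below is the one from the MCE record.
addr=$(( 0x117600 ))
pair_size=$(( 2 * 1024 * 1024 * 1024 ))   # two 1 GiB DIMMs per pair
echo "DIMM pair index: $(( addr / pair_size ))"
```

For address 0x117600 this prints pair index 0, i.e. the 0-1 DIMM pair, matching the Octave result.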


comments?

mcelog is your friend. Install it and use it if possible. Keep a few spare DIMMs on hand in a storage locker somewhere for fast swap-out.
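One low-effort way to put that advice into practice: run mcelog periodically from cron so corrected errors get logged before the kernel's overflow bit (like the one in the record above) starts eating them. A sketch assuming the classic non-daemon usage, where invoking mcelog drains and decodes pending records from /dev/mcelog; the cron file name and log path here are illustrative, and newer setups may prefer running it as a daemon instead:

```shell
# /etc/cron.d/mcelog (hypothetical file name) -- decode any pending
# machine-check records hourly and append them to a log for later review.
0 * * * *  root  /usr/sbin/mcelog >> /var/log/mcelog 2>&1
```

Checking that log across the cluster (or shipping it to a central syslog host) turns the one-off forensics above into routine monitoring.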


thanks, mark.


--

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
