Mark Hahn wrote:
The 2GB DIMMs emit the same heat as the 1GB DIMMs. So if you have a 1000-node cluster and you use the larger (slightly more expensive) 2GB DIMMs instead of twice as many 1GB DIMMs, you will emit somewhat less heat. I haven't done the

assuming the same number of chips per dimm.  if your 1G are single-sided,
and 2G are double, you save nothing. it's also interesting that for a given
generation chip, the higher-clocked dimms are significantly hotter
(say, 200 vs 300 mA max draw for 1G pc2/667 vs /800).

I haven't seen too many single-sided DIMMs these days for registered ECC RAM in x4/x8 flavors. Maybe my horizons aren't broad enough :) You are correct, though; I had been assuming the same number of chips. Though as I understand things ...


also, I notice that x16 chips dissipate a lot more than x4 or x8, even
though the chips have the same number of on-chip banks. I guess this says that the main power issue is driving wide parallel buses at speed...

... the drivers are the power-hungry elements.

That, and fewer parts means a lower absolute number of failures, but that is another issue.

a very interesting one. I wonder how many people have scrubbing turned on in their cluster, and how many use mcelog to monitor the ECC rate.

We do on clusters we ship/build. I specifically run tests to flush out memory errors. Sadly, memtest86 only catches the "obvious" errors; in most cases it will find those fairly quickly. I run several heavy-duty (electronic structure) codes that pound on memory and CPU. Using those, we have found many MCE errors that memtest86 misses. Most of the MCE errors are single-bit ECC errors, more often due to timing and access patterns than to a simple sequential walk through memory (which is what memtest86 does). Nothing stresses memory like real applications.

Moreover, it is pretty easy to deduce which chip is problematic (assuming it is RAM) based upon the address. It isn't always RAM; mcelog has shown us some northbridge/southbridge-type errors as well.
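On the scrubbing question: on Linux, the EDAC driver exposes the hardware scrub rate and cumulative corrected-error counts through sysfs, which complements the per-event records mcelog decodes. A minimal sketch, assuming an EDAC memory-controller driver is loaded (paths can differ by kernel version and chipset):

```shell
#!/bin/sh
# Hedged sketch: inspect ECC scrubbing and corrected-error counts via EDAC sysfs.
# Assumes an EDAC driver is loaded for your memory controller; if it isn't,
# the files below simply won't exist.
scrub=/sys/devices/system/edac/mc/mc0/sdram_scrub_rate
if [ -r "$scrub" ]; then
    echo "scrub rate (bytes/sec): $(cat "$scrub")"
else
    echo "no EDAC scrub-rate file (driver not loaded?)"
fi
# Per-csrow corrected-error counters, if exposed. A steadily climbing
# ce_count on one csrow points at the same DIMM pair every time.
for f in /sys/devices/system/edac/mc/mc*/csrow*/ce_count; do
    [ -r "$f" ] && echo "$f: $(cat "$f")"
done
```

Watching these counters over time tells you whether a corrected error was a one-off cosmic-ray event or a DIMM that is steadily degrading.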

From this

MCE 0
CPU 0 4 northbridge TSC 2ce665a9f4c0
ADDR 117600
  Northbridge Chipkill ECC error
  Chipkill ECC syndrome = e214
       bit32 = err cpu0
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS d40a4001e2080813 MCGSTATUS 0

you see address 0x117600. With a quick bit of Octave, you can convert that address to a DIMM pair (you are inserting them in pairs, right?) if you have bank interleaving on, and node interleaving off. THe latter messes up this calculation.

octave:1> gigabyte=1024*1024*1024
gigabyte = 1073741824
octave:2> 0x117600/gigabyte
ans = 0.0010657

which suggests it is in the 0-1 DIMM pair (gigabyte-sized DIMMs). You can replace one and try again. I err on the side of replacing both (the banking impacts the calculation as well).
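The same arithmetic works in plain shell, which is handy on a compute node that doesn't have Octave installed. A hedged sketch under the same assumptions as the Octave calculation above: 1 GB DIMMs installed in pairs (so each pair spans 2 GiB of physical address space), bank interleaving on, node interleaving off:

```shell
#!/bin/sh
# Hedged sketch: map a corrected-ECC physical address to a DIMM-pair index.
# Assumes 1 GB DIMMs in pairs (2 GiB per pair), bank interleave on,
# node interleave off. The ADDR value below is the one from the MCE record.
addr=$(( 0x117600 ))
pair_size=$(( 2 * 1024 * 1024 * 1024 ))   # two 1 GiB DIMMs per pair
echo "DIMM pair index: $(( addr / pair_size ))"
```

For address 0x117600 this prints pair index 0, i.e. the 0-1 DIMM pair, matching the Octave result.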


comments?

mcelog is your friend. Install it and use it if possible. Keep a few spare DIMMs on hand in a storage locker somewhere for fast swap-out.
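One low-effort way to put that advice into practice: run mcelog periodically from cron so corrected errors get logged before the kernel's overflow bit (like the one in the record above) starts eating them. A sketch assuming the classic non-daemon usage, where invoking mcelog drains and decodes pending records from /dev/mcelog; the cron file name and log path here are illustrative, and newer setups may prefer running it as a daemon instead:

```shell
# /etc/cron.d/mcelog (hypothetical file name) -- decode any pending
# machine-check records hourly and append them to a log for later review.
0 * * * *  root  /usr/sbin/mcelog >> /var/log/mcelog 2>&1
```

Checking that log across the cluster (or shipping it to a central syslog host) turns the one-off forensics above into routine monitoring.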


thanks, mark.


--

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
