On Wednesday 30 July 2008 09:13:56 am David Mathog wrote:
> If one were to build nodes without ECC memory it would probably be a
> good idea to reboot them from time to time to clean out whatever bad
> bits might have accumulated. It then occurred to me that doing so
> would require a trip through the BIOS on every reboot, at least on
> every x86 based computer I'm familiar with. That is not a terrible
> thing, but it made me wonder if it is really necessary.

I may be totally missing the point, but doesn't the memory need to be
physically (as in electrically) reset in order to clean out those bad
bits? And doesn't that require a hard reboot, with the machine power
cycled, so that the memory cells are reinitialized?

I mean, if the BIOS stage is skipped, as when kexec'ing a new kernel,
electrical initialization doesn't occur, and the bad bits will probably
stay where they are. Unless the kernel does this kind of scrubbing during
its initialization phase, which I don't know, I don't see any reason why
the memory would be cleaned of errors.

Another thing I wonder is whether a reboot would do any good for non-ECC
memory anyway. As far as I understand it, a memory error is either a
repeatable, hard one, like a bad chip, in which case a reboot won't change
anything since the hardware is faulty; or a transient, soft one, where a
bad value is read once but subsequent reads are OK. So unless there is
some sort of accumulation somewhere in the soft case, I don't really
understand what a reboot could do about it.

If you have any light to shed on this, I'd be interested.

Cheers,
-- 
Kilian
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf