On Thu, 08 Apr 2004, martin f krafft wrote:
> also sprach Henrique de Moraes Holschuh <[EMAIL PROTECTED]> [2004.04.07.1633 +0200]:
> > Let me guess: VIA chipset? I have a A7V motherboard that does the
> > same, unpredictably. The PCI bus just hangs the entire machine.
> > After that one, I tried to learn a thing or two about common
> > consumer computer stabilities.
>
> Does your hanging sometimes come with Ooooopses and BUGs? Does it

No. It is a "machine totally dead, CPU won't even NMI, soundboard keeps
looping whatever is in its buffer" kind of bug. Probably a northbridge
issue. I can trigger it by increasing the PCI activity: playing sound
while doing heavy network (PCI NIC) and disk IO would crash it sooner or
later. Removing everything offboard (except the videocard) doesn't fix
the issue either, although it does make it far less likely. Using a PCI
videocard instead of an AGP one makes no difference.

> > ECC memory is extremely more resilient to corruption. It WILL
> > experience bit flips as often as common memory, obviously... But
> > you need two bit flips *in a certain area* (that must happen
> > before the affected area is accessed again), to get memory
> > corruption. That is far more unlikely to happen.
>
> Can memtest86 detect such errors?

Yes, if you change it to do so, and you're using Memtest86+ (note the
"+"). I believe the upcoming Memtest86+ version will have this test mode
added. Basically, you write a pattern, leave the memory alone for some
hours for the bit rot to show up, and read it back. Works with and
without ECC.

Almost all bit flips are due to power supply issues and other electrical
noise, but you could have some due to high-energy particles (cosmic
rays) too, I suppose.

The northbridge will signal the CPU if it detects an ECC error, I
_think_, but I have never seen one reported in my life with Linux (which
is kinda suspicious, as it is supposed to warn you on correctable errors
as well).

> > You need a top-notch power supply and good cooling too, of course.
> > Most power supplies aren't adequate for non-error operation. You
> > have to handpick them. And the good ones ain't cheap.
>
> Does it suffice to connect it to a stabilising UPS?

Not necessarily. If you have a double-conversion UPS with extremely high
quality inverters then yes, it would remove all external noise from the
power mains, and introduce little of its own. Some UPSes might even make
matters worse.

But if the power supply can't do a proper job of handling the kind of
current load your system has, a good UPS will not be enough. And the
Athlon can do some very demanding things to the power rails, such as
drawing a high-current square wave if you let it run in power saving
mode (disconnect from northbridge when idle). Good PSUs won't bat an eye
at that load. Bad ones will fail to regulate the output voltage on the
rail, which could cause trouble.

> My current thinking (actually, mostly Herbert's) is that it's an
> IRQ-related problem. I tried booting with all permutations of ACPI,
> APIC and LAPIC, but no dice with either. It wouldn't be the first
> time that IRQs get an x86 machine down.

Try to make sure you don't have external PCI cards sharing IRQs with the
internal devices, and see if that improves things. Disable every onboard
PCI device that you don't need. Enable the LAPIC NMI watchdog, and see
if that causes better (or worse) behaviour. Try it with the IOAPIC NMI
watchdog as well.

But really, before you do *anything*, get memtest86+ and do a 24-hour
burn-in test in the "extended" mode.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
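
P.S. A rough sketch of the IRQ-sharing check suggested above, assuming a
Linux box where /proc/interrupts has the usual layout (one header row of
CPU columns, then "IRQ: counts controller device[, device...]"). The
function name shared_irqs and the sample device names are made up for
illustration, and the parsing assumes device names contain no spaces or
commas, so treat it as a starting point rather than a definitive tool.

```python
SAMPLE = """\
           CPU0
  0:     123456          XT-PIC  timer
  5:       1000          XT-PIC  eth0, EMU10K1
 11:      20000          XT-PIC  uhci_hcd, via82cxxx, bttv0
NMI:          0
"""


def shared_irqs(interrupts_text):
    """Return {irq_number: [device, ...]} for IRQ rows listing several devices."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: one "CPUn" column per CPU
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip NMI, LOC, ERR and other non-numeric rows
        fields = rest.split(None, ncpus)  # per-CPU counters, then the remainder
        if len(fields) <= ncpus:
            continue  # row has no controller/device description
        parts = [p.strip() for p in fields[ncpus].split(",")]
        if len(parts) < 2:
            continue  # a single device: nothing is shared on this line
        # Assumption: the first chunk is "<controller words> <device>",
        # so the last whitespace-separated token is the device name.
        parts[0] = parts[0].split()[-1]
        shared[int(label)] = parts
    return shared


if __name__ == "__main__":
    for irq, devices in sorted(shared_irqs(SAMPLE).items()):
        print(f"IRQ {irq} is shared by: {', '.join(devices)}")
```

On a real machine, pass the contents of /proc/interrupts instead of
SAMPLE. If an add-in PCI card shows up on the same line as an onboard
device, try moving the card to another slot or disabling the onboard
device in the BIOS.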