Dear all,

Around 2 April I removed 2 Opteron 246s and their "companion" 4x 512 MB DIMMs from two HP DL145-G2s, leaving them empty, and used the parts to populate two other HPs (which now have 2 CPUs and 4 GB per node).
Then I installed 2 dual-core Opterons in each emptied DL145-G2, together with 4 sticks of 1 GB (2 sticks per CPU). So I now have 2 DL145-G2 nodes with 2 single-core 246 / 4 GB each, and 2 DL145-G2 nodes with 2 dual-core 275 / 4 GB each.

On 18 April, one of the dual-core nodes crashed with an ECC error. From IPMI, for that node:

    04/18/2008 | 20:26:26 | Memory #0x02 | Uncorrectable ECC | Asserted
    06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
    06/23/2008 | 11:58:34 | Memory #0x02 | Uncorrectable ECC | Asserted
    07/19/2008 | 22:41:12 | Memory #0x02 | Uncorrectable ECC | Asserted
    07/22/2008 | 17:18:00 | Memory #0x02 | Uncorrectable ECC | Asserted
    07/23/2008 | 22:08:15 | Memory #0x02 | Uncorrectable ECC | Asserted
    07/28/2008 | 17:52:23 | Memory #0x02 | Uncorrectable ECC | Asserted

On 07/19 the memory of CPU0 was replaced; on the 27th, the remaining memory was replaced. The ECC crashes continue, at a rate ranging from one per day to one per week.

On 07/28 came the first ECC error on the other Opteron-275 populated node:

    07/28/2008 | 18:54:23 | Memory #0x02 | Uncorrectable ECC | Asserted

All nodes have IB boards, and I swapped the boards between the first crashing node and the second crashing node; it was a few days after that swap that the second node crashed for the very first time.

I have observed that, within 2 minutes of each ECC error, these events are always logged:

    06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
    06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted

(but they are also logged at other times).

I am running Scientific Linux 5. The (LAM) MPI application uses almost 100% CPU and exchanges lots of small packets through IPoIB (I have not used "native" IB yet). "Everything" is 64-bit (kernel, apps).

Any thoughts?
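P.S. In case anyone wants to check the "within 2 minutes" observation against their own event logs, here is a minimal Python sketch of one way to do it. It assumes the log text uses the "date | time | sensor | event | state" layout quoted above (an optional leading record-ID column is ignored). The function names `parse_sel` and `acpi_near_ecc` are made up for this example, and the sample data is just three of the lines quoted in this message.

```python
from datetime import datetime, timedelta


def parse_sel(text):
    """Parse 'date | time | sensor | event | state' lines into tuples."""
    events = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not an event line in the assumed layout
        # Keep the last five columns; a leading record-ID column is ignored.
        date, time, sensor, event, state = fields[-5:]
        try:
            stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # skip lines whose timestamp does not parse
        events.append((stamp, sensor, event, state))
    return events


def acpi_near_ecc(events, window=timedelta(minutes=2)):
    """Map each uncorrectable-ECC timestamp to the ACPI events within `window`."""
    ecc = [e for e in events if "Uncorrectable ECC" in e[2]]
    acpi = [e for e in events if e[1].startswith("System ACPI Power State")]
    return {e[0]: [a for a in acpi if abs(a[0] - e[0]) <= window] for e in ecc}


# Three lines copied from the log excerpts in this message.
SAMPLE = """\
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S0/G0: working | Asserted
06/18/2008 | 11:58:16 | System ACPI Power State #0x01 | S5/G2: soft-off | Deasserted
06/18/2008 | 12:00:16 | Memory #0x02 | Uncorrectable ECC | Asserted
"""

if __name__ == "__main__":
    for stamp, nearby in acpi_near_ecc(parse_sel(SAMPLE)).items():
        print(stamp, "->", len(nearby), "ACPI event(s) within 2 minutes")
```

Running the same functions over a full log dump would also show how many ACPI power-state events occur with no ECC error nearby, which matters because those events are logged at other times too.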
Best Regards,
paulo lopes

--
Paulo Afonso Lopes                  | Tel: +351- 21 294 8536
Departamento de Informática         |           294 8300 ext.10763
Faculdade de Ciências e Tecnologia  | Fax: +351- 21 294 8541
Universidade Nova de Lisboa         | e-mail: [EMAIL PROTECTED]
2829-516 Caparica, PORTUGAL
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf