On Wed, Aug 13, 2008 at 05:03:46PM +0100, Dave Love wrote:
> [I know in an ideal world the vendor between us and PathScale^WQlogic
> would sort this out.]
>
> I'm interested in the cause (and possible cure!) of intermittent errors
> on various nodes in our Infinipath system which stop MPI jobs with
> kernel messages like this, in case anyone's familiar with them:
>
>   lvinfi095:21.Hardware problem: {[RXE EAGERTID Memory Parity]}
>
> They seem to be new with an upgrade to Linux 2.6.22 from 2.6.11, but
> probably just manifested themselves in some other way previously.
>
> Google didn't produce any leads, and a brief look in the source suggests
> that tracking it down where it's generated in the ib_ipath module is
> non-trivial and likely won't tell me a lot.
>
> For what it's worth, the adaptors are
>
>   06:00.0 InfiniBand: PathScale, Inc InfiniPath HT-400 (rev 02)
>
> in two different sorts of Supermicro whose model numbers I don't know.
>

Dave,

Which driver is active, and which InfiniPath software release is
installed? The tool "ipath_control -i" can show which...

The kernel.org/OFED driver does not have as rich a set of error
recovery code for this card as the shipped driver. The recovery code
was seen as a badness and was not accepted by the kernel.org folk....

After a kernel update the shipped driver will not have been recompiled
for the new kernel, so the kernel.org driver would become active.

Look for this stuff in the Install Guide:

  # To rebuild the drivers, do the following (as root):
  # cd /usr/src/infinipath/drivers
  # ./make-install.sh
  # /etc/init.d/infinipath restart

-- 
T o m  M i t c h e l l
Got a great hat... now what.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
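
One rough way to check which ib_ipath build the running kernel would load is to look at where the module file lives. The sketch below is only that, a sketch: the function names are made up for illustration, and it assumes the usual layout in which kernel.org (in-tree) modules sit under /lib/modules/<version>/kernel/ while separately built drivers land elsewhere (for example extra/ or updates/). Where the shipped InfiniPath driver actually installs is not stated in this thread, so if it overwrites the in-tree file the path tells you nothing and "ipath_control -i" remains the better check.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not part of any InfiniPath tool.

# Classify a module file by its location under /lib/modules/<version>/.
# Assumption: in-tree modules live under .../kernel/, anything else under
# /lib/modules/ was built and installed separately.
classify_module_path() {
    case "$1" in
        /lib/modules/*/kernel/*) echo "in-tree" ;;
        /lib/modules/*)          echo "out-of-tree" ;;
        *)                       echo "unknown" ;;
    esac
}

# Print the running kernel, the ib_ipath file modinfo resolves, and its origin.
report_ipath_driver() {
    path=$(modinfo -F filename ib_ipath 2>/dev/null) || path=""
    echo "kernel: $(uname -r)"
    echo "module: ${path:-not found}"
    echo "origin: $(classify_module_path "$path")"
}

# Usage on a node: report_ipath_driver
```

If the origin comes back as in-tree after the 2.6.22 upgrade, that would be consistent with the kernel.org driver having taken over, and rebuilding as described in the Install Guide would be the next thing to try.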