On Wed, Aug 13, 2008 at 05:03:46PM +0100, Dave Love wrote:
> [I know in an ideal world the vendor between us and PathScale^WQlogic
> would sort this out.]
> 
> I'm interested in the cause (and possible cure!) of intermittent errors
> on various nodes in our Infinipath system which stop MPI jobs with
> kernel messages like this, in case anyone's familiar with them:
> 
>   lvinfi095:21.Hardware problem: {[RXE EAGERTID Memory Parity]}
> 
> They seem to be new with an upgrade to Linux 2.6.22 from 2.6.11, but
> probably just manifested themselves in some other way previously.
> 
> Google didn't produce any leads, and a brief look in the source suggests
> that tracking it down where it's generated in the ib_ipath module is
> non-trivial and likely won't tell me a lot.
> 
> For what it's worth, the adaptors are
> 
>   06:00.0 InfiniBand: PathScale, Inc InfiniPath HT-400 (rev 02)
> 
> in two different sorts of Supermicro whose model numbers I don't know.
> 

Dave,

Which driver is active, and which InfiniPath software release
is installed?  The tool "ipath_control -i" can show which.

The kernel.org/OFED driver does not have as rich a set of error-recovery
code for this card as the shipped driver.  The recovery code was judged
undesirable and was not accepted by the kernel.org folk.

After a kernel update the shipped driver will not have been recompiled
for the new kernel, so the kernel.org driver would become active instead.
Look for the rebuild steps in the Install Guide:

        #   To rebuild the drivers, do the following (as root):
        # cd /usr/src/infinipath/drivers
        # ./make-install.sh
        # /etc/init.d/infinipath restart
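
Before rebuilding, it may help to confirm which ib_ipath module the
running kernel would actually load.  The following is only a rough sketch,
not something from the Install Guide: the helper names module_origin and
check_ipath are made up for illustration, and it assumes the usual layout
in which in-tree (kernel.org) modules sit under .../kernel/drivers/ while
separately built drivers are installed elsewhere under /lib/modules
(for example updates/ or extra/).  Where the shipped driver really lands
on your system is an assumption you should verify.

```shell
#!/bin/sh
# Sketch: report where the ib_ipath module for the running kernel comes from.
# ASSUMPTION: in-tree modules live under .../kernel/drivers/...; anything
# else under /lib/modules is treated as a separately built driver.

module_origin() {
    # $1 is a module file path as printed by "modinfo -n" (may be empty).
    case "$1" in
        */kernel/drivers/*) echo "in-tree" ;;
        "")                 echo "not-found" ;;
        *)                  echo "out-of-tree" ;;
    esac
}

check_ipath() {
    # "modinfo -n" prints only the file name of the module it would load.
    path=$(modinfo -n ib_ipath 2>/dev/null)
    echo "kernel:  $(uname -r)"
    echo "module:  ${path:-none}"
    echo "origin:  $(module_origin "$path")"
}
```

Source the file and run check_ipath on an affected node.  An "in-tree"
result right after a kernel upgrade would be consistent with the
kernel.org driver having taken over, but it is a hint, not proof.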

-- 
        T o m  M i t c h e l l 
        Got a great hat... now what.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
