On 7/13/2010 4:50 PM, Prentice Bisbal wrote:
Bill,

Have you checked the health of the cables themselves? It could just be
dumb luck that a hardware failure coincided with the software change and
didn't manifest itself until the nodes were rebooted. Did you reboot the
switches, too?
I just looked at all the lights, and they all seem fine.

I would try dividing your cluster into small sections and see if the
problem exists across the sections.

Can you disconnect the edge switches from the core switch, so that each
edge switch is its own isolated fabric? If so, you could then start an
SM on each fabric and see if the problem is on every smaller IB fabric,
or just one.
I've thought about this one. Non-trivial: I have a core switch connecting 12 leaf switches, and each leaf switch connects 16 nodes. I need to use the core switch in order to make the problem appear.
The other option would be to disconnect all the nodes and add them back
one by one, but that wouldn't catch a problem with a switch-to-switch
connection.

How big is the cluster? Would it take hours or days to test each node
like this?

192 nodes (8 cores each).
You say the problem occurs when the node count goes over 32 (or 40). Do
you mean 32 physical nodes, or 32 processors? How does your scheduler
assign nodes? Would those 32 nodes always be in the same rack or on the
same IB switch, but not when the count increases?
It starts failing at 48 nodes. PBS allocates least-loaded, in round-robin fashion but sequentially, minus the PVFS nodes, which are distributed throughout the cluster and allocated last, round-robin. The 32 nodes definitely go through the core, and it never seems to matter where. I've tried to pinpoint bad nodes by keeping lists, but this happens everywhere. I was hoping that some tool I'm not aware of exists, but apparently not. My next attempt may be to pull the management card from the core and just run opensm on the nodes themselves, as we do for our other clusters. But I can test with osmtest all day and never get errors. This makes me feel very uncomfortable!

Of course, nothing is under warranty anymore. Divide and conquer seems like the only solution.
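For the divide-and-conquer pass, one way to keep the bookkeeping straight is a simple bisection over the node list: run the benchmark on a subset, and recurse into whichever half still fails. A minimal sketch of just the search logic (the node names and the run_test callback are hypothetical; in practice run_test would build a PBS hostfile from the subset and launch IMB via mpiexec):

```python
def smallest_failing_subset(nodes, run_test, min_size=48):
    """Bisect a node list to isolate the smallest failing group.

    run_test(subset) should launch the benchmark on that subset
    (e.g. IMB via mpiexec over a hostfile) and return True on a
    clean run.  min_size is the smallest job size at which the
    failure is known to appear (48 nodes here), so we stop
    splitting below it.
    """
    if run_test(nodes):
        return None  # the whole set passes; nothing to isolate
    while len(nodes) > min_size:
        half = len(nodes) // 2
        left, right = nodes[:half], nodes[half:]
        if not run_test(left):
            nodes = left
        elif not run_test(right):
            nodes = right
        else:
            # Both halves pass in isolation: the failure needs nodes
            # from both halves, i.e. it only shows up when traffic
            # crosses the core switch.
            break
    return nodes
```

The "both halves pass" exit is the interesting case here, since it would point at the core switch rather than at any one leaf or node.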

Thanks,
Bill
Prentice



Bill Wichser wrote:
Just some more info. Went back to the prior kernel with no luck.
Updated the firmware on the Topspin HBA cards to the latest (final)
version (fw-25208-4_8_200-MHEL-CF128-T). Nothing changed. Still not
sure where to look.

Bill Wichser wrote:
Machine is an older Intel Woodcrest cluster with a two tiered IB
infrastructure with Topspin/Cisco 7000 switches.  The core switch is a
SFS-7008P with a single management module which runs the SM manager.
The cluster runs RHEL4 and was upgraded last week to kernel
2.6.9-89.0.26.ELsmp.  The openib-1.4 remained the same.  Pretty much
stock.

After rebooting, the IB cards in the nodes remained in the INIT
state.  I rebooted the chassis IB switch as it appeared that no SM was
running.  No help.  I manually started an opensm on a compute node
telling it to ignore other masters as initially it would only come up
in STANDBY.  This turned all the nodes' IB ports to active and I
thought that I was done.

ibdiagnet complained that there were two masters. So I killed the
opensm and now it was happy. Both osmtest -f c and osmtest -f a come
back with OSMTEST: TEST "All Validations" PASS.
ibdiagnet -ls 2.5 -lw 4x finds all my switches and nodes with
everything coming up roses.

The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the
node count goes over 32 (or maybe 40).  This worked fine in the past,
before the reboot.  User apps are failing as well as IMB v3.2.  I've
increased the timeout using the "mpiexec -mca btl_openib_ib_timeout
20" which helped for 48 nodes but when increasing to 64 and 128 it
didn't help at all. Typical error messages follow.

Right now I am stuck.  I'm not sure what or where the problem might
be.  Nor where to go next.  If anyone has a clue, I'd appreciate
hearing it!

Thanks,
Bill


typical error messages

[0,1,33][btl_openib_component.c:1371:btl_openib_component_progress]
from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
[0,1,36][btl_openib_component.c:1371:btl_openib_component_progress]
from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
[0,1,40][btl_openib_component.c:1371:btl_openib_component_progress]
from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0
--------------------------------------------------------------------------

The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).

* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10).  The actual timeout value used is calculated as:

     4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
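Plugging numbers into that formula shows why raising btl_openib_ib_timeout stops helping: each increment doubles the ACK timeout, so the values grow very fast. A quick check of the arithmetic (plain Python, no Open MPI required):

```python
def ib_ack_timeout_seconds(t):
    """Local ACK timeout per the formula above: 4.096 us * 2^t."""
    return 4.096e-6 * (2 ** t)

# Default (10), an intermediate value, and the value Bill tried (20):
for t in (10, 14, 20):
    print("timeout=%d -> %.6f s" % (t, ib_ack_timeout_seconds(t)))
# timeout=10 -> 0.004194 s   (~4.2 ms)
# timeout=14 -> 0.067109 s   (~67 ms)
# timeout=20 -> 4.294967 s   (~4.3 s)
```

At 20 the sender is already waiting roughly 4.3 seconds per retry, with up to 7 retries, so if that still hits RETRY EXCEEDED, the packets are almost certainly being dropped outright rather than merely delayed.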
--------------------------------------------------------------------------

--------------------------------------------------------------------------


DIFFERENT RUN:

[0,1,92][btl_openib_component.c:1371:btl_openib_component_progress]
from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0
...
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf