See my answers inline.

Nifty Tom Mitchell wrote:
> On Tue, Dec 02, 2008 at 10:24:15AM -0500, Prentice Bisbal wrote:
>> I'm getting this error when I run ibchecknet on my cluster:
>>
>> #warn: counter VL15Dropped = 476 (threshold 100) lid 1 port 1
>> Error check on lid 1 (aurora HCA-1) port 1: FAILED
>>
>> I've googled around this morning, but haven't found anything helpful.
>> Most of the hits turn up code with the phrase "VL15Dropped", but nothing
>> explaining what this error means, what causes it, or how to fix it.
>>
>> After clearing the counters with 'perfquery -r', the VL15Dropped count
>> starts increasing from zero almost immediately.
>>
>> Any ideas what this error represents or how to fix? Could it be a bad
>> cable?
>>
>
> Can you be specific about the hardware (HCA and switch) and software?
> How large is the fabric?
> What subnet manager is running and where?
>
> The host behind LID-1 is the one of interest.

IB Switch: Cisco 7012 D, 144-port
HCAs: Cisco, which is really Mellanox:

# lspci | grep Infini
0b:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20)

The subnet manager is OpenSM 3.1.8-1.el5, which is provided by my Linux
distro, PU_IAS 5.2, which is a rebuild of RHEL 5.2. It is running on the
master node, aurora. The HCA with the error is on this node (see the error
message in the original post).

>
> If I recall correctly, VL15 is reserved exclusively for subnet management
> and is not optional. Traffic to VL15 might be randomly dropped by the
> switch, SMA or interrupt handler. As long as the subnet is OK modest
> dropped traffic on VL15 may not be an issue.
>
> What is running on the fabric concurrently with ibchecknet (and on the LID-1
> host)?

Not sure what you mean. Do you want to see the output of ibchecknet?

>
> Subnet management traffic should be light, very light. Tell us about
> the subnet manager situation on your fabric. There should only
> be one active subnet manager. Mixed and uncooperating SMs could
> cause this, as could basic IB errors (connectors, cables, connections).
> If the SM is running on LID-1 then traffic will reflect the fabric size.

There is only one SM running. It's running on the master node. The other
nodes don't even have the OpenSM package installed.

>
> What other IB errors are you seeing.. If the port for LID-1 is not seeing
> IB errors other than VL15 you should be OK -- do look for multiple SMs.

I'm not seeing any other errors. This one is a new development, too.

> If you stop your subnet manager does the counter reflect the pause.
>

Haven't tried yet. And since it's almost quitting time, I'm not going to
try until tomorrow.

--
Prentice
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf