Re: thunderx sgmii interface hang

David Daney Fri, 22 Dec 2017 16:31:04 -0800

On 12/22/2017 04:22 PM, Tim Harvey wrote:

On Fri, Dec 22, 2017 at 3:00 PM, David Daney <dda...@caviumnetworks.com> wrote:

On 12/22/2017 02:19 PM, Tim Harvey wrote:


On Tue, Dec 19, 2017 at 12:52 PM, Andrew Lunn <and...@lunn.ch> wrote:


On Mon, Dec 18, 2017 at 01:53:47PM -0800, Tim Harvey wrote:


On Wed, Dec 13, 2017 at 11:43 AM, Andrew Lunn <and...@lunn.ch> wrote:


The nic appears to work fine (pings, TCP etc) up until a performance
test is attempted.
When an iperf bandwidth test is attempted the nic ends up in a state
where truncated-ip packets are being sent out (per a tcpdump from
another board):



Hi Tim

Are pause frames supported? Have you tried turning them off?

Can you reproduce the issue with UDP? Or is it TCP only?


Andrew,

Pause frames don't appear to be supported yet and the issue occurs
when using UDP as well as TCP. I'm not clear what the best way to
troubleshoot this is.



Hi Tim

Is pause being negotiated? In theory, it should not be. The PHY should
not offer it, if the MAC has not enabled it. But some PHY drivers are
probably broken and offer pause when they should not.

Also, can you trigger the issue using UDP at say 75% the maximum
bandwidth. That should be low enough that the peer never even tries to
use pause.

All this pause stuff is just a stab in the dark. Something else to try
is to turn off various forms of acceleration with ethtool -K, and see if
that makes a difference.


Andrew,

Currently I'm not using the DP83867_PHY driver (after verifying the
issue occurs with or without that driver).

It does not occur if I limit UDP (i.e. to 950 Mbps). I disabled all offloads
and the issue still occurs.

I have found that once the issue occurs I can recover to a working
state by clearing/setting BGX_CMRX_CFG[BGX_EN] and once I encounter
the issue and recover with that, I can never trigger the issue again.
If I toggle that register bit upon power-up, before the issue occurs, it
will still occur.

The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
- when cleared all dedicated BGX context state for LMAC (state
machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
resources (data path, serdes lanes) is disabled
- when set LMAC operation is enabled (link bring-up, sync, and tx/rx
of idles and fault sequences)
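
The clear-then-set sequence described above can be sketched roughly as follows. This is only an illustration: the register is simulated with a plain struct, and fake_csr, csr_toggle_enable and the en_mask parameter are hypothetical names, not taken from the driver or the manual. The bit position of the enable field is deliberately left to the caller, since the thread does not state it.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped 64-bit CSR. A real driver
 * would go through its MMIO accessors on the mapped BGX register. */
typedef struct {
    uint64_t value;
} fake_csr;

/* Clear the enable bit(s), then set them again, leaving all other bits
 * untouched. en_mask is supplied by the caller; its position in
 * BGX_CMRX_CFG is an assumption to be checked against the manual. */
void csr_toggle_enable(fake_csr *cfg, uint64_t en_mask)
{
    uint64_t v = cfg->value;

    cfg->value = v & ~en_mask;  /* clear: LMAC context state is reset */
    cfg->value = v | en_mask;   /* set: LMAC operation is enabled again */
}
```

On real hardware the two stores would have to be actual register writes (not optimized away), and some delay or status poll between them may be needed; neither is shown here.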



You could try looking at
BGXX_GMP_PCS_INTX
BGXX_GMP_GMI_RXX_INT
BGXX_GMP_GMI_TXX_INT

Those are all W1C registers that should contain all zeros.  If they don't,
just write back to them to clear before running a test.

If there are bits asserting in these when the thing gets wedged up, it might
point to a possible cause.
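
For anyone unfamiliar with W1C (write-1-to-clear) semantics, here is a minimal simulation of "read the register, then write the same value back to clear it". The w1c_reg type and the function names are hypothetical and exist only for this sketch; on the real device the read and write would target the BGXX_GMP_* interrupt registers listed above.

```c
#include <stdint.h>

/* Simulated write-1-to-clear interrupt register. */
typedef struct {
    uint64_t pending;
} w1c_reg;

/* Device-side behaviour of a W1C write: every 1 bit in val clears the
 * matching pending bit, every 0 bit leaves it unchanged. */
void w1c_write(w1c_reg *r, uint64_t val)
{
    r->pending &= ~val;
}

uint64_t w1c_read(const w1c_reg *r)
{
    return r->pending;
}

/* Read the pending bits and write the same value back, so exactly the
 * bits that were observed get cleared. Returns what was pending. */
uint64_t w1c_read_and_clear(w1c_reg *r)
{
    uint64_t seen = w1c_read(r);

    if (seen)
        w1c_write(r, seen);
    return seen;
}
```

Calling w1c_read_and_clear() before a test run gives a clean baseline; calling it again once the interface is wedged returns only the bits that asserted during the run.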


David,

BGXX_GMP_GMI_TXX_INT[UNDFLW] is getting set when the issue is
triggered. From CN80XX-HM-1.2P this is caused by:

"In the unlikely event that P2X data cannot keep the GMP TX FIFO full,
the SGMII/1000BASE-X/QSGMII packet transfer will underflow. This
should be detected by the receiving device as an FCS error.
Internally, the packet is drained and lost"


Yikes!

Perhaps this needs to be caught and handled in some way. There are some
interrupt handlers in nicvf_main.c, yet I'm not clear where to hook up
this one.

This would be an interrupt generated by the BGX device, not the NIC device. It will have an MSI-X index of (6 + LMAC * 7). See BGX_INT_VEC_E in the HRM.

Note that I am telling you which interrupt it is, but not recommending that catching it and doing something is necessarily the best thing to do.
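
As a worked example of the index formula given above (6 + LMAC * 7), a small helper might look like this. The function and constant names are hypothetical; only the offset 6 and the stride 7 come from the message, and BGX_INT_VEC_E in the HRM remains the authoritative source.

```c
/* Values taken from the formula quoted in the thread: the vector for
 * this interrupt sits at offset 6 within a block of 7 vectors per LMAC.
 * The names are illustrative only. */
enum {
    GMI_TX_VEC_OFFSET = 6,
    VECS_PER_LMAC     = 7
};

/* MSI-X vector index of the GMP GMI TX interrupt for a given LMAC. */
unsigned int bgx_gmi_tx_msix_index(unsigned int lmac)
{
    return GMI_TX_VEC_OFFSET + lmac * VECS_PER_LMAC;
}
```

So LMAC 0 maps to index 6, LMAC 1 to index 13, and so on.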


You could also look at these RO registers:
BGXX_GMP_PCS_TXX_STATES
BGXX_GMP_PCS_RXX_STATES


These show the same before/after triggering the issue and
RX_BAD/TX_BAD are still 0.

Tim
