On Mon, 9 Jan 2006, Robin Humble wrote:
Until we turned off TSO on our cluster with
  ethtool -K eth0 tso off
  ethtool -K eth1 tso off
certain sized runs of some codes were getting:
KERNEL: assertion (!sk->sk_forward_alloc) failed at net/core/stream.c (279)
KERNEL: assertion (!sk->sk_forward_alloc) failed at net/ipv4/af_inet.c (148)
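For anyone reproducing this, the TSO state can be inspected as well as set with ethtool. A minimal sketch (interface names are from the report above; the `tso_state` parsing helper is hypothetical, and the exact `ethtool -k` label wording can vary between ethtool versions):

```shell
# Turn TSO off on both ports and list the offload settings (needs root
# and real hardware, so shown as comments only):
#   ethtool -K eth0 tso off
#   ethtool -K eth1 tso off
#   ethtool -k eth0

# tso_state: hypothetical helper that pulls the on/off state out of
# `ethtool -k` output read from stdin.
tso_state() {
    awk -F': ' '/tcp segmentation offload/ { print $2 }'
}

# Example against canned `ethtool -k`-style output:
printf 'rx checksumming: on\ntcp segmentation offload: off\n' | tso_state
# prints "off"
```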
Does anyone on netdev know why this would be relevant to TSO
enable/disable?
and a bunch of errors like this:
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  TDH                  <26>
  TDT                  <13>
  next_to_use          <13>
  next_to_clean        <25>
buffer_info[next_to_clean]
  time_stamp           <79694e6>
  next_to_watch        <2a>
  jiffies              <79695b9>
  next_to_watch.status <0>
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
  TDH                  <26>
  TDT                  <13>
  next_to_use          <13>
  next_to_clean        <25>
buffer_info[next_to_clean]
  time_stamp           <79694e6>
  next_to_watch        <2a>
  jiffies              <7969681>
  next_to_watch.status <0>
NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
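For readers decoding that dump: TDH/TDT are the hardware's Tx ring head and tail registers, and next_to_clean/next_to_use are the driver's software indices. A rough sketch of what the numbers say, under assumptions the log itself doesn't state (the values are hex, the Tx ring has 256 descriptors, and HZ=1000):

```shell
# How long the watched descriptor has been stuck: jiffies - time_stamp
# (values copied from the first hang report above, interpreted as hex).
echo $(( 0x79695b9 - 0x79694e6 ))    # 211 jiffies, ~211 ms if HZ=1000

# Descriptors still outstanding between cleanup and the tail, assuming
# a 256-entry ring (wraparound handled with the +256 then modulo):
echo $(( (0x13 - 0x25 + 256) % 256 ))    # 238
```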
Those errors are pretty serious, as TCP messages were being lost and
codes were hanging and crashing. Usually when codes died we saw the
KERNEL assertion, but sometimes they ran into problems with just the
eth0 or eth1 NETDEV resets :-/
Machines are dual Xeon 2.4GHz on the E7500 chipset with 1GB of RAM and
built-in dual e1000 82546EBs, running Red Hat EL AS4.
I tried a range of 2.6 kernels from 2.6.12 up to 2.6.15, the latest
e1000 driver 6.3.9-NAPI as well as 6.2.15-NAPI, and the default
drivers in the kernels (e.g. 6.1.16-k2-NAPI). Apart from different
syntax in the error messages they all behaved the same (i.e. the codes
died unless we set tso=off). Various ITR settings didn't help; our
default is 15000.
Thanks for trying the latest driver and kernel, that really helps us get
started.
The major problems only happen for >32 CPU parallel runs; smaller runs
work fine. Unfortunately we haven't found a simple small MPI code that
triggers the TSO problems.
Do you know what packet size triggered the problem? It sounds like the
network traffic at the time of failure is lots and lots of outstanding
transmits over many concurrent connections, is that correct?
We'd like to use TSO, as it means 5 to 10% less CPU usage for large
message sizes (but, strangely, a few more microseconds of latency).
See attached pic.
That's what TSO is supposed to help with. The latency increase can be
experimented with or mitigated by changing tcp_tso_win_divisor in
/proc/.../ipv4.
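For reference, that knob limits how large a share of the congestion window a single TSO burst may consume (roughly cwnd/divisor, default 3). A sketch of tuning it; the full sysctl name below is an assumption based on the 2.6 kernels in this thread:

```shell
# Read and change the divisor at runtime (needs root, so shown as
# comments only; assumed full name: net.ipv4.tcp_tso_win_divisor):
#   sysctl net.ipv4.tcp_tso_win_divisor
#   sysctl -w net.ipv4.tcp_tso_win_divisor=8

# Rough effect on burst size: with a 64 KB congestion window, raising
# the divisor from 3 to 8 shrinks the largest single TSO burst:
echo $(( 65536 / 3 )) $(( 65536 / 8 ))    # 21845 8192
```

Smaller bursts mean the NIC holds fewer bytes per send, which is how a larger divisor trades a little throughput for lower added latency.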
So what's the best way I can help you debug this TCP segmentation
offload issue?
We can start by getting some transmit ring dumps at the time of
failure. I have code to do this but need to port it to 2.6.15; I'll
try to get that code to you in the next couple of days.
Jesse