On Tue, Jan 10, 2006 at 10:16:23AM -0800, Jesse Brandeburg wrote:
>On Mon, 9 Jan 2006, Robin Humble wrote:
>>until we turned off tso on our cluster using
>>  ethtool -K eth0 tso off
>>  ethtool -K eth1 tso off
...
>>the major problems only happen for >32 cpu parallel runs. smaller runs
>>work fine. unfortunately we haven't found a simple small MPI code that
>>triggers the tso problems.
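as an aside, a quick sanity check that the tso setting actually took (assuming
ethtool's '-k' offload listing is available in your version) is:

  # TSO should now be reported as off
  ethtool -k eth0 | grep -i segmentation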
>do you know what packet size triggered the problem?

unfortunately not really.

>It sounds like the network traffic at the time of failure is lots and lots
>of outstanding transmits over many concurrent connections, is that correct?

repeatedly running 'netstat -t', the most traffic I see is something like:

  [EMAIL PROTECTED] ~]# netstat -t | grep -v '  0      0 '
  Active Internet connections (w/o servers)
  Proto Recv-Q Send-Q Local Address      Foreign Address    State
  tcp    21720      0 beer96:57747       beer82:42185       ESTABLISHED
  tcp    37176      0 beer96:57747       beer76:45725       ESTABLISHED
  tcp    37176      0 beer96:38197       beer76:39912       ESTABLISHED
  tcp    37176      0 beer96:57747       beer86:54898       ESTABLISHED
  tcp       24      0 beer96:57747       beer74:57982       ESTABLISHED
  tcp       24      0 beer96:38197       beer74:59163       ESTABLISHED
  tcp    37176      0 beer96:57747       beer86:54883       ESTABLISHED
  tcp       24      0 beer96:57747       beer74:57967       ESTABLISHED
  tcp       24      0 beer96:38197       beer74:59178       ESTABLISHED
  tcp    37176      0 beer96:38197       beer82:34102       ESTABLISHED
  tcp    37176      0 beer96:38197       beer82:34117       ESTABLISHED
  tcp        0  37176 beer96:38197       beer86:48448       ESTABLISHED
  tcp    37176      0 beer96:57747       beer84:54840       ESTABLISHED
  tcp    37176      0 beer96:57747       beer84:54827       ESTABLISHED
  tcp    37176      0 beer96:38197       beer80:42088       ESTABLISHED
  tcp    17376      0 beer96:57747       beer90:55069       ESTABLISHED
  tcp    21720      0 beer96:57747       beer90:55058       ESTABLISHED
  tcp    37176      0 beer96:57747       beer81:57843       ESTABLISHED
  tcp    37176      0 beer96:38197       beer91:51101       ESTABLISHED
  tcp    24792      0 beer96:57747       beer75:44797       ESTABLISHED
  tcp    37176      0 beer96:38197       beer77:54658       ESTABLISHED
  tcp    37176      0 beer96:57747       beer83:46847       ESTABLISHED
  tcp        0  23168 beer96:57747       beer83:46826       ESTABLISHED
  tcp    21248    976 beer96:57747       beer89:48842       ESTABLISHED
  tcp       24      0 beer96:38197       beer75:51657       ESTABLISHED
  tcp       24      0 beer96:57747       beer73:52901       ESTABLISHED
  tcp       24      0 beer96:57747       beer73:52886       ESTABLISHED
  tcp       48      0 beer96:38197       beer73:47969       ESTABLISHED
  [EMAIL PROTECTED] ~]# 

there are about 140 sockets open in total, but the rest have no traffic in
them at this instant.
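the 'grep -v' above just drops sockets with empty queues; for anyone wanting
to reproduce this, an awk version of the same filter - only a sketch, and it
assumes the default netstat layout where Recv-Q and Send-Q are the 2nd and
3rd fields:

  # show only sockets with data queued in Recv-Q or Send-Q
  netstat -tn | awk 'NR > 2 && ($2 > 0 || $3 > 0)'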

>>we'd like to use tso as it means 5 to 10% less cpu usage for large
>>message sizes (but strangely a few more micro-seconds latency).
>>see attached pic.
>That's what TSO is supposed to help.  The latency increase can be played
>with or mitigated by changing tcp_tso_win_divisor in /proc/.../ipv4

cool. thanks.
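for the archives, this is how I plan to poke at it - a sketch only, and I'm
assuming the tunable's full path is /proc/sys/net/ipv4/tcp_tso_win_divisor
(and that the 2.6 default is 3):

  # read the current divisor
  cat /proc/sys/net/ipv4/tcp_tso_win_divisor
  # a larger divisor should cap each TSO burst to a smaller fraction of
  # the congestion window, trading a little throughput for lower latency
  sysctl -w net.ipv4.tcp_tso_win_divisor=8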

>>so what's the best way I can help you debug the TCP segmentation offload
>>issue?
>we can start with getting some transmit ring dumps at the time of failure. 
>I have code to do this but need to port it to 2.6.15.  I'll try to get 
>that code to you in the next couple of days.

ok. ta.
actually the tx reset only happens occasionally with the 2.6.15 kernel
and the 6.3.9 e1000 driver. mostly the code just stops - presumably because
a message got lost. I'll check whether it hangs at a repeatable place...
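in the meantime we're keeping an eye on the failing nodes with something
like this (a sketch - I'm assuming the driver reports its Tx hangs/resets
via the kernel log, which matches what we've seen so far):

  # look for Tx hang / reset messages from the e1000 driver
  dmesg | grep -i -e e1000 -e hang
  # and record the exact driver version loaded on each node
  ethtool -i eth0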

cheers,
robin