Well, I managed to concoct an updated test, this time with 1G links feeding into a 10G link: a 2.6.23-rc8 kernel on a system with four dual-port 82546GBs, connected to an HP ProCurve 3500 series switch with a 10G link to a system running 2.6.18-8.el5 (I was having difficulty getting cxgb3 going on my kernel.org kernels - firmware mismatches - so I booted RHEL5 there).

I put all four 1G interfaces into a balance_rr (mode=0) bond and started running just a single netperf TCP_STREAM test.
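
For reference, the bond was assembled with the stock bonding driver and ifenslave, roughly as below (the slave interface names and miimon value are illustrative rather than a cut-and-paste from the test system):

# modprobe bonding mode=balance-rr miimon=100
# ifconfig bond0 192.168.5.103 netmask 255.255.255.0 up
# ifenslave bond0 eth0 eth1 eth2 eth3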

On the bonding side:

hpcpc103:~/net-2.6.24/Documentation/networking# netstat -s -t | grep retran
    19050 segments retransmited
    9349 fast retransmits
    9698 forward retransmits
hpcpc103:~/net-2.6.24/Documentation/networking# ifconfig bond0 | grep pack
          RX packets:50708119 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58801285 errors:0 dropped:0 overruns:0 carrier:0
hpcpc103:~/net-2.6.24/Documentation/networking# netperf -H 192.168.5.106
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.106 (192.168.5.106) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.01    1267.99
hpcpc103:~/net-2.6.24/Documentation/networking# netstat -s -t | grep retran
    20268 segments retransmited
    9974 fast retransmits
    10291 forward retransmits
hpcpc103:~/net-2.6.24/Documentation/networking# ifconfig bond0 | grep pack
          RX packets:51636421 errors:0 dropped:0 overruns:0 frame:0
          TX packets:59899089 errors:0 dropped:0 overruns:0 carrier:0

On the receiving side:

[EMAIL PROTECTED] ~]# ifconfig eth5 | grep pack
          RX packets:58802455 errors:0 dropped:0 overruns:0 frame:0
          TX packets:50205304 errors:0 dropped:0 overruns:0 carrier:0
[EMAIL PROTECTED] ~]# ifconfig eth5 | grep pack
          RX packets:59900267 errors:0 dropped:0 overruns:0 frame:0
          TX packets:51124138 errors:0 dropped:0 overruns:0 carrier:0

So, there were 20268 - 19050 or 1218 retransmissions during the test. The sending side reported sending 59899089 - 58801285 or 1097804 packets, and the receiver reported receiving 59900267 - 58802455 or 1097812 packets.

Unless the switch was only occasionally duplicating segments or something, it looks like all the retransmissions were the result of duplicate ACKs from packet reordering.

For grins I varied the "reordering" sysctl and got:

# netstat -s -t | grep retran; for i in 3 4 5 6 7 8 9 10 20 30; do sysctl -w net.ipv4.tcp_reordering=$i; netperf -H 192.168.5.106 -P 0 -B "reorder $i"; netstat -s -t | grep retran; done
    13735 segments retransmited
    6581 fast retransmits
    7151 forward retransmits
net.ipv4.tcp_reordering = 3
 87380  16384  16384    10.01    1294.51   reorder 3
    15127 segments retransmited
    7330 fast retransmits
    7794 forward retransmits
net.ipv4.tcp_reordering = 4
 87380  16384  16384    10.01    1304.22   reorder 4
    16103 segments retransmited
    7807 fast retransmits
    8293 forward retransmits
net.ipv4.tcp_reordering = 5
 87380  16384  16384    10.01    1330.88   reorder 5
    16763 segments retransmited
    8155 fast retransmits
    8605 forward retransmits
net.ipv4.tcp_reordering = 6
 87380  16384  16384    10.01    1350.50   reorder 6
    17134 segments retransmited
    8356 fast retransmits
    8775 forward retransmits
net.ipv4.tcp_reordering = 7
 87380  16384  16384    10.01    1353.00   reorder 7
    17492 segments retransmited
    8553 fast retransmits
    8936 forward retransmits
net.ipv4.tcp_reordering = 8
 87380  16384  16384    10.01    1358.00   reorder 8
    17649 segments retransmited
    8625 fast retransmits
    9021 forward retransmits
net.ipv4.tcp_reordering = 9
 87380  16384  16384    10.01    1415.89   reorder 9
    17736 segments retransmited
    8666 fast retransmits
    9067 forward retransmits
net.ipv4.tcp_reordering = 10
 87380  16384  16384    10.01    1412.36   reorder 10
    17773 segments retransmited
    8684 fast retransmits
    9086 forward retransmits
net.ipv4.tcp_reordering = 20
 87380  16384  16384    10.01    1403.47   reorder 20
    17773 segments retransmited
    8684 fast retransmits
    9086 forward retransmits
net.ipv4.tcp_reordering = 30
 87380  16384  16384    10.01    1325.41   reorder 30
    17773 segments retransmited
    8684 fast retransmits
    9086 forward retransmits

I.e., fast retransmits triggered by reordering kept happening until the reorder limit (roughly, the number of duplicate ACKs TCP will tolerate before treating a segment as lost) was reasonably well above the number of links in the aggregate.

As for how things got reordered, Knuth knows exactly why. But it didn't need more than one connection, and that connection didn't have to vary the size of what it was passing to send(). Netperf was not making send() calls that were an integral multiple of the MSS, which means that from time to time a short segment would be queued to an interface in the bond. Also, two of the dual-port NICs were on 66 MHz PCI-X buses and the other two were on 133 MHz PCI-X buses (four buses in all), so the DMA times will have differed.
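
To put a rough number on that: with the 16384-byte sends shown in the netperf output above and the 8688-byte maximum segment size visible in the tcptrace summary below, 16384 = 8688 + 7696, so each send leaves a 7696-byte remainder, and whenever there is nothing further queued in the socket buffer to coalesce it with, that remainder goes out as a sub-MSS segment. Presumably such a short segment finishes its DMA and its time on the 1G wire sooner than the full-sized segment round-robined just ahead of it onto another slave, and so can arrive at the 10G side first.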


And as if this mail wasn't already long enough, here is a tcptrace summary for the netperf data connection with reordering set to 3:

================================
TCP connection 2:
        host c:        192.168.5.103:52264
        host d:        192.168.5.106:33940
        complete conn: yes
        first packet:  Fri Sep 28 14:06:43.271692 2007
        last packet:   Fri Sep 28 14:06:53.277018 2007
        elapsed time:  0:00:10.005326
        total packets: 1556191
        filename:      trace
   c->d:                              d->c:
     total packets:        699400           total packets:        856791
     ack pkts sent:        699399           ack pkts sent:        856791
     pure acks sent:            2           pure acks sent:       856789
     sack pkts sent:            0           sack pkts sent:       352480
     dsack pkts sent:           0           dsack pkts sent:         948
     max sack blks/ack:         0           max sack blks/ack:         3
     unique bytes sent: 1180423912           unique bytes sent:         0
     actual data pkts:     699397           actual data pkts:          0
     actual data bytes: 1180581744           actual data bytes:         0
     rexmt data pkts:         106           rexmt data pkts:           0
     rexmt data bytes:     157832           rexmt data bytes:          0
     zwnd probe pkts:           0           zwnd probe pkts:           0
     zwnd probe bytes:          0           zwnd probe bytes:          0
     outoforder pkts:      202461           outoforder pkts:           0
     pushed data pkts:       6057           pushed data pkts:          0
     SYN/FIN pkts sent:       1/1           SYN/FIN pkts sent:       1/1
     req 1323 ws/ts:          Y/Y           req 1323 ws/ts:          Y/Y
     adv wind scale:            7           adv wind scale:            9
     req sack:                  Y           req sack:                  Y
     sacks sent:                0           sacks sent:           352480
     urgent data pkts:          0 pkts      urgent data pkts:          0 pkts
     urgent data bytes:         0 bytes     urgent data bytes:         0 bytes
     mss requested:          1460 bytes     mss requested:          1460 bytes
     max segm size:          8688 bytes     max segm size:             0 bytes
     min segm size:             8 bytes     min segm size:             0 bytes
     avg segm size:          1687 bytes     avg segm size:             0 bytes
     max win adv:            5888 bytes     max win adv:          968704 bytes
     min win adv:            5888 bytes     min win adv:            8704 bytes
     zero win adv:              0 times     zero win adv:              0 times
     avg win adv:            5888 bytes     avg win adv:          798088 bytes
     initial window:         2896 bytes     initial window:            0 bytes
     initial window:            2 pkts      initial window:            0 pkts
     ttl stream length: 1577454360 bytes     ttl stream length:         0 bytes
     missed data:       397030448 bytes     missed data:               0 bytes
     truncated data:    1159600134 bytes     truncated data:            0 bytes
     truncated packets:    699383 pkts      truncated packets:         0 pkts
     data xmit time:       10.005 secs      data xmit time:        0.000 secs
     idletime max:            7.5 ms        idletime max:            7.4 ms
     throughput:        117979555 Bps       throughput:                0 Bps

This was taken at the receiving 10G NIC.
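
(That summary is just a tcpdump capture on the 10G side run through tcptrace's long output - roughly the following, where the snap length shown is only illustrative:

# tcpdump -i eth5 -s 96 -w trace
# tcptrace -l trace

and the large "truncated data" figure is simply a consequence of the snap length.)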

rick jones