Hi all,

We found a performance problem that occurs under heavy packet loss. It appears to be caused by a flaw in detecting the loss of retransmitted packets.
In the retransmission queue, the status of each sent packet is recorded. When a packet is retransmitted, it is marked as such, and snd_nxt at that moment (the sequence number of the next new, non-retransmitted packet to be sent) is recorded as its ack_seq. A retransmitted packet is considered lost if it has not been SACKed and its ack_seq is smaller than the sequence number of some already-SACKed packet.

An ACK packet can carry up to three SACK blocks. A SACK block consists of a start sequence number (start_seq) and an end sequence number (end_seq) of received data. In the current implementation of tcp_sacktag_write_queue(), if an ACK packet carries multiple SACK blocks, the blocks are sorted by start_seq in ascending order and processed in that order. To scoreboard packets in the retransmission queue, the queue is scanned from snd_una (the lowest sequence number among packets not yet ACKed) to the end_seq of the SACK block. To optimize this scan, processing of the next SACK block starts not from snd_una but from the end_seq of the previously processed SACK block.

In the current implementation, the loss of a retransmitted packet is detected by comparing its ack_seq with the end_seq of the SACK block currently being processed. As a result, a retransmitted packet whose ack_seq is smaller than the end_seq of the last SACK block but larger than the end_seq of the block currently being processed cannot be detected as lost. Such an undetected loss may eventually cause an RTO and degrade performance.

PATCH #1 fixes this problem by comparing the ack_seq with the largest end_seq among the SACK blocks of the ACK.

In addition, some SACK blocks in an ACK packet may already have been reported by preceding ACK packets. PATCH #2 optimizes processing by skipping such already-reported SACK blocks. Usually, only the first SACK block of an ACK packet is new, so in most cases applying PATCH #2 alone also avoids the problem. However, to ensure correct processing when an ACK packet carries multiple new SACK blocks, PATCH #2 should be applied together with PATCH #1. Simplified userspace sketches of both ideas are appended at the end of this message.

The experimental network is as follows:

  Node A ----> Router ----------> Delay emulator ----------> Node B
               (Policing rate:    (RTT: 20ms)
                500Mbps)

You can find the details of our experimental setup at
http://projects.gtrc.aist.go.jp/gnet/sack-bug.html

We transferred 1 GByte of data from Node A to Node B ten times. Here is the performance comparison of the cases with and without these patches:

            Ave. goodput   Ave. RTO
  2.6.22    376 Mbps       26
  PATCH#1   481 Mbps        0
  PATCH#2   483 Mbps        0

With the vanilla kernel, several RTOs (TCPTimeouts + TCPSackRecoveryFail) occur. With our patches, the RTOs are eliminated and the average goodput improves by 28%.

Any comments and ideas would be appreciated.

Regards,
Ryousei Takano
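
For illustration, here is a minimal userspace sketch of the lost-retransmit check described above. All structure and function names are our own, not the kernel's; only before()/after() mirror the helpers in include/net/tcp.h, and this is a simplified model rather than the actual patch. With use_patch1 = 0 it behaves like the current per-block comparison; with use_patch1 = 1 it implements the PATCH #1 idea of comparing against the largest end_seq.

#include <stdint.h>
#include <stdio.h>

/* Wrap-safe sequence comparison, as in include/net/tcp.h. */
static int before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}
#define after(seq2, seq1)	before(seq1, seq2)

struct sack_block {
	uint32_t start_seq, end_seq;
};

struct rtx_pkt {
	uint32_t seq;		/* first sequence number of the packet */
	uint32_t ack_seq;	/* snd_nxt recorded at retransmit time */
	int retrans;		/* packet has been retransmitted */
	int sacked;		/* packet is covered by a SACK block */
	int lost;		/* scoreboard marked the packet lost */
};

static void scoreboard(struct rtx_pkt *q, int n,
		       const struct sack_block *sb, int nsb,
		       uint32_t snd_una, int use_patch1)
{
	uint32_t max_end = sb[0].end_seq;
	uint32_t scan_from = snd_una;
	int b, i;

	/* PATCH #1 idea: the lost-retransmit threshold is the largest
	 * end_seq of all SACK blocks, not the current block's end_seq. */
	for (b = 1; b < nsb; b++)
		if (after(sb[b].end_seq, max_end))
			max_end = sb[b].end_seq;

	for (b = 0; b < nsb; b++) {
		uint32_t thresh = use_patch1 ? max_end : sb[b].end_seq;

		for (i = 0; i < n; i++) {
			/* Scan only [scan_from, end_seq): the next block
			 * resumes from the previous block's end_seq, the
			 * optimization described above. */
			if (before(q[i].seq, scan_from) ||
			    !before(q[i].seq, sb[b].end_seq))
				continue;
			if (!before(q[i].seq, sb[b].start_seq))
				q[i].sacked = 1;	/* inside the block */
			else if (q[i].retrans && !q[i].sacked &&
				 before(q[i].ack_seq, thresh))
				q[i].lost = 1;		/* lost retransmit */
		}
		scan_from = sb[b].end_seq;
	}
}

int main(void)
{
	/* Two SACK blocks, already sorted by start_seq. */
	struct sack_block blocks[] = { { 2000, 3000 }, { 6000, 7000 } };
	int patched;

	for (patched = 0; patched <= 1; patched++) {
		/* One packet, retransmitted at seq 1000 while snd_nxt was
		 * 5000, still not SACKed although [6000, 7000) arrived. */
		struct rtx_pkt q[] = {
			{ .seq = 1000, .ack_seq = 5000, .retrans = 1 }
		};

		scoreboard(q, 1, blocks, 2, 1000, patched);
		printf("%s: retransmission at seq 1000 marked lost = %d\n",
		       patched ? "PATCH #1" : "2.6.22  ", q[0].lost);
	}
	return 0;
}

With the current check, the retransmission at seq 1000 (ack_seq 5000) is compared only against the first block's end_seq (3000), because the second block's scan resumes from 3000 and never revisits it; with the PATCH #1 threshold (7000) it is correctly marked lost.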
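Similarly, here is a sketch of the PATCH #2 idea: remember the SACK blocks reported by preceding ACKs and skip any block of a new ACK that was already processed. The cache structure is again our own simplification; the actual patch may keep this state differently.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_SACK_BLOCKS	3	/* an ACK carries at most three blocks */

struct sack_block {
	uint32_t start_seq, end_seq;
};

struct sack_cache {
	struct sack_block blk[MAX_SACK_BLOCKS];
	int n;
};

/* Was this exact block already reported by a preceding ACK? */
static int sack_block_seen(const struct sack_cache *c,
			   const struct sack_block *sb)
{
	int i;

	for (i = 0; i < c->n; i++)
		if (c->blk[i].start_seq == sb->start_seq &&
		    c->blk[i].end_seq == sb->end_seq)
			return 1;
	return 0;
}

static void process_ack_sacks(struct sack_cache *c,
			      const struct sack_block *sb, int nsb)
{
	int b;

	for (b = 0; b < nsb; b++) {
		if (sack_block_seen(c, &sb[b])) {
			printf("skipping already reported block [%u, %u)\n",
			       (unsigned)sb[b].start_seq,
			       (unsigned)sb[b].end_seq);
			continue;
		}
		/* ...scoreboard the retransmission queue for sb[b]... */
		printf("processing new block [%u, %u)\n",
		       (unsigned)sb[b].start_seq, (unsigned)sb[b].end_seq);
	}
	/* Remember this ACK's blocks for the next ACK. */
	c->n = nsb < MAX_SACK_BLOCKS ? nsb : MAX_SACK_BLOCKS;
	memcpy(c->blk, sb, c->n * sizeof(*sb));
}

int main(void)
{
	struct sack_cache cache = { .n = 0 };
	struct sack_block ack1[] = { { 2000, 3000 } };
	struct sack_block ack2[] = { { 2000, 3000 }, { 6000, 7000 } };

	process_ack_sacks(&cache, ack1, 1);
	process_ack_sacks(&cache, ack2, 2);
	return 0;
}

In main(), the second ACK repeats the block [2000, 3000) already reported by the first ACK, so only its new block [6000, 7000) is scoreboarded.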