Hi,

I am investigating a TCP stall that can occur when sending to an Android device 
(kernel 4.9.148) from an Ubuntu server running kernel 5.11.0.

The issue seems to be that RACK is not applied when a D-SACK (with SACK) is 
received on the server after an RTO re-transmission (CA_Loss state). Here the 
re-transmitted segment is considered to be already delivered and loss undo 
logic is applied. Then nothing is re-transmitted until the next RTO, where the 
next segment is sent and the same thing happens again. This causes the 
retransmitted segments to be delivered at a rate of ~1 per second, so a burst 
loss of e.g. 20 segments causes a 20+ second stall. I would expect RACK to kick 
in long before this happens.

Note the D-SACK should not be considered spurious, as the TSecr value matches 
the re-transmission TSval.
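
To make the timestamp argument concrete, here is a rough Python sketch of the 
Eifel-style check I have in mind. This is not the kernel's code; the function 
names are made up for illustration. The idea is that an ACK whose TSecr is 
older than the TSval of the re-transmission must have been triggered by the 
original transmission, and only in that case is the re-transmission spurious:

```python
def ts_before(a: int, b: int) -> bool:
    """True if 32-bit timestamp a is earlier than b, allowing for wraparound."""
    return ((a - b) & 0xFFFFFFFF) >= 0x80000000


def retransmission_looks_spurious(tsecr: int, retrans_tsval: int) -> bool:
    """Hypothetical helper: compare the echoed timestamp with the
    TSval carried by the re-transmitted segment."""
    # TSecr older than the re-transmission's TSval: the ACK echoes the
    # original transmission, so the re-transmission was unnecessary.
    return ts_before(tsecr, retrans_tsval)
```

In my traces TSecr equals the re-transmission TSval, so a check of this kind 
would return False and the re-transmission would not be treated as spurious.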

Also, the Android receiver is definitely sending strange D-SACKs that do not 
properly advance the ACK number to include received segments. However, I can't 
control it and need to fix it on the server by quickly re-transmitting the 
segments. The connection itself is functional. If the client makes a request to 
the server in this state, it can respond and the client will receive any 
segments sent in reply.

I can see from counters that TcpExtTCPLossUndo & TcpExtTCPSackFailures are 
incremented on the server when this happens.
The issue appears with F-RTO both enabled and disabled, and with both BBR and 
Reno.
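
For anyone who wants to watch the same counters, below is a minimal Python 
sketch that reads them from /proc/net/netstat. It assumes the usual layout of 
that file (a header line of names followed by a line of values, both starting 
with the same "TcpExt:" style prefix); the function names are my own. Keys are 
built as prefix plus name, so they match the names nstat prints:

```python
def parse_netstat(text: str) -> dict[str, int]:
    """Parse /proc/net/netstat-style text into {"TcpExtTCPLossUndo": n, ...}."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Lines come in pairs: "Prefix: name1 name2 ..." then "Prefix: v1 v2 ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        for name, num in zip(names.split(), nums.split()):
            counters[prefix + name] = int(num)
    return counters


def read_counters(path: str = "/proc/net/netstat") -> dict[str, int]:
    """Read and parse the counters from the running system."""
    with open(path) as f:
        return parse_netstat(f.read())
```

Taking one snapshot with read_counters() before the burst loss and one after 
the stall, then subtracting the two, shows how many times each counter was 
incremented during the stall.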

Any idea of why this happens, or suggestions on how to debug the issue further?

/Gil
