On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> wrote:
>
> A miscalculation for the number of acknowledged packets occurs during
> RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> any non-retransmitted skbs. The reason is that pkts_acked value
> calculated in tcp_clean_rtx_queue is not correct for slow start after
> RTO as it may include segments that were not lost and therefore did
> not need retransmissions in the slow start following the RTO. Then
> tcp_slow_start will add the excess into cwnd bloating it and
> triggering a burst.
>
> Instead, we want to pass only the number of retransmitted segments
> that were covered by the cumulative ACK (and potentially newly sent
> data segments too if the cumulative ACK covers that far).
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> ---
>  net/ipv4/tcp_input.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9a1b3c1..4a26c09 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>          long seq_rtt_us = -1L;
>          long ca_rtt_us = -1L;
>          u32 pkts_acked = 0;
> +        u32 rexmit_acked = 0;
> +        u32 newdata_acked = 0;
>          u32 last_in_flight = 0;
>          bool rtt_update;
>          int flag = 0;
> @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                  }
>
>                  if (unlikely(sacked & TCPCB_RETRANS)) {
> -                        if (sacked & TCPCB_SACKED_RETRANS)
> +                        if (sacked & TCPCB_SACKED_RETRANS) {
>                                  tp->retrans_out -= acked_pcount;
> +                                rexmit_acked += acked_pcount;
> +                        }
>                          flag |= FLAG_RETRANS_DATA_ACKED;
>                  } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>                          last_ackt = skb->skb_mstamp;
> @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                                  reord = start_seq;
>                          if (!after(scb->end_seq, tp->high_seq))
>                                  flag |= FLAG_ORIG_SACK_ACKED;
> +                        else
> +                                newdata_acked += acked_pcount;
>                  }
>
>                  if (sacked & TCPCB_SACKED_ACKED) {
> @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>          }
>
>          if (tcp_is_reno(tp)) {
> +                /* Due to discontinuity on RTO in the artificial
> +                 * sacked_out calculations, TCP must restrict
> +                 * pkts_acked without SACK to rexmits and new data
> +                 * segments
> +                 */
> +                if (icsk->icsk_ca_state == TCP_CA_Loss)
> +                        pkts_acked = rexmit_acked + newdata_acked;
> +

My understanding is there are two problems:
1) your fix: the reordering logic in tcp_remove_reno_sacks() requires the
precise cumulatively acked count, not the newly acked count?

2) current code: pkts_acked can substantially over-estimate the newly
delivered packets in both the SACK and non-SACK cases. For example, say
99 out of 100 packets are already sacked, and the next ACK acks all 100
packets: pkts_acked == 100, but really only one packet was delivered.
It's wrong to inform congestion control that 100 packets have just been
delivered. AFAICT, the CCs that implement the pkts_acked callback all
treat pkts_acked as the count of newly delivered packets.

A better fix for both SACK and non-SACK seems to be moving the
ca_ops->pkts_acked call into tcp_cong_control(), where "acked_sacked" is
already calibrated? This is what BBR currently does to avoid these
pitfalls. A rough sketch of that idea is at the bottom of this mail.

>                  tcp_remove_reno_sacks(sk, pkts_acked);
>          } else {
>                  int delta;
> --
> 2.7.4
>
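FWIW, to make the cwnd "bloating" described in the changelog concrete for
the non-SACK case, here is a userspace toy (not kernel code) that only
mirrors the tcp_slow_start() arithmetic; the numbers are made up and the
sacked_out bookkeeping done by tcp_remove_reno_sacks() is ignored:

#include <stdio.h>

/* Same arithmetic as tcp_slow_start() in net/ipv4/tcp_cong.c,
 * minus the snd_cwnd_clamp handling.
 */
static unsigned int slow_start(unsigned int *cwnd, unsigned int ssthresh,
                               unsigned int acked)
{
        unsigned int new_cwnd = *cwnd + acked;

        if (new_cwnd > ssthresh)
                new_cwnd = ssthresh;
        acked -= new_cwnd - *cwnd;
        *cwnd = new_cwnd;
        return acked;   /* leftover, handed to congestion avoidance */
}

int main(void)
{
        /* After RTO: cwnd = 1, one segment retransmitted, but the
         * receiver already had the other 9, so the first ACK
         * cumulatively covers 10 segments.
         */
        unsigned int cwnd = 1, ssthresh = 20;

        slow_start(&cwnd, ssthresh, 10);        /* pkts_acked as counted today */
        printf("cwnd with over-counted pkts_acked=10: %u\n", cwnd);    /* 11 */

        cwnd = 1;
        slow_start(&cwnd, ssthresh, 1);         /* rexmit_acked + newdata_acked */
        printf("cwnd with restricted pkts_acked=1:    %u\n", cwnd);    /* 2 */
        return 0;
}

With the over-counted value the very first ACK after the RTO grows cwnd by
10 and the sender bursts the difference; with the restricted count it grows
by 1.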
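Something along the lines of the below, perhaps (completely untested
sketch against current tcp_input.c, only to illustrate where the hook
could move; which rate_sample fields to feed into the ack_sample is up
for discussion):

static void tcp_cong_control(struct sock *sk, u32 ack, u32 acked_sacked,
                             int flag, const struct rate_sample *rs)
{
        const struct inet_connection_sock *icsk = inet_csk(sk);

        if (icsk->icsk_ca_ops->cong_control) {
                icsk->icsk_ca_ops->cong_control(sk, rs);
                return;
        }

        /* Report only newly delivered packets to the CC module, so
         * pkts_acked no longer over-counts segments that were already
         * sacked (or, without SACK, received but never retransmitted).
         */
        if (icsk->icsk_ca_ops->pkts_acked) {
                struct ack_sample sample = { .pkts_acked = acked_sacked,
                                             .rtt_us = rs->rtt_us,
                                             .in_flight = rs->prior_in_flight };

                icsk->icsk_ca_ops->pkts_acked(sk, &sample);
        }

        if (tcp_in_cwnd_reduction(sk)) {
                /* Reduce cwnd if state mandates */
                tcp_cwnd_reduction(sk, acked_sacked, flag);
        } else if (tcp_may_raise_cwnd(sk, flag)) {
                /* Advance cwnd if state allows */
                tcp_cong_avoid(sk, ack, acked_sacked);
        }
        tcp_update_pacing_rate(sk);
}

(The existing pkts_acked call site in tcp_clean_rtx_queue() would move
away with it; the tcp_remove_reno_sacks() accounting in point 1 is a
separate question.)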