On Mon, 26 Mar 2018, Yuchung Cheng wrote: > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen > <ilpo.jarvi...@helsinki.fi> wrote: > > > > A miscalculation for the number of acknowledged packets occurs during > > RTO recovery whenever SACK is not enabled and a cumulative ACK covers > > any non-retransmitted skbs. The reason is that pkts_acked value > > calculated in tcp_clean_rtx_queue is not correct for slow start after > > RTO as it may include segments that were not lost and therefore did > > not need retransmissions in the slow start following the RTO. Then > > tcp_slow_start will add the excess into cwnd bloating it and > > triggering a burst. > > > > Instead, we want to pass only the number of retransmitted segments > > that were covered by the cumulative ACK (and potentially newly sent > > data segments too if the cumulative ACK covers that far). > > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> > > --- > > net/ipv4/tcp_input.c | 16 +++++++++++++++- > > 1 file changed, 15 insertions(+), 1 deletion(-) > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 9a1b3c1..4a26c09 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > long seq_rtt_us = -1L; > > long ca_rtt_us = -1L; > > u32 pkts_acked = 0; > > + u32 rexmit_acked = 0; > > + u32 newdata_acked = 0; > > u32 last_in_flight = 0; > > bool rtt_update; > > int flag = 0; > > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > } > > > > if (unlikely(sacked & TCPCB_RETRANS)) { > > - if (sacked & TCPCB_SACKED_RETRANS) > > + if (sacked & TCPCB_SACKED_RETRANS) { > > tp->retrans_out -= acked_pcount; > > + rexmit_acked += acked_pcount; > > + } > > flag |= FLAG_RETRANS_DATA_ACKED; > > } else if (!(sacked & TCPCB_SACKED_ACKED)) { > > last_ackt = skb->skb_mstamp; > > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > reord = start_seq; > > if (!after(scb->end_seq, tp->high_seq)) > > flag |= FLAG_ORIG_SACK_ACKED; > > + else > > + newdata_acked += acked_pcount; > > } > > > > if (sacked & TCPCB_SACKED_ACKED) { > > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > > prior_fack, > > } > > > > if (tcp_is_reno(tp)) { > > + /* Due to discontinuity on RTO in the artificial > > + * sacked_out calculations, TCP must restrict > > + * pkts_acked without SACK to rexmits and new data > > + * segments > > + */ > > + if (icsk->icsk_ca_state == TCP_CA_Loss) > > + pkts_acked = rexmit_acked + newdata_acked; > > + > My understanding is there are two problems > > 1) your fix: the reordering logic in tcp-remove_reno_sacks requires > precise cumulatively acked count, not newly acked count?
While I'm not entirely sure if you intented to say that my fix is broken or not, I thought this very difference alot while making the fix and I believe that this fix is needed because of the discontinuity at RTO (sacked_out is cleared as we set L-bits + lost_out). This is an artifact in the imitation of sacked_out for non-SACK but at RTO we can't keep that in sync because we set L-bits (and have no S-bits to guide us). Thus, we cannot anymore "use" those skbs with only L-bit for the reno_sacks logic. In tcp_remove_reno_sacks acked - sacked_out is being used to calculate tp->delivered, using plain cumulative acked causes congestion control breakage later as call to tcp_cong_control will directly use the difference in tp->delivered. This boils down the exact definition of tp->delivered (the one given in the header is not detailed enough). I guess you might have better idea what it exactly is since one of you has added it? There are subtle things in the defination that can make it entirely unsuitable for cc decisions. Should those segments that we (possibly) already counted into tp->delivered during (potentially preceeding) CA_Recovery be added to it for _second time_ or not? This fix avoids such double counting (the price is that we might underestimate). For SACK+ACK losses, the similar question is: Should those segments that we missed counting into tp->delivered during preceeding CA_Recovery (due to losing enough SACKs) be added into tp->delivered now during RTO recovery or not (I'm not proposing we fix this unless we want to fix the both issues at the same time here as its impact with SACK is not that significant)? Is tp->delivered supposed to under- or overestimate (in the cases we're not sure what/when something happened)? ...If it's overestimating under any circumstances (for the current ACK), we cannot base our cc decision on it. tcp_check_reno_reordering might like to have the cumulatively acked count but due to the forementioned discontinuity we anyway cannot accurately provide in CA_Loss what tcp_limit_reno_sacked expects (and the other CA states are unaffected by this fix). While we could call tcp_check_reno_reordering directly from tcp_clean_rtx_queue, it wouldn't remove the fundamental discontinuity problem that we have for what tcp_limit_reno_sacked assumes about sacked_out. It might actually be so that tcp_limit_reno_sacked is just never going to work after we zero tp->sacked_out (this would actually be the problem #3) or at least ends up underestimating the reordering degree by the amount we cleared from sacked_out at RTO? > 2) current code: pkts_acked can substantially over-estimate the newly > delivered pkts in both SACK and non-SACK cases. For example, let's say > 99/100 packets are already sacked, and the next ACK acks 100 pkts. > pkts_acked == 100 but really only one packet is delivered. It's wrong > to inform congestion control that 100 packets have just delivered. > AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as > the newly delivered packets. > > A better fix for both SACK and non-SACK, seems to be moving > ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is > calibrated? this is what BBR is currently doing to avoid these pitfalls. Unfortunately that would not fix the problematic calculation of tp->delivered. But you might be right that there's a problem #2 (it's hard to notice for real though as most of the cc modules don't seem to use it for anything). However, this is for pkts_acked so are you sure what the cc modules exactly expect: the shrinkage in # of outstanding packets or the number of newly delivered packets? Looking (only) quickly into the modules (that use it other than in CA_Open): - Somewhat suspicious delayed ACK detection logic in cdg - HTCP might want shrinkage in # of outstanding packets (but I'm not entirely sure) for throughput measurement - I guess TCP illinois is broken (assumes it's newly delivered packets) but it would not need to use it at all as it just proxies the value in a local variable into tcp_illinois_cong_avoid that has the correct acked readily available. There are separate cong_avoid and cong_control callbacks that are more obviously oriented for doing cc where as I'd think pkts_acked callback seems more oriented to for RTT-sample related operations. Therefore, I'm not entirely sure I this is what is wanted for pkts_acked, especially given the HTCP example. -- i.