On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
<ilpo.jarvi...@helsinki.fi> wrote:
> On Mon, 26 Mar 2018, Yuchung Cheng wrote:
>
>> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
>> <ilpo.jarvi...@helsinki.fi> wrote:
>> >
>> > A miscalculation for the number of acknowledged packets occurs during
>> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
>> > any non-retransmitted skbs. The reason is that the pkts_acked value
>> > calculated in tcp_clean_rtx_queue is not correct for slow start after
>> > RTO, as it may include segments that were not lost and therefore did
>> > not need retransmissions in the slow start following the RTO. Then
>> > tcp_slow_start will add the excess into cwnd, bloating it and
>> > triggering a burst.
>> >
>> > Instead, we want to pass only the number of retransmitted segments
>> > that were covered by the cumulative ACK (and potentially newly sent
>> > data segments too if the cumulative ACK covers that far).
>> >
>> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
>> > ---
>> >  net/ipv4/tcp_input.c | 16 +++++++++++++++-
>> >  1 file changed, 15 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> > index 9a1b3c1..4a26c09 100644
>> > --- a/net/ipv4/tcp_input.c
>> > +++ b/net/ipv4/tcp_input.c
>> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >         long seq_rtt_us = -1L;
>> >         long ca_rtt_us = -1L;
>> >         u32 pkts_acked = 0;
>> > +       u32 rexmit_acked = 0;
>> > +       u32 newdata_acked = 0;
>> >         u32 last_in_flight = 0;
>> >         bool rtt_update;
>> >         int flag = 0;
>> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                 }
>> >
>> >                 if (unlikely(sacked & TCPCB_RETRANS)) {
>> > -                       if (sacked & TCPCB_SACKED_RETRANS)
>> > +                       if (sacked & TCPCB_SACKED_RETRANS) {
>> >                                 tp->retrans_out -= acked_pcount;
>> > +                               rexmit_acked += acked_pcount;
>> > +                       }
>> >                         flag |= FLAG_RETRANS_DATA_ACKED;
>> >                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>> >                         last_ackt = skb->skb_mstamp;
>> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >                                 reord = start_seq;
>> >                         if (!after(scb->end_seq, tp->high_seq))
>> >                                 flag |= FLAG_ORIG_SACK_ACKED;
>> > +                       else
>> > +                               newdata_acked += acked_pcount;
>> >                 }
>> >
>> >                 if (sacked & TCPCB_SACKED_ACKED) {
>> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>> >         }
>> >
>> >         if (tcp_is_reno(tp)) {
>> > +               /* Due to discontinuity on RTO in the artificial
>> > +                * sacked_out calculations, TCP must restrict
>> > +                * pkts_acked without SACK to rexmits and new data
>> > +                * segments
>> > +                */
>> > +               if (icsk->icsk_ca_state == TCP_CA_Loss)
>> > +                       pkts_acked = rexmit_acked + newdata_acked;
>> > +
>> My understanding is there are two problems:
>>
>> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires a
>> precise cumulatively acked count, not the newly acked count?
>
> While I'm not entirely sure whether you intended to say that my fix is
> broken or not, I thought about this very difference a lot while making
> the fix, and I believe that this fix is needed because of the
> discontinuity at RTO (sacked_out is cleared as we set L-bits + lost_out).
> This is an artifact of the imitation of sacked_out for non-SACK, but at
> RTO we can't keep that in sync because we set L-bits (and have no S-bits
> to guide us). Thus, we cannot anymore "use" those skbs with only the
> L-bit for the reno_sacks logic.
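
To make the cwnd bloat described in the changelog concrete, here is a tiny
stand-alone user-space sketch (illustrative only: the simplified
slow_start() and all numbers are assumptions, not the kernel code):

/* Illustrative only, not kernel code: simplified slow-start arithmetic
 * showing why counting never-retransmitted segments in pkts_acked bloats
 * cwnd after an RTO in the non-SACK case.
 */
#include <stdio.h>

static unsigned int slow_start(unsigned int cwnd, unsigned int ssthresh,
                               unsigned int acked)
{
        /* Roughly the shape of tcp_slow_start(): grow cwnd by the number
         * of acked segments, capped at ssthresh. */
        unsigned int next = cwnd + acked;

        return next > ssthresh ? ssthresh : next;
}

int main(void)
{
        unsigned int cwnd = 2, ssthresh = 20;

        /* Assume a cumulative ACK after RTO covers 10 segments, of which
         * only 2 were actually retransmitted; the other 8 were never lost
         * and arrived thanks to the original transmissions. */
        unsigned int cum_acked = 10;
        unsigned int rexmit_acked = 2, newdata_acked = 0;

        printf("cwnd with raw pkts_acked:     %u\n",
               slow_start(cwnd, ssthresh, cum_acked));                     /* 12 */
        printf("cwnd with rexmit + new data:  %u\n",
               slow_start(cwnd, ssthresh, rexmit_acked + newdata_acked));  /*  4 */
        return 0;
}

With the raw count the toy cwnd jumps to 12; restricting the count to
retransmits plus new data grows it only to 4.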
>
> In tcp_remove_reno_sacks, acked - sacked_out is being used to calculate
> tp->delivered; using the plain cumulative acked count causes congestion
> control breakage later, as the call to tcp_cong_control will directly
> use the difference in tp->delivered.
>
> This boils down to the exact definition of tp->delivered (the one given
> in the header is not detailed enough). I guess you might have a better
> idea what it exactly is, since one of you added it? There are subtle
> things in the definition that can make it entirely unsuitable for cc
> decisions. Should those segments that we (possibly) already counted into
> tp->delivered during the (potentially preceding) CA_Recovery be added to
> it for a _second time_ or not? This fix avoids such double counting (the
> price is that we might underestimate).

Where is the double counting, assuming normal DUPACK behavior?
In the non-SACK case:

1. upon receiving a DUPACK, we assume one packet has been delivered by
   incrementing tp->delivered in tcp_add_reno_sack()

2. upon receiving a partial ACK, or an ACK that acks the recovery point
   (high_seq), tp->delivered is incremented by (cumulatively acked -
   #dupacks) in tcp_remove_reno_sacks()

Therefore tp->delivered is tracking the # of packets delivered (sacked,
acked, DUPACK'd) with the most information it could have inferred.

From congestion control's perspective, it cares about the delivery
information (e.g. how much), not the sequences (what or how).

I am pointing out that

1) your fix may fix one corner-case CC packet accounting issue in the
   non-SACK, CA_Loss case,

2) but it does not fix the other (major) CC packet accounting issue in
   the SACK case, and

3) it breaks the dynamic dupthresh / reordering detection in the non-SACK
   case, as tcp_check_reno_reordering requires a strictly cumulative
   acked count.

Therefore I prefer

a) using tp->delivered to fix (1) and (2), and

b) perhaps improving the tp->delivered SACK emulation code for the
   non-SACK case.

> For SACK+ACK losses, the similar question is: should those segments
> that we missed counting into tp->delivered during the preceding
> CA_Recovery (due to losing enough SACKs) be added into tp->delivered
> now, during RTO recovery, or not? (I'm not proposing we fix this unless
> we want to fix both issues at the same time here, as its impact with
> SACK is not that significant.) Is tp->delivered supposed to under- or
> overestimate (in the cases where we're not sure what/when something
> happened)? ...If it's overestimating under any circumstances (for the
> current ACK), we cannot base our cc decision on it.
>
> tcp_check_reno_reordering might like to have the cumulatively acked
> count, but due to the aforementioned discontinuity we anyway cannot
> accurately provide in CA_Loss what tcp_limit_reno_sacked expects (and
> the other CA states are unaffected by this fix). While we could call
> tcp_check_reno_reordering directly from tcp_clean_rtx_queue, it
> wouldn't remove the fundamental discontinuity problem that we have for
> what tcp_limit_reno_sacked assumes about sacked_out. It might actually
> be that tcp_limit_reno_sacked is just never going to work after we zero
> tp->sacked_out (this would actually be problem #3), or at least it ends
> up underestimating the reordering degree by the amount we cleared from
> sacked_out at RTO?
>
>> 2) current code: pkts_acked can substantially over-estimate the newly
>> delivered pkts in both the SACK and non-SACK cases. For example, let's
>> say 99/100 packets are already sacked, and the next ACK acks 100 pkts.
>> pkts_acked == 100, but really only one packet has been delivered. It's
>> wrong to inform congestion control that 100 packets have just been
>> delivered. AFAICT, the CCs that have pkts_acked callbacks all treat
>> pkts_acked as the newly delivered packets.
>>
>> A better fix for both SACK and non-SACK seems to be moving
>> ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked"
>> value is calibrated? This is what BBR is currently doing to avoid
>> these pitfalls.
>
> Unfortunately that would not fix the problematic calculation of
> tp->delivered.
>
> But you might be right that there's a problem #2 (it's hard to notice
> for real, though, as most of the cc modules don't seem to use it for
> anything). However, this is for pkts_acked, so are you sure what the cc
> modules exactly expect: the shrinkage in the # of outstanding packets,
> or the number of newly delivered packets?
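
As a toy illustration of that distinction, and of the non-SACK
tp->delivered bookkeeping described in the numbered steps earlier in this
mail, here is a small stand-alone user-space sketch (all names and numbers
are made up for illustration; this is not the kernel code):

/* Illustrative only, not kernel code: toy non-SACK "delivered"
 * bookkeeping.  Each DUPACK bumps delivered by one (as tcp_add_reno_sack()
 * is described to do above); the later cumulative ACK then adds
 * acked - dupacks (as tcp_remove_reno_sacks() is described to do), so the
 * DUPACK'd segments are not counted a second time.
 */
#include <stdio.h>

struct toy_tp {
        unsigned int delivered;   /* total segments inferred delivered */
        unsigned int sacked_out;  /* emulated: number of DUPACKs seen  */
};

static void on_dupack(struct toy_tp *tp)
{
        tp->sacked_out++;
        tp->delivered++;          /* assume one segment reached the peer */
}

static void on_cumulative_ack(struct toy_tp *tp, unsigned int acked)
{
        /* Count only what the DUPACKs did not already account for. */
        if (acked > tp->sacked_out)
                tp->delivered += acked - tp->sacked_out;
        tp->sacked_out = 0;
}

int main(void)
{
        struct toy_tp tp = { 0, 0 };
        unsigned int i;

        for (i = 0; i < 3; i++)      /* three DUPACKs ...                 */
                on_dupack(&tp);
        on_cumulative_ack(&tp, 10);  /* ... then an ACK covering 10 segs  */

        printf("delivered = %u\n", tp.delivered);   /* 10, not 13 */
        return 0;
}

In this toy, the final ACK advances delivered by only 7 because the three
DUPACKs were already counted; a raw cumulative count of 10 handed to a cc
module on that ACK would overstate what was newly delivered.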
>
> Looking (only) quickly into the modules (that use it other than in
> CA_Open):
> - Somewhat suspicious delayed-ACK detection logic in cdg
> - HTCP might want the shrinkage in the # of outstanding packets (but
>   I'm not entirely sure) for its throughput measurement
> - I guess TCP Illinois is broken (it assumes the value is the newly
>   delivered packets), but it would not need to use it at all, as it
>   just proxies the value in a local variable into
>   tcp_illinois_cong_avoid, which has the correct acked readily
>   available.
>
> There are separate cong_avoid and cong_control callbacks that are more
> obviously oriented towards doing cc, whereas I'd think the pkts_acked
> callback seems more oriented towards RTT-sample related operations.
> Therefore, I'm not entirely sure this is what is wanted for pkts_acked,
> especially given the HTCP example.
>
>
> --
> i.
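
For reference, the three callbacks being contrasted above look roughly
like this (a paraphrased, self-contained excerpt from memory of this
kernel era; member order and all other members are omitted, and the exact
signatures should be treated as approximate):

/* Paraphrased excerpt, not the real header.  Incomplete types are
 * forward-declared only so the snippet stands alone.
 */
typedef unsigned int u32;
struct sock;
struct rate_sample;   /* calibrated per-ACK delivery/rate information */
struct ack_sample;    /* carries pkts_acked and an RTT sample         */

struct tcp_congestion_ops_excerpt {
        /* per-ACK cwnd adjustment (Reno/CUBIC-style cc) */
        void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked);
        /* full cc decision driven by a calibrated rate sample (BBR style) */
        void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
        /* per-ACK hook whose sample->pkts_acked is the value discussed
         * above; typically used for RTT-sample related processing */
        void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
};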