On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen <ilpo.jarvi...@helsinki.fi> wrote:
> On Tue, 27 Mar 2018, Yuchung Cheng wrote:
>
>> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
>> <ilpo.jarvi...@helsinki.fi> wrote:
>> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
>> >
>> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
>> >> <ilpo.jarvi...@helsinki.fi> wrote:
>> >> >
>> >> > A miscalculation for the number of acknowledged packets occurs during
>> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
>> >> > any non-retransmitted skbs. The reason is that the pkts_acked value
>> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
>> >> > RTO, as it may include segments that were not lost and therefore did
>> >> > not need retransmission in the slow start following the RTO. Then
>> >> > tcp_slow_start will add the excess into cwnd, bloating it and
>> >> > triggering a burst.
>> >> >
>> >> > Instead, we want to pass only the number of retransmitted segments
>> >> > that were covered by the cumulative ACK (and potentially newly sent
>> >> > data segments too, if the cumulative ACK covers that far).
>> >> >
>> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
>> >> > ---
>> >>
>> >> My understanding is there are two problems:
>> >>
>> >> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires the
>> >> precise cumulatively acked count, not the newly acked count?
>> >
>> > While I'm not entirely sure whether you intended to say that my fix is
>> > broken or not, I thought about this very difference a lot while making
>> > the fix, and I believe this fix is needed because of the discontinuity
>> > at RTO (sacked_out is cleared as we set L-bits + lost_out). This is an
>> > artifact in the imitation of sacked_out for non-SACK, but at RTO we
>> > can't keep that in sync because we set L-bits (and have no S-bits to
>> > guide us). Thus, we cannot "use" those skbs with only the L-bit for the
>> > reno_sacks logic anymore.
>> >
>> > In tcp_remove_reno_sacks, acked - sacked_out is used to calculate
>> > tp->delivered; using the plain cumulative acked causes congestion
>> > control breakage later, as the call to tcp_cong_control will directly
>> > use the difference in tp->delivered.
>> >
>> > This boils down to the exact definition of tp->delivered (the one
>> > given in the header is not detailed enough). I guess you might have a
>> > better idea what it exactly is, since one of you added it? There are
>> > subtle things in the definition that can make it entirely unsuitable
>> > for cc decisions. Should those segments that we (possibly) already
>> > counted into tp->delivered during the (potentially preceding)
>> > CA_Recovery be added to it a _second time_ or not? This fix avoids
>> > such double counting (the
>>
>> Where is the double counting, assuming normal DUPACK behavior?
>>
>> In the non-SACK case:
>>
>> 1. upon receiving a DUPACK, we assume one packet has been delivered by
>> incrementing tp->delivered in tcp_add_reno_sack()
>
> 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> I've tried to point out quite many times already)...
>
>> 2. upon receiving a partial ACK or an ACK that acks the recovery point
>> (high_seq), tp->delivered is incremented by (cumulatively acked -
>> #dupacks) in tcp_remove_reno_sacks()
>
> ...and this won't happen correctly anymore after RTO (since non-SACK
> won't keep #dupacks, due to the discontinuity). Thus we end up adding
> cumulatively acked - 0 to tp->delivered on those ACKs.
>
>> therefore tp->delivered is tracking the # of packets delivered
>> (sacked, acked, DUPACK'd) with the most information it could have
>> inferred.
>
> Since you didn't answer any of my questions about tp->delivered directly,
> let me rephrase them as this example (non-SACK, of course):
>
> 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> Cwnd = 2, so the sender rexmits 2 out of 4. We get a cumulative ACK for
> three segments.
> How much should tp->delivered be incremented? 2 or 3?
>
> ...I think 2 is the right answer.
>
>> From congestion control's perspective, it cares about the delivery
>> information (e.g. how much), not the sequences (what or how).
>
> I guess you must have missed my point. I'm talking about "how much" the
> whole time. It's just about when we can account for it (and when not).
>
>> I am pointing out that
>>
>> 1) your fix may fix one corner CC packet accounting issue in the
>> non-SACK and CA_Loss case
>> 2) but it does not fix the other (major) CC packet accounting issue in
>> the SACK case
>
> You say major, but it's major if and only if one is using one of the
> affected cc modules, which were not that many and IMHO not very
> mainstream anyway (check my list from the previous email; not that I'd
> mind if they got fixed too). The rest of the cc modules do not seem to
> use the incorrect value, even if some of them have the pkts_acked
> callback.
>
> Other CC packet accounting (regardless of SACK) is based on
> tp->delivered, and tp->delivered is NOT similarly miscounted with SACK
> because the logic depends on !TCPCB_SACKED_ACKED (that fails only if we
> have very high ACK loss).
>
>> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case,
>> as tcp_check_reno_reordering requires the strict cum. ack.
>
> I think that the current non-SACK reordering detection logic is not that
> sound after RTO (but I'm yet to prove this). ...There seem to be some
> failure modes, which is why I even thought of disabling the whole thing
> for non-SACK RTO recoveries, and as it now seems you care more about
> non-SACK than you initially claimed, I might even have motivation to fix
> more of those corners rather than the worst bugs only ;-). ...But I'll
> make the tp->delivered fix only in this patch, to avoid changing this
> part of the code.
>
>> Therefore I prefer
>> a) using tp->delivered to fix (1)(2)
>> b) perhaps improving the tp->delivered SACK emulation code in the
>> non-SACK case
>
> Somehow I get the impression that you might assume/say here that

[resending in plaintext] That's the wrong impression. Perhaps it's worth
re-iterating what I agree and disagree with:
1. [agree] there's an accounting issue in non-SACK, as you discovered,
which causes CC misbehavior

2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in
non-SACK => that field is not used by common C.C. (you said so too)

3. [disagree] adjusting pkts_acked may not affect reordering accounting
in non-SACK

For cubic or reno, the main source is the "acked_sacked" passed into
tcp_cong_avoid(). That variable is derived from tp->delivered. Therefore
we need to fix that to address the problem in (1).

I have yet to read your code. Will read later today.

> tp->delivered is all correct for non-SACK? ...It isn't without a fix.
> Therefore a) won't help any for non-SACK. tp->delivered for non-SACK is
> currently (mis!)calculated in tcp_remove_reno_sacks because the
> incorrect pkts_acked is fed to it. ...Thus, b) is an intermediate step
> in the miscalculation chain I'm fixing with this change. b) also
> resolves (1) without additional changes to the logic!
>
> I've included below the updated change that only fixes the tp->delivered
> calculation (not tested beyond compiling). ...But I think it's worse
> than my previous version because tcp_remove_reno_sacks then assumes
> non-sensical L|S skbs (but there seem to be other, worse limitations in
> the sacked_out imitation after RTO; since we have all skbs marked with
> the L-bit, it's not that big a deal for me, as most of that imitation
> code is a no-op anyway after RTO).
>
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9a1b3c1..e11748d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
>
>         if (acked > 0) {
>                 /* One ACK acked hole. The rest eat duplicate ACKs.
>                  */
> -               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
>                 if (acked - 1 >= tp->sacked_out)
>                         tp->sacked_out = 0;
>                 else
> @@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
>         tcp_verify_left_out(tp);
>  }
>
> +static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
> +{
> +       if (acked > 0)
> +               tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> +}
> +
>  static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
>  {
>         tp->sacked_out = 0;
> @@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>         long seq_rtt_us = -1L;
>         long ca_rtt_us = -1L;
>         u32 pkts_acked = 0;
> +       u32 rexmit_acked = 0;
> +       u32 newdata_acked = 0;
>         u32 last_in_flight = 0;
>         bool rtt_update;
>         int flag = 0;
> @@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                 }
>
>                 if (unlikely(sacked & TCPCB_RETRANS)) {
> -                       if (sacked & TCPCB_SACKED_RETRANS)
> +                       if (sacked & TCPCB_SACKED_RETRANS) {
>                                 tp->retrans_out -= acked_pcount;
> +                               rexmit_acked += acked_pcount;
> +                       }
>                         flag |= FLAG_RETRANS_DATA_ACKED;
>                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
>                         last_ackt = skb->skb_mstamp;
> @@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>                                 reord = start_seq;
>                         if (!after(scb->end_seq, tp->high_seq))
>                                 flag |= FLAG_ORIG_SACK_ACKED;
> +                       else
> +                               newdata_acked += acked_pcount;
>                 }
>
>                 if (sacked & TCPCB_SACKED_ACKED) {
> @@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
>         }
>
>         if (tcp_is_reno(tp)) {
> +               tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
> +                                             pkts_acked :
> +                                             rexmit_acked + newdata_acked);
>                 tcp_remove_reno_sacks(sk, pkts_acked);
>         } else {
>                 int delta;