On Wed, Mar 28, 2018 at 7:14 AM, Yuchung Cheng <ych...@google.com> wrote:
>
> On Wed, Mar 28, 2018 at 5:45 AM, Ilpo Järvinen
> <ilpo.jarvi...@helsinki.fi> wrote:
> > On Tue, 27 Mar 2018, Yuchung Cheng wrote:
> >
> >> On Tue, Mar 27, 2018 at 7:23 AM, Ilpo Järvinen
> >> <ilpo.jarvi...@helsinki.fi> wrote:
> >> > On Mon, 26 Mar 2018, Yuchung Cheng wrote:
> >> >
> >> >> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> >> >> <ilpo.jarvi...@helsinki.fi> wrote:
> >> >> >
> >> >> > A miscalculation of the number of acknowledged packets occurs during
> >> >> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> >> >> > any non-retransmitted skbs. The reason is that the pkts_acked value
> >> >> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> >> >> > RTO, as it may include segments that were not lost and therefore did
> >> >> > not need retransmission in the slow start following the RTO. Then
> >> >> > tcp_slow_start will add the excess into cwnd, bloating it and
> >> >> > triggering a burst.
> >> >> >
> >> >> > Instead, we want to pass only the number of retransmitted segments
> >> >> > that were covered by the cumulative ACK (and potentially newly sent
> >> >> > data segments too, if the cumulative ACK covers that far).
> >> >> >
> >> >> > Signed-off-by: Ilpo Järvinen <ilpo.jarvi...@helsinki.fi>
> >> >> > ---
> >> >>
> >> >> My understanding is that there are two problems:
> >> >>
> >> >> 1) your fix: the reordering logic in tcp_remove_reno_sacks requires a
> >> >> precise cumulatively acked count, not a newly acked count?
> >> >
> >> > While I'm not entirely sure if you intended to say that my fix is broken
> >> > or not, I thought about this very difference a lot while making the fix,
> >> > and I believe that this fix is needed because of the discontinuity at
> >> > RTO (sacked_out is cleared as we set L-bits + lost_out). This is an
> >> > artifact in the imitation of sacked_out for non-SACK, but at RTO we
> >> > can't keep that in sync because we set L-bits (and have no S-bits to
> >> > guide us). Thus, we can no longer "use" those skbs with only the L-bit
> >> > for the reno_sacks logic.
> >> >
> >> > In tcp_remove_reno_sacks, acked - sacked_out is used to calculate
> >> > tp->delivered; using the plain cumulative acked count causes congestion
> >> > control breakage later, as the call to tcp_cong_control will directly
> >> > use the difference in tp->delivered.
> >> >
> >> > This boils down to the exact definition of tp->delivered (the one given
> >> > in the header is not detailed enough). I guess you might have a better
> >> > idea what it exactly is, since one of you added it? There are subtle
> >> > things in the definition that can make it entirely unsuitable for cc
> >> > decisions. Should those segments that we (possibly) already counted
> >> > into tp->delivered during a (potentially preceding) CA_Recovery be
> >> > added to it a _second time_ or not? This fix avoids such double
> >> > counting (the
> >> Where is the double counting, assuming normal DUPACK behavior?
> >>
> >> In the non-sack case:
> >>
> >> 1. upon receiving a DUPACK, we assume one packet has been delivered by
> >> incrementing tp->delivered in tcp_add_reno_sack()
> >
> > 1b. RTO here. We clear tp->sacked_out at RTO (i.e., the discontinuity
> > I've tried to point out quite many times already)...
> >
> >> 2. upon receiving a partial ACK or an ACK that acks the recovery point
> >> (high_seq), tp->delivered is incremented by (cumulatively acked -
> >> #dupacks) in tcp_remove_reno_sacks()
> >
> > ...and this won't happen correctly anymore after RTO (since non-SACK
> > won't keep #dupacks due to the discontinuity). Thus we end up adding
> > cumulatively acked - 0 to tp->delivered on those ACKs.
> >
> >> therefore tp->delivered is tracking the # of packets delivered
> >> (sacked, acked, DUPACK'd) with the most information it could have
> >> inferred.
> >
> > Since you didn't answer any of my questions about tp->delivered directly,
> > let me rephrase them as this example (non-SACK, of course):
> >
> > 4 segments outstanding. RTO recovery underway (lost_out=4, sacked_out=0).
> > Cwnd = 2, so the sender rexmits 2 out of 4. We get a cumulative ACK for
> > three segments. By how much should tp->delivered be incremented? 2 or 3?
> >
> > ...I think 2 is the right answer.
> >
> >> From congestion control's perspective, it cares about the delivery
> >> information (e.g. how much), not the sequences (what or how).
> >
> > I guess you must have missed my point. I'm talking about "how much" the
> > whole time. It's just about when we can account for it (and when not).
> >
> >> I am pointing out that
> >>
> >> 1) your fix may fix one corner CC packet accounting issue in the
> >> non-SACK and CA_Loss case
> >> 2) but it does not fix the other (major) CC packet accounting issue in
> >> the SACK case
> >
> > You say major, but it's major if and only if one is using one of the
> > affected cc modules, which were not that many and IMHO not very
> > mainstream anyway (check my list from the previous email; not that I'd
> > mind if they got fixed too). The rest of the cc modules do not seem to
> > use the incorrect value even if some of them have the pkts_acked
> > callback.
> >
> > Other CC packet accounting (regardless of SACK) is based on
> > tp->delivered, and tp->delivered is NOT similarly miscounted with SACK
> > because the logic depends on !TCPCB_SACKED_ACKED (that fails only if we
> > have very high ACK loss).
> >
> >> 3) it breaks the dynamic dupthresh / reordering in the non-SACK case
> >> as tcp_check_reno_reordering requires a strictly cumulative ack count.
> >
> > I think that the current non-SACK reordering detection logic is not that
> > sound after RTO (but I'm yet to prove this). ...There seem to be some
> > failure modes, which is why I even thought of disabling the whole thing
> > for non-SACK RTO recoveries, and as it now seems you do care more about
> > non-SACK than you initially claimed, I might even have motivation to fix
> > more of those corners rather than the worst bugs only ;-). ...But I'll
> > make the tp->delivered fix only in this patch, to avoid any change to
> > this part of the code.
> >
> >> Therefore I prefer
> >> a) using tp->delivered to fix (1)(2)
> >> b) perhaps improving tp->delivered SACK emulation code in the non-SACK
> >> case
> >
> > Somehow I get an impression that you might assume/say here that
> [resending in plaintext]
> That's the wrong impression. Perhaps it's worth re-iterating what I
> agree and disagree with:
>
> 1. [agree] there's an accounting issue in non-SACK, as you discovered,
> which causes CC misbehavior
>
> 2. [major disagree] adjusting pkts_acked for ca_ops->pkts_acked in non-sack
> => that field is not used by common C.C. (you said so too)
>
> 3. [disagree] adjusting pkts_acked may not affect reordering
> accounting in non-sack
>
> For cubic or reno, the main source is the "acked_sacked" passed into
> tcp_cong_avoid(). That variable is derived from tp->delivered.
> Therefore we need to fix that to address the problem in (1).
>
> I have yet to read your code. Will read later today.

Your patch looks good. Some questions / comments:
1. Why only apply to CA_Loss and not also CA_Recovery? This may mitigate
   the GRO issue I noted in the other thread.

2. Minor style nit: can we adjust tp->delivered directly in
   tcp_clean_rtx_queue()? I am personally against naming "non-sack" code
   reno (e.g. tcp_*reno*()) because it's really confusing with the Reno
   congestion control when SACK is used. One day I'd like to rename all
   these *reno* to _nonsack_.

Thanks

> > tp->delivered is all correct for non-SACK? ...It isn't without a fix.
> > Therefore a) won't help any for non-SACK. Tp->delivered for non-SACK is
> > currently (mis!)calculated in tcp_remove_reno_sacks because the
> > incorrect pkts_acked is fed to it. ...Thus, b) is an intermediate step
> > in the miscalculation chain I'm fixing with this change. B) also
> > resolves (1) without additional changes to the logic!
> >
> > I've included below the updated change that only fixes the tp->delivered
> > calculation (not tested besides compiling). ...But I think it's worse
> > than my previous version, because tcp_remove_reno_sacks then assumes
> > non-sensical L|S skbs (but there seem to be other, worse limitations in
> > the sacked_out imitation after RTO; since we have all skbs marked with
> > the L-bit, it's not that big a deal for me, as most of that imitation
> > code is a no-op after RTO anyway).
> >
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9a1b3c1..e11748d 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -1868,7 +1868,6 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
> >
> >  	if (acked > 0) {
> >  		/* One ACK acked hole. The rest eat duplicate ACKs. */
> > -		tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> >  		if (acked - 1 >= tp->sacked_out)
> >  			tp->sacked_out = 0;
> >  		else
> > @@ -1878,6 +1877,12 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
> >  	tcp_verify_left_out(tp);
> >  }
> >
> > +static void tcp_update_reno_delivered(struct tcp_sock *tp, int acked)
> > +{
> > +	if (acked > 0)
> > +		tp->delivered += max_t(int, acked - tp->sacked_out, 1);
> > +}
> > +
> >  static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
> >  {
> >  	tp->sacked_out = 0;
> > @@ -3027,6 +3032,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >  	long seq_rtt_us = -1L;
> >  	long ca_rtt_us = -1L;
> >  	u32 pkts_acked = 0;
> > +	u32 rexmit_acked = 0;
> > +	u32 newdata_acked = 0;
> >  	u32 last_in_flight = 0;
> >  	bool rtt_update;
> >  	int flag = 0;
> > @@ -3056,8 +3063,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >  		}
> >
> >  		if (unlikely(sacked & TCPCB_RETRANS)) {
> > -			if (sacked & TCPCB_SACKED_RETRANS)
> > +			if (sacked & TCPCB_SACKED_RETRANS) {
> >  				tp->retrans_out -= acked_pcount;
> > +				rexmit_acked += acked_pcount;
> > +			}
> >  			flag |= FLAG_RETRANS_DATA_ACKED;
> >  		} else if (!(sacked & TCPCB_SACKED_ACKED)) {
> >  			last_ackt = skb->skb_mstamp;
> > @@ -3070,6 +3079,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >  				reord = start_seq;
> >  			if (!after(scb->end_seq, tp->high_seq))
> >  				flag |= FLAG_ORIG_SACK_ACKED;
> > +			else
> > +				newdata_acked += acked_pcount;
> >  		}
> >
> >  		if (sacked & TCPCB_SACKED_ACKED) {
> > @@ -3151,6 +3162,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
> >  	}
> >
> >  	if (tcp_is_reno(tp)) {
> > +		tcp_update_reno_delivered(tp, icsk->icsk_ca_state != TCP_CA_Loss ?
> > +					      pkts_acked :
> > +					      rexmit_acked + newdata_acked);
> >  		tcp_remove_reno_sacks(sk, pkts_acked);
> >  	} else {
> >  		int delta;