> -----Original Message----- > From: maowenan > Sent: Tuesday, August 01, 2017 8:20 PM > To: 'Neal Cardwell'; David Miller > Cc: netdev@vger.kernel.org; Yuchung Cheng; Nandita Dukkipati > Subject: RE: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data > ACKed/SACKed > > > > > -----Original Message----- > > From: netdev-ow...@vger.kernel.org > > [mailto:netdev-ow...@vger.kernel.org] > > On Behalf Of Neal Cardwell > > Sent: Tuesday, August 01, 2017 10:58 AM > > To: David Miller > > Cc: netdev@vger.kernel.org; Neal Cardwell; Yuchung Cheng; Nandita > > Dukkipati > > Subject: [PATCH net 3/3] tcp: fix xmit timer to only be reset if data > > ACKed/SACKed > > > > Fix a TCP loss recovery performance bug raised recently on the netdev > > list, in two threads: > > > > (i) July 26, 2017: netdev thread "TCP fast retransmit issues" > > (ii) July 26, 2017: netdev thread: > > "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one > > outstanding TLP retransmission" > > > > The basic problem is that incoming TCP packets that did not indicate > > forward progress could cause the xmit timer (TLP or RTO) to be rearmed > > and pushed back in time. In certain corner cases this could result in > > the following problems noted in these threads: > > > > - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes > > could cause TCP to repeatedly schedule TLPs forever. We kept > > sending TLPs after every ~200ms, which elicited bogus SACKs, which > > caused more TLPs, ad infinitum; we never fired an RTO to fill in > > the holes. > > > > - Incoming data segments could, in some cases, cause us to reschedule > > our RTO or TLP timer further out in time, for no good reason. This > > could cause repeated inbound data to result in stalls in outbound > > data, in the presence of packet loss. > > > > This commit fixes these bugs by changing the TLP and RTO ACK processing to: > > > > (a) Only reschedule the xmit timer once per ACK. > > > > (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the > > ACK indicates sufficient forward progress (a packet was > > cumulatively ACKed, or we got a SACK for a packet that was sent > > before the most recent retransmit of the write queue head). > > > > This brings us back into closer compliance with the RFCs, since, as > > the comment for tcp_rearm_rto() notes, we should only restart the RTO > > timer after forward progress on the connection. Previously we were > > restarting the xmit timer even in these cases where there was no forward > progress. > > > > As a side benefit, this commit simplifies and speeds up the TCP timer > > arming logic. We had been calling inet_csk_reset_xmit_timer() three > > times on normal ACKs that cumulatively acknowledged some data: > > > > 1) Once near the top of tcp_ack() to switch from TLP timer to RTO: > > if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) > > tcp_rearm_rto(sk); > > > > 2) Once in tcp_clean_rtx_queue(), to update the RTO: > > if (flag & FLAG_ACKED) { > > tcp_rearm_rto(sk); > > > > 3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO > > to TLP: > > if (icsk->icsk_pending == ICSK_TIME_RETRANS) > > tcp_schedule_loss_probe(sk); > > > > This commit, by only rescheduling the xmit timer once per ACK, > > simplifies the code and reduces CPU overhead. > > > > This commit was tested in an A/B test with Google web server traffic. > > SNMP stats and request latency metrics were within noise levels, > > substantiating that for normal web traffic patterns this is a rare > > issue. This commit was also tested with packetdrill tests to verify > > that it fixes the timer behavior in the corner cases discussed in the netdev > threads mentioned above. > > > > This patch is a bug fix patch intended to be queued for -stable relases. > > > > Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)") > > Reported-by: Klavs Klavsen <k...@vsen.dk> > > Reported-by: Mao Wenan <maowe...@huawei.com> > > Signed-off-by: Neal Cardwell <ncardw...@google.com> > > Signed-off-by: Yuchung Cheng <ych...@google.com> > > Signed-off-by: Nandita Dukkipati <nandi...@google.com> > > --- > > net/ipv4/tcp_input.c | 25 ++++++++++++++++--------- > > net/ipv4/tcp_output.c | 9 --------- > > 2 files changed, 16 insertions(+), 18 deletions(-) > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index > > 345febf0a46e..3e777cfbba56 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -107,6 +107,7 @@ int sysctl_tcp_invalid_ratelimit __read_mostly = > HZ/2; > > #define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are > > (s)acked */ > > #define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= > > FLAG_DATA_ACKED) */ > > #define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info > > */ > > +#define FLAG_SET_XMIT_TIMER 0x1000 /* Set TLP or RTO timer */ > > #define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a > sacked > > seq */ > > #define FLAG_UPDATE_TS_RECENT 0x4000 /* tcp_replace_ts_recent() */ > > #define FLAG_NO_CHALLENGE_ACK 0x8000 /* do not call > > tcp_send_challenge_ack() */ > > @@ -3016,6 +3017,13 @@ void tcp_rearm_rto(struct sock *sk) > > } > > } > > > > +/* Try to schedule a loss probe; if that doesn't work, then schedule > > +an RTO. */ static void tcp_set_xmit_timer(struct sock *sk) { > > + if (!tcp_schedule_loss_probe(sk)) > > + tcp_rearm_rto(sk); > > +} > > + > > /* If we get here, the whole TSO packet has not been acked. */ > > static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb) { @@ > > -3177,7 +3185,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int > prior_fackets, > > ca_rtt_us, sack->rate); > > > > if (flag & FLAG_ACKED) { > > - tcp_rearm_rto(sk); > > + flag |= FLAG_SET_XMIT_TIMER; /* set TLP or RTO timer */ > > if (unlikely(icsk->icsk_mtup.probe_size && > > !after(tp->mtu_probe.probe_seq_end, tp->snd_una))) > > { > > tcp_mtup_probe_success(sk); > > @@ -3205,7 +3213,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, > > int prior_fackets, > > * after when the head was last (re)transmitted. Otherwise the > > * timeout may continue to extend in loss recovery. > > */ > > - tcp_rearm_rto(sk); > > + flag |= FLAG_SET_XMIT_TIMER; /* set TLP or RTO timer */ > > } > > > > if (icsk->icsk_ca_ops->pkts_acked) { @@ -3577,9 +3585,6 @@ static > > int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) > > if (after(ack, tp->snd_nxt)) > > goto invalid_ack; > > > > - if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE) > > - tcp_rearm_rto(sk); > > - > > if (after(ack, prior_snd_una)) { > > flag |= FLAG_SND_UNA_ADVANCED; > > icsk->icsk_retransmits = 0; > > @@ -3644,18 +3649,20 @@ static int tcp_ack(struct sock *sk, const > > struct sk_buff *skb, int flag) > > flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una, &acked, > > &sack_state); > > > > + if (tp->tlp_high_seq) > > + tcp_process_tlp_ack(sk, ack, flag); > > + /* If needed, reset TLP/RTO timer; RACK may later override this. */ > [Mao Wenan] I have question about RACK, if there is no RACK feature in lower > version, who can clear this flag:FLAG_SET_XMIT_TIMER? > > > + if (flag & FLAG_SET_XMIT_TIMER) > > + tcp_set_xmit_timer(sk); > > + > > if (tcp_ack_is_dubious(sk, flag)) { > > is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | > FLAG_NOT_DUP)); > > tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit); > > } > > - if (tp->tlp_high_seq) > > - tcp_process_tlp_ack(sk, ack, flag); > > > > if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) > > sk_dst_confirm(sk); > > > > - if (icsk->icsk_pending == ICSK_TIME_RETRANS) > > - tcp_schedule_loss_probe(sk); > > delivered = tp->delivered - delivered; /* freshly ACKed or SACKed */ > > lost = tp->lost - lost; /* freshly marked lost */ > > tcp_rate_gen(sk, delivered, lost, sack_state.rate); diff --git > > a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index > > 0ae6b5d176c0..c99cba897b9c 100644 > > --- a/net/ipv4/tcp_output.c > > +++ b/net/ipv4/tcp_output.c > > @@ -2380,21 +2380,12 @@ bool tcp_schedule_loss_probe(struct sock *sk) > > u32 rtt = usecs_to_jiffies(tp->srtt_us >> 3); > > u32 timeout, rto_delta_us; > > > > - /* No consecutive loss probes. */ > > - if (WARN_ON(icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)) { > > - tcp_rearm_rto(sk); > > - return false; > > - } [Mao Wenan]Follow previous mail, in lower version such as 3.10, I found there are many timer type, e.g:ICSK_TIME_EARLY_RETRANS, RTO,PTO are used. I'm not sure there exist some unknown problem if we don't check isck_pending here and below if branch? And could you please post one lower version patch such as 3.10? Thanks a lot. #define ICSK_TIME_RETRANS 1 /* Retransmit timer */ #define ICSK_TIME_DACK 2 /* Delayed ack timer */ #define ICSK_TIME_PROBE0 3 /* Zero window probe timer */ #define ICSK_TIME_EARLY_RETRANS 4 /* Early retransmit timer */ #define ICSK_TIME_LOSS_PROBE 5 /* Tail loss probe timer */
> > /* Don't do any loss probe on a Fast Open connection before 3WHS > > * finishes. > > */ > > if (tp->fastopen_rsk) > > return false; > > > > - /* TLP is only scheduled when next timer event is RTO. */ > > - if (icsk->icsk_pending != ICSK_TIME_RETRANS) > > - return false; > > - > > /* Schedule a loss probe in 2*RTT for SACK capable connections > > * in Open state, that are either limited by cwnd or application. > > */ > > -- > > 2.14.0.rc0.400.g1c36432dff-goog