Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

Ilpo Järvinen Tue, 27 Mar 2018 07:25:15 -0700

On Mon, 26 Mar 2018, Yuchung Cheng wrote:

> On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen
> <[email protected]> wrote:
> >
> > A miscalculation for the number of acknowledged packets occurs during
> > RTO recovery whenever SACK is not enabled and a cumulative ACK covers
> > any non-retransmitted skbs. The reason is that pkts_acked value
> > calculated in tcp_clean_rtx_queue is not correct for slow start after
> > RTO as it may include segments that were not lost and therefore did
> > not need retransmissions in the slow start following the RTO. Then
> > tcp_slow_start will add the excess into cwnd bloating it and
> > triggering a burst.
> >
> > Instead, we want to pass only the number of retransmitted segments
> > that were covered by the cumulative ACK (and potentially newly sent
> > data segments too if the cumulative ACK covers that far).
> >
> > Signed-off-by: Ilpo Järvinen <[email protected]>
> > ---
> >  net/ipv4/tcp_input.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9a1b3c1..4a26c09 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> >         long seq_rtt_us = -1L;
> >         long ca_rtt_us = -1L;
> >         u32 pkts_acked = 0;
> > +       u32 rexmit_acked = 0;
> > +       u32 newdata_acked = 0;
> >         u32 last_in_flight = 0;
> >         bool rtt_update;
> >         int flag = 0;
> > @@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> >                 }
> >
> >                 if (unlikely(sacked & TCPCB_RETRANS)) {
> > -                       if (sacked & TCPCB_SACKED_RETRANS)
> > +                       if (sacked & TCPCB_SACKED_RETRANS) {
> >                                 tp->retrans_out -= acked_pcount;
> > +                               rexmit_acked += acked_pcount;
> > +                       }
> >                         flag |= FLAG_RETRANS_DATA_ACKED;
> >                 } else if (!(sacked & TCPCB_SACKED_ACKED)) {
> >                         last_ackt = skb->skb_mstamp;
> > @@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> >                                 reord = start_seq;
> >                         if (!after(scb->end_seq, tp->high_seq))
> >                                 flag |= FLAG_ORIG_SACK_ACKED;
> > +                       else
> > +                               newdata_acked += acked_pcount;
> >                 }
> >
> >                 if (sacked & TCPCB_SACKED_ACKED) {
> > @@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 
> > prior_fack,
> >                 }
> >
> >                 if (tcp_is_reno(tp)) {
> > +                       /* Due to discontinuity on RTO in the artificial
> > +                        * sacked_out calculations, TCP must restrict
> > +                        * pkts_acked without SACK to rexmits and new data
> > +                        * segments
> > +                        */
> > +                       if (icsk->icsk_ca_state == TCP_CA_Loss)
> > +                               pkts_acked = rexmit_acked + newdata_acked;
> > +
> My understanding is there are two problems
> 
> 1) your fix: the reordering logic in tcp-remove_reno_sacks requires
> precise cumulatively acked count, not newly acked count?


While I'm not entirely sure if you intented to say that my fix is broken 
or not, I thought this very difference alot while making the fix and I 
believe that this fix is needed because of the discontinuity at RTO 
(sacked_out is cleared as we set L-bits + lost_out). This is an artifact 
in the imitation of sacked_out for non-SACK but at RTO we can't keep that 
in sync because we set L-bits (and have no S-bits to guide us). Thus, we 
cannot anymore "use" those skbs with only L-bit for the reno_sacks logic.

In tcp_remove_reno_sacks acked - sacked_out is being used to calculate 
tp->delivered, using plain cumulative acked causes congestion control 
breakage later as call to tcp_cong_control will directly use the 
difference in tp->delivered.

This boils down the exact definition of tp->delivered (the one given in 
the header is not detailed enough). I guess you might have better idea
what it exactly is since one of you has added it? There are subtle things 
in the defination that can make it entirely unsuitable for cc decisions. 
Should those segments that we (possibly) already counted into 
tp->delivered during (potentially preceeding) CA_Recovery be added to it 
for _second time_ or not? This fix avoids such double counting (the 
price is that we might underestimate). For SACK+ACK losses, the similar 
question is: Should those segments that we missed counting into 
tp->delivered during preceeding CA_Recovery (due to losing enough SACKs) 
be added into tp->delivered now during RTO recovery or not (I'm not 
proposing we fix this unless we want to fix the both issues at the same 
time here as its impact with SACK is not that significant)? Is 
tp->delivered supposed to under- or overestimate (in the cases we're not 
sure what/when something happened)? ...If it's overestimating under any 
circumstances (for the current ACK), we cannot base our cc decision on it.

tcp_check_reno_reordering might like to have the cumulatively acked count 
but due to the forementioned discontinuity we anyway cannot accurately 
provide in CA_Loss what tcp_limit_reno_sacked expects (and the 
other CA states are unaffected by this fix). While we could call 
tcp_check_reno_reordering directly from tcp_clean_rtx_queue, it wouldn't 
remove the fundamental discontinuity problem that we have for what 
tcp_limit_reno_sacked assumes about sacked_out. It might actually be so
that tcp_limit_reno_sacked is just never going to work after we zero 
tp->sacked_out (this would actually be the problem #3) or at least ends 
up underestimating the reordering degree by the amount we cleared from 
sacked_out at RTO?

> 2) current code: pkts_acked can substantially over-estimate the newly
> delivered pkts in both SACK and non-SACK cases. For example, let's say
> 99/100 packets are already sacked, and the next ACK acks 100 pkts.
> pkts_acked == 100 but really only one packet is delivered. It's wrong
> to inform congestion control that 100 packets have just delivered.
> AFAICT, the CCs that have pkts_acked callbacks all treat pkts_acked as
> the newly delivered packets.
>
> A better fix for both SACK and non-SACK, seems to be moving
> ca_ops->pkts_acked into tcp_cong_control, where the "acked_sacked" is
> calibrated? this is what BBR is currently doing to avoid these pitfalls.

Unfortunately that would not fix the problematic calculation of 
tp->delivered.

But you might be right that there's a problem #2 (it's hard to notice 
for real though as most of the cc modules don't seem to use it for 
anything). However, this is for pkts_acked so are you sure what the 
cc modules exactly expect: the shrinkage in # of outstanding packets or 
the number of newly delivered packets?

Looking (only) quickly into the modules (that use it other than in 
CA_Open):
- Somewhat suspicious delayed ACK detection logic in cdg
- HTCP might want shrinkage in # of outstanding packets (but I'm not 
entirely sure) for throughput measurement
- I guess TCP illinois is broken (assumes it's newly delivered packets) 
but it would not need to use it at all as it just proxies the value in 
a local variable into tcp_illinois_cong_avoid that has the correct acked 
readily available.

There are separate cong_avoid and cong_control callbacks that are more 
obviously oriented for doing cc where as I'd think pkts_acked callback 
seems more oriented to for RTT-sample related operations. Therefore, I'm 
not entirely sure I this is what is wanted for pkts_acked, especially 
given the HTCP example.


-- 
 i.

Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery

Reply via email to