On 19.02.2018 15:38, Neal Cardwell wrote:
On Sun, Feb 18, 2018 at 4:02 PM, Teodor Milkov <t...@del.bg> wrote:
Hello,

I have numerous reports from Windows users that after a kernel upgrade from 4.9
to 4.14 they experienced major slowdowns and transfer stalls.

After some digging, I found that the slowness starts with this commit:

  tcp: extend F-RTO to catch more spurious timeouts (89fe18e44)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=89fe18e44f7ee5ab1c90d0dff5835acee7751427

It was later partially reverted by this one:

  tcp: restrict F-RTO to work-around broken middle-boxes (cc663f4d4)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cc663f4d4c97b7297fb45135ab23cfd508b35a77

But, still, we had stalls until I fully reverted 89fe18e44.

Thanks for the report. Do you have any other details that might help
evaluate this issue?

I'm sorry I didn't provide more info. It was a long session.

Any packet traces, by any chance?

I'll try and obtain one.

Were the affected connections web browsing, videos, file transfer, etc?

The first reports were from POP3 users. When we asked them to try a file transfer, the problem persisted.

It seems the slowdowns/stalls aren't severe enough to frustrate web browsers.

Were there non-Windows users in this population that did not seem to be
affected by the stalls?

All reports were from Windows users. I was able to partially reproduce the problem only using Windows as well. Linux & Mac OS X are apparently immune.

Was the bottleneck primarily Ethernet, wifi, cellular, cable modem, etc?

In my test case it is a 100 Mbit/s long-haul MAN (Ethernet, 1 ms) and there's a 75 Mbit/s shaper on top of it set up by one of our ISPs. Not sure what kind of shaper/policer this is.

With 4.4 and 4.9 kernels as well as patched 4.14 I get a very steady ~6 MB/s. Otherwise it's up to 3 MB/s with frequent slowdowns below 500 KB/s and an average speed of about 1 MB/s.
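
To compare these figures with the 75 Mbit/s shaper, it may help to convert them to Mbit/s. The following is only a rough sketch: it assumes decimal units (1 MB = 10^6 bytes), ignores protocol overhead, and the names `mbytes_to_mbits` and `share_of_shaper` are made up for illustration.

```python
# Assumption: decimal units, 1 MB = 10**6 bytes and 1 Mbit = 10**6 bits.
SHAPER_MBIT_S = 75.0  # shaper rate quoted in the report


def mbytes_to_mbits(mbytes_per_s: float) -> float:
    """Convert a rate in MB/s to Mbit/s (8 bits per byte)."""
    return mbytes_per_s * 8.0


def share_of_shaper(mbytes_per_s: float, shaper_mbit_s: float = SHAPER_MBIT_S) -> float:
    """Fraction of the shaper rate used by a transfer running at mbytes_per_s."""
    return mbytes_to_mbits(mbytes_per_s) / shaper_mbit_s


if __name__ == "__main__":
    # Rates taken from the measurements above, in MB/s.
    observed = {
        "4.4 / 4.9 / patched 4.14 (steady)": 6.0,
        "unpatched 4.14 (peak)": 3.0,
        "unpatched 4.14 (average)": 1.0,
        "unpatched 4.14 (slowdown)": 0.5,
    }
    for label, rate in observed.items():
        print(f"{label}: {mbytes_to_mbits(rate):.0f} Mbit/s, "
              f"{share_of_shaper(rate):.0%} of the shaper rate")
```

Under these assumptions the steady ~6 MB/s corresponds to 48 Mbit/s, while the ~1 MB/s average on the unpatched kernel is 8 Mbit/s.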

Reporting customers were on all kinds of connectivity, from cellular to cable, reporting regressions from as low as 1 MB/s (with a good kernel) down to 50 KB/s. I suspect that the higher the RTT, the lower the speed.

Any middleboxes (firewall, NAT, etc) between the servers and users?

In my test there's a Linux stateful firewall, yes. Not sure about other reporters.

Does "stall" mean that the connection permanently froze, or temporarily slowed
down but eventually recovered?

In most cases it is a severe slowdown, which eventually recovers. Occasionally there were complete freezes, but these are rather rare.

I've deployed 4.14.20 with 89fe18e44 completely reverted and so far feedback from customers is positive.

Thank you very much for your attention to this.
