On Fri, 2017-02-17 at 01:20 -0500, Josh Hunt wrote:
> Eric
>
> A team here was using the TCP_NOTSENT_LOWAT socket option and noticed that
> more unsent data than they were expecting was sitting in the write queue. I
> took a look and noticed that while we don't allow allocation of new skbs once
> we exceed this value, we still allow adding data to the skb at the tail of the
> write queue. In this context that means we could add up to size_goal to the
> skb, which could be up to 64kb.
>
> The patch below attempts to put a cap on the amount we allow to write over
> the TCP_NOTSENT_LOWAT value at 50%. In cases where the setting is smaller this
> will allow the # of unsent bytes to more closely reflect the value. In cases
> where the setting is 128kb or higher this will have no impact compared to the
> current behavior. This should have two benefits: 1) finer-grain control of the
> amount of unsent data, 2) reduction of TCP memory for values of
> TCP_NOTSENT_LOWAT < 128k.
>
> I reran the netperf results from your original commit with and without my
> patch:
>
> 4.10.0-rc8:
> # echo $(( 128 * 1024 )) > /proc/sys/net/ipv4/tcp_notsent_lowat
> # (./super_netperf 200 -H remote -t TCP_STREAM -l 90 &); sleep 60; grep TCP /proc/net/protocols
> TCPv6 2064 2 21735 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
> TCP 1912 465 21735 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
>
> # echo $(( 64 * 1024 )) > /proc/sys/net/ipv4/tcp_notsent_lowat
> # (./super_netperf 200 -H remote -t TCP_STREAM -l 90 &); sleep 60; grep TCP /proc/net/protocols
> TCPv6 2064 2 19859 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
> TCP 1912 465 19859 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
>
> 4.10.0-rc8 + patch:
> # echo $(( 128 * 1024 )) > /proc/sys/net/ipv4/tcp_notsent_lowat
> # (./super_netperf 200 -H remote -t TCP_STREAM -l 90 &); sleep 60; grep TCP /proc/net/protocols
> TCPv6 2064 2 21570 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
> TCP 1912 465 21570 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
>
> # echo $(( 64 * 1024 )) > /proc/sys/net/ipv4/tcp_notsent_lowat
> # (./super_netperf 200 -H remote -t TCP_STREAM -l 90 &); sleep 60; grep TCP /proc/net/protocols
> TCPv6 2064 2 18257 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
> TCP 1912 465 18257 no 208 yes kernel y y y y y y y y y y y y y n y y y y y
>
> I still need to do more testing, but wanted to get feedback on the idea.
>
> Josh
>

This adds a cost to the fast path.

tcp_sendmsg() is insane.

We already have one skb granularity (64KB) for SO_SNDBUF, regardless of
whether TCP_NOTSENT_LOWAT is used or not.

It makes no sense really to try so hard to add all these checks.

I would prefer we fix the underrun problem of TCP_NOTSENT_LOWAT.

Namely: SACKs can come, but we do not send EPOLLOUT, and we can starve
the output or TLP.

Thanks
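
For anyone following the thread, the arithmetic behind the proposed 50% cap can be sketched roughly as below. This is only an illustrative model built from the description in the quoted message, not the patch itself: the helper names overshoot_current() and overshoot_patched() are hypothetical, and MAX_SIZE_GOAL is an assumption taken from the "up to 64kb" figure for size_goal.

```c
#include <stdint.h>

/* Assumption: upper bound on size_goal, from the "up to 64kb" figure quoted above. */
#define MAX_SIZE_GOAL (64u * 1024u)

/*
 * Hypothetical helper: worst-case bytes that can be appended to the tail skb
 * past TCP_NOTSENT_LOWAT without the patch. The limit only blocks new skb
 * allocation, so the tail skb can still grow by up to size_goal.
 */
uint32_t overshoot_current(uint32_t lowat)
{
    (void)lowat; /* the overshoot does not depend on the setting */
    return MAX_SIZE_GOAL;
}

/*
 * Hypothetical helper: worst-case overshoot with the proposed cap of 50% of
 * TCP_NOTSENT_LOWAT, still bounded by size_goal.
 */
uint32_t overshoot_patched(uint32_t lowat)
{
    uint32_t cap = lowat / 2;

    return cap < MAX_SIZE_GOAL ? cap : MAX_SIZE_GOAL;
}
```

Under this model a 64KB setting allows at most 32KB of extra unsent data instead of 64KB, while a setting of 128KB or more gives the same 64KB bound either way, which matches the "no impact" claim for larger values.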