On Thu, 30 Mar 2006, Mark Butler wrote:

> David S. Miller wrote:
> 
> >This has been this way for centuries and it's the correct behavior.
> >
> >We double it on the way in to account for "struct sk_buff" etc.
> >overhead, applications assume that the SO_RCVBUF setting they make
> >will allow that much actual data to be received on that socket.
> >Applications are unaware that "struct sk_buff" and other overheads
> >allocate from the receive buffer during socket buffer allocation.
> >
> >And after considering the possible alternatives, returning the value
> >we actually used on get is the most desirable behavior.
> 
> Doubling the value passed via setsockopt(..., SO_RCVBUF,...) makes 
> perfect sense.

I don't think it makes perfect sense.  If there's overhead, fine, go
ahead and add the overhead, but do it under the covers, invisible
to the user.  And doubling definitely doesn't make sense.  For example,
on a 10-Gbps transcontinental link with a 90 ms RTT, the sender
SO_SNDBUF and receiver SO_RCVBUF should be the BW*RTT product, which
in MB is 0.090*10000000000/1024/1024/8 = 107 MB.  Doubling that gives
107 MB for overhead, which seems a mite excessive (and there are paths
in active use with double or more that RTT).
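The arithmetic above can be checked with a short sketch.  The function
name bdp_mb is made up for illustration, and 1 MB is taken as 1024*1024
bytes, as in the calculation above.

```python
def bdp_mb(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth*RTT product in MB (1 MB = 1024*1024 bytes)."""
    bits_in_flight = bandwidth_bps * rtt_s
    return bits_in_flight / 8 / 1024 / 1024


if __name__ == "__main__":
    # 10 Gbps link with a 90 ms RTT, the example from the text.
    window = bdp_mb(10e9, 0.090)
    print(f"BW*RTT = {window:.0f} MB, doubled = {2 * window:.0f} MB")
```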

>                 But what is the rationale for returning the doubled 
> value back in getsockopt(..., SO_RCVBUF, ....)?
> 
> All it appears to do is make applications believe / report they have 
> more buffer space than is actually available.

I definitely agree with this part.  The user only cares that their
application actually obtained the amount of buffer space they requested
for real user data, and not how much kernel overhead was required for
managing that buffer space.
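For anyone who wants to see the set/get mismatch on their own box, here
is a minimal sketch.  The helper name request_rcvbuf and the 64 KB size
are arbitrary choices for illustration.  On Linux the reported value is
expected to be about twice the request (after any clamping to
net.core.rmem_max); other systems may report something different.

```python
import socket


def request_rcvbuf(sock: socket.socket, nbytes: int) -> int:
    """Request nbytes of SO_RCVBUF and return what getsockopt reports afterwards."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    requested = 64 * 1024  # arbitrary example size
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        reported = request_rcvbuf(s, requested)
    print(f"requested {requested} bytes, getsockopt reports {reported} bytes")
```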

Further complicating matters is that you don't actually even get what
you requested when it comes to the receive window that's actually
advertised on the network wire.  Earlier kernels would only give you
a receive window that was 3/4 the requested SO_RCVBUF, so to get the
desired optimum network performance you would have to multiply your
desired SO_RCVBUF by 4/3.
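As a sketch of that 4/3 workaround (the function name is made up, and
the 3/4 ratio is simply the one described above for earlier kernels):

```python
def rcvbuf_for_window(desired_window_bytes: int) -> int:
    """SO_RCVBUF to request so that 3/4 of it covers the desired window.

    Integer ceiling of desired * 4/3, so the result is never too small.
    """
    return (desired_window_bytes * 4 + 2) // 3


if __name__ == "__main__":
    mb = 1024 * 1024
    desired = 90 * mb
    print(f"want a {desired // mb} MB window, "
          f"request {rcvbuf_for_window(desired) // mb} MB")
```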

The 2.6.15.4 kernel I am currently running is even funkier.  It advertises
a fixed value for the receive window (scaled by the window scale factor)
regardless of the requested SO_RCVBUF.

Here's a test with an 80 MB requested receiver SO_RCVBUF
(and also an 80 MB sender SO_SNDBUF):

chance4 (192.168.88.8) -> chance5 (192.168.88.9):

[EMAIL PROTECTED] nuttcp -w80m 192.168.88.9
 6069.0625 MB /  10.01 sec = 5086.1838 Mbps 100 %TX 74 %RX

tcpdump of beginning of transfer showing wscale is 12:

tcpdump: listening on eth0
01:01:20.490078 192.168.88.8.44379 > 192.168.88.9.5001: S [tcp sum ok] 
2540322474:2540322474(0) win 17920 <mss 8960,sackOK,timestamp 410221719 
0,nop,wscale 12>(DF) (ttl 64, id 16957, len 60)
01:01:20.492120 192.168.88.9.5001 > 192.168.88.8.44379: S [tcp sum ok] 
2569611102:2569611102(0) ack 2540322475 win 17896 <mss 8960,sackOK,timestamp 
410302705 410221719,nop,wscale 12> (DF) (ttl 64, id 0, len 60)
...

tcpdump near the end of transfer showing advertised receive window:

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
...
01:01:24.563081 192.168.88.9.5001 > 192.168.88.8.44379: . [tcp sum ok] 1:1(0) 
ack 4294005300 win 19203 <nop,nop,timestamp 410303112 410222126> (DF) (ttl 64, 
id 48880, len 52)
...

So the advertised receive window is 19203*2^12/1024^2 = 75 MB.
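The same calculation as a small sketch: the 16-bit win field from the
TCP header is shifted left by the window scale negotiated in the SYN
exchange.  The function names are just for illustration.

```python
def advertised_window_bytes(win: int, wscale: int) -> int:
    """Receive window in bytes: the header 'win' field shifted by the window scale."""
    return win << wscale


def to_mb(nbytes: int) -> float:
    """Bytes to MB, with 1 MB = 1024*1024 bytes."""
    return nbytes / (1024 * 1024)


if __name__ == "__main__":
    window = advertised_window_bytes(19203, 12)
    print(f"win 19203, wscale 12 -> {window} bytes = {to_mb(window):.0f} MB")
```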

Now here's a test with a 100 MB requested SO_RCVBUF:

[EMAIL PROTECTED] nuttcp -w100m 192.168.88.9
 5996.7500 MB /  10.02 sec = 5020.6207 Mbps 100 %TX 75 %RX

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
01:10:49.202097 192.168.88.8.53177 > 192.168.88.9.5001: S [tcp sum ok] 
3122099198:3122099198(0) win 17920 <mss 8960,sackOK,timestamp 410278583 
0,nop,wscale 12>(DF) (ttl 64, id 10569, len 60)
01:10:49.204184 192.168.88.9.5001 > 192.168.88.8.53177: S [tcp sum ok] 
3164733525:3164733525(0) ack 3122099199 win 17896 <mss 8960,sackOK,timestamp 
410359569 410278583,nop,wscale 12> (DF) (ttl 64, id 0, len 60)
...

Still a wscale of 12.

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
...
01:10:54.835437 192.168.88.9.5001 > 192.168.88.8.53177: . [tcp sum ok] 1:1(0) 
ack 4294041092 win 19203 <nop,nop,timestamp 410360132 410279146> (DF) (ttl 64, 
id 34999, len 52)
...

Hmmm, that same "win 19203", giving a 75 MB advertised window,
compared with the requested 100 MB.

And here's a test with a 60 MB requested SO_RCVBUF:

[EMAIL PROTECTED] nuttcp -w60m 192.168.88.9
 6229.3750 MB /  10.02 sec = 5215.3106 Mbps 100 %TX 77 %RX

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
01:13:58.522721 192.168.88.8.40883 > 192.168.88.9.5001: S [tcp sum ok] 
3319987801:3319987801(0) win 17920 <mss 8960,sackOK,timestamp 410297513 
0,nop,wscale 12>(DF) (ttl 64, id 23280, len 60)
01:13:58.524777 192.168.88.9.5001 > 192.168.88.8.40883: S [tcp sum ok] 
3367196353:3367196353(0) ack 3319987802 win 17896 <mss 8960,sackOK,timestamp 
410378499 410297513,nop,wscale 12> (DF) (ttl 64, id 0, len 60)

Again still a wscale of 12.

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
...
01:14:04.356838 192.168.88.9.5001 > 192.168.88.8.40883: . [tcp sum ok] 1:1(0) 
ack 4293936616 win 19203 <nop,nop,timestamp 410379082 410298096> (DF) (ttl 64, 
id 2990, len 52)
...

And again, that same magic "win 19203", giving a 75 MB advertised window,
compared with the requested 60 MB.

How about a test with a 200 MB requested SO_RCVBUF:

[EMAIL PROTECTED] nuttcp -w200m 192.168.88.9
 6237.5000 MB /  10.02 sec = 5222.1114 Mbps 100 %TX 81 %RX

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
01:34:40.772181 192.168.88.8.56183 > 192.168.88.9.5001: S [tcp sum ok] 
334680817:334680817(0) win 17920 <mss 8960,sackOK,timestamp 410421724 
0,nop,wscale 12> (DF) (ttl 64, id 23341, len 60)
01:34:40.774228 192.168.88.9.5001 > 192.168.88.8.56183: S [tcp sum ok] 
383386021:383386021(0) ack 334680818 win 17896 <mss 8960,sackOK,timestamp 
410502709 410421724,nop,wscale 12> (DF) (ttl 64, id 0, len 60)

Surprisingly the wscale is still 12.

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
...
01:34:46.892855 192.168.88.9.5001 > 192.168.88.8.56183: . [tcp sum ok] 1:1(0) 
ack 4293915820 win 19203 <nop,nop,timestamp 410503320 410422336> (DF) (ttl 64, 
id 18059, len 52)
...

And still, the same magic "win 19203", giving a 75 MB advertised window,
compared with the requested 200 MB.

Finally a test with a 40 MB requested SO_RCVBUF:

[EMAIL PROTECTED] nuttcp -w40m 192.168.88.9
 6021.4375 MB /  10.01 sec = 5046.2883 Mbps 100 %TX 73 %RX

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
01:39:29.428618 192.168.88.8.55019 > 192.168.88.9.5001: S [tcp sum ok] 
652638742:652638742(0) win 17920 <mss 8960,sackOK,timestamp 410450587 
0,nop,wscale 12> (DF) (ttl 64, id 32909, len 60)
01:39:29.430674 192.168.88.9.5001 > 192.168.88.8.55019: S [tcp sum ok] 
692929478:692929478(0) ack 652638743 win 17896 <mss 8960,sackOK,timestamp 
410531571 410450587,nop,wscale 12> (DF) (ttl 64, id 0, len 60)

Interestingly the wscale is still 12.

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
...
01:39:35.251992 192.168.88.9.5001 > 192.168.88.8.55019: . [tcp sum ok] 1:1(0) 
ack 4293999252 win 15358 <nop,nop,timestamp 410532153 410451169> (DF) (ttl 64, 
id 39091, len 52)
...

Hmmm, finally a different win value than 19203.  But a "win 15358" with
a wscale of 12 gives an advertised receive window of 15358*2^12/1024^2 = 60 MB,
compared with the requested 40 MB.
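To line the five tests up side by side, here is a small sketch that
recomputes the advertised window from each win value above.  The third
column, 3/4 of the doubled request, is only a comparison suggested by
the doubling and the 3/4 factor mentioned earlier; it is a guess at a
pattern, not something checked against the kernel source.

```python
MB = 1024 * 1024
WSCALE = 12

# (requested SO_RCVBUF in MB, "win" field seen near the end of the transfer)
OBSERVED = [(40, 15358), (60, 19203), (80, 19203), (100, 19203), (200, 19203)]


def summarize(observed=OBSERVED, wscale=WSCALE):
    """Return (requested MB, advertised MB, 3/4 of doubled request in MB) rows."""
    rows = []
    for requested_mb, win in observed:
        advertised_mb = (win << wscale) / MB
        guess_mb = 0.75 * 2 * requested_mb  # speculative comparison only
        rows.append((requested_mb, advertised_mb, guess_mb))
    return rows


if __name__ == "__main__":
    print("requested  advertised  3/4*2*requested")
    for requested_mb, advertised_mb, guess_mb in summarize():
        print(f"{requested_mb:9d}  {advertised_mb:10.1f}  {guess_mb:15.1f}")
```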

It's all black magic to me.

                                                -Bill



Verbose nuttcp runs with various requested SO_SNDBUF/SO_RCVBUF sizes:

[EMAIL PROTECTED] nuttcp -v -w40m 192.168.88.9
nuttcp-t: v5.1.12: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 192.168.88.9
nuttcp-t: time limit = 10.00 seconds
nuttcp-t: connect to 192.168.88.9
nuttcp-t: send window size = 83886080, receive window size = 524288
nuttcp-t: 6297.2500 MB in 10.01 real seconds = 644216.22 KB/sec = 5277.4192 Mbps
nuttcp-t: 100756 I/O calls, msec/call = 0.10, calls/sec = 10065.88
nuttcp-t: 0.0user 9.9sys 0:10real 100% 0i+0d 0maxrss 0+1pf 0+16csw

nuttcp-r: v5.1.12: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 192.168.88.8
nuttcp-r: send window size = 524288, receive window size = 83886080
nuttcp-r: 6297.2500 MB in 10.01 real seconds = 644213.32 KB/sec = 5277.3955 Mbps
nuttcp-r: 167922 I/O calls, msec/call = 0.06, calls/sec = 16775.92
nuttcp-r: 0.0user 8.5sys 0:10real 85% 0i+0d 0maxrss 0+2pf 97934+26csw

[EMAIL PROTECTED] nuttcp -v -w60m 192.168.88.9
nuttcp-t: v5.1.12: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 192.168.88.9
nuttcp-t: time limit = 10.00 seconds
nuttcp-t: connect to 192.168.88.9
nuttcp-t: send window size = 125829120, receive window size = 524288
nuttcp-t: 5750.1250 MB in 10.02 real seconds = 587661.57 KB/sec = 4814.1236 Mbps
nuttcp-t: 92002 I/O calls, msec/call = 0.11, calls/sec = 9182.21
nuttcp-t: 0.0user 9.9sys 0:10real 100% 0i+0d 0maxrss 0+1pf 0+30csw

nuttcp-r: v5.1.12: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 192.168.88.8
nuttcp-r: send window size = 524288, receive window size = 125829120
nuttcp-r: 5750.1250 MB in 10.02 real seconds = 587665.56 KB/sec = 4814.1563 Mbps
nuttcp-r: 263664 I/O calls, msec/call = 0.04, calls/sec = 26315.03
nuttcp-r: 0.0user 6.7sys 0:10real 68% 0i+0d 0maxrss 0+2pf 143009+23csw

[EMAIL PROTECTED] nuttcp -v -w80m 192.168.88.9
nuttcp-t: v5.1.12: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 192.168.88.9
nuttcp-t: time limit = 10.00 seconds
nuttcp-t: connect to 192.168.88.9
nuttcp-t: send window size = 167772160, receive window size = 524288
nuttcp-t: 6310.0000 MB in 10.01 real seconds = 645521.65 KB/sec = 5288.1134 Mbps
nuttcp-t: 100960 I/O calls, msec/call = 0.10, calls/sec = 10086.28
nuttcp-t: 0.0user 9.9sys 0:10real 100% 0i+0d 0maxrss 0+1pf 0+16csw

nuttcp-r: v5.1.12: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 192.168.88.8
nuttcp-r: send window size = 524288, receive window size = 167772160
nuttcp-r: 6310.0000 MB in 10.01 real seconds = 645522.43 KB/sec = 5288.1197 Mbps
nuttcp-r: 174166 I/O calls, msec/call = 0.06, calls/sec = 17399.85
nuttcp-r: 0.0user 8.4sys 0:10real 85% 0i+0d 0maxrss 0+2pf 95881+26csw

[EMAIL PROTECTED] nuttcp -v -w100m 192.168.88.9
nuttcp-t: v5.1.12: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 192.168.88.9
nuttcp-t: time limit = 10.00 seconds
nuttcp-t: connect to 192.168.88.9
nuttcp-t: send window size = 209715200, receive window size = 524288
nuttcp-t: 5958.1875 MB in 10.01 real seconds = 609535.87 KB/sec = 4993.3178 Mbps
nuttcp-t: 95331 I/O calls, msec/call = 0.11, calls/sec = 9524.00
nuttcp-t: 0.0user 9.9sys 0:10real 100% 0i+0d 0maxrss 0+1pf 0+32csw

nuttcp-r: v5.1.12: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 192.168.88.8
nuttcp-r: send window size = 524288, receive window size = 209715200
nuttcp-r: 5958.1875 MB in 10.01 real seconds = 609534.28 KB/sec = 4993.3048 Mbps
nuttcp-r: 163920 I/O calls, msec/call = 0.06, calls/sec = 16376.31
nuttcp-r: 0.0user 7.7sys 0:10real 77% 0i+0d 0maxrss 0+2pf 129073+20csw

[EMAIL PROTECTED] nuttcp -v -w200m 192.168.88.9
nuttcp-t: v5.1.12: socket
nuttcp-t: buflen=65536, nstream=1, port=5001 tcp -> 192.168.88.9
nuttcp-t: time limit = 10.00 seconds
nuttcp-t: connect to 192.168.88.9
nuttcp-t: send window size = 419430400, receive window size = 524288
nuttcp-t: 5880.1250 MB in 10.01 real seconds = 601546.47 KB/sec = 4927.8687 Mbps
nuttcp-t: 94082 I/O calls, msec/call = 0.11, calls/sec = 9399.16
nuttcp-t: 0.0user 9.9sys 0:10real 100% 0i+0d 0maxrss 0+1pf 0+30csw

nuttcp-r: v5.1.12: socket
nuttcp-r: buflen=65536, nstream=1, port=5001 tcp
nuttcp-r: accept from 192.168.88.8
nuttcp-r: send window size = 524288, receive window size = 419430400
nuttcp-r: 5880.1250 MB in 10.01 real seconds = 601548.52 KB/sec = 4927.8854 Mbps
nuttcp-r: 177422 I/O calls, msec/call = 0.06, calls/sec = 17725.22
nuttcp-r: 0.0user 7.2sys 0:10real 73% 0i+0d 0maxrss 0+2pf 133736+21csw
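The window sizes in the verbose output above come out at exactly twice
each -w value (83886080 bytes = 80 MB for -w40m, and so on).  A sketch
for pulling those figures out of the nuttcp lines follows.  The regular
expression is written against the output format shown here, and the
assumption that nuttcp prints the values it reads back with getsockopt
is mine.

```python
import re

WINDOW_RE = re.compile(r"send window size = (\d+), receive window size = (\d+)")


def window_sizes_mb(line: str):
    """Return (send MB, receive MB) from a nuttcp -v window line, or None."""
    m = WINDOW_RE.search(line)
    if m is None:
        return None
    return tuple(int(v) / (1024 * 1024) for v in m.groups())


if __name__ == "__main__":
    sample = "nuttcp-t: send window size = 83886080, receive window size = 524288"
    print(window_sizes_mb(sample))
```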

