(Cc Randy)

On Fri, Nov 9, 2018 at 10:13 AM yupeng <yupeng0...@gmail.com> wrote:
>
> The snmp_counter.rst runs a set of simple experiments and explains the
> meaning of snmp counters based on the experiments' results. This is
> an initial version that only covers a small part of the snmp counters.

I haven't looked into the details much, so just a few high-level comments:

1. Please try to group those counters by protocol; it would be easier
to search.

2. For many counters you provide a link to an RFC. Did you just copy
and paste them? Please try to expand on them.

3. _I think_ you don't need to show, for example, how to run a ping
command. It's safe to assume readers already know this. Therefore,
just explaining those counters is okay.

Thanks.

>
> Signed-off-by: yupeng <yupeng0...@gmail.com>
> ---
>  Documentation/networking/index.rst        |   1 +
>  Documentation/networking/snmp_counter.rst | 963 ++++++++++++++++++++++
>  2 files changed, 964 insertions(+)
>  create mode 100644 Documentation/networking/snmp_counter.rst
>
> diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
> index bd89dae8d578..6a47629ef8ed 100644
> --- a/Documentation/networking/index.rst
> +++ b/Documentation/networking/index.rst
> @@ -31,6 +31,7 @@ Contents:
>     net_failover
>     alias
>     bridge
> +   snmp_counter
>
>  .. only::  subproject
>
> diff --git a/Documentation/networking/snmp_counter.rst b/Documentation/networking/snmp_counter.rst
> new file mode 100644
> index 000000000000..2939c5acf675
> --- /dev/null
> +++ b/Documentation/networking/snmp_counter.rst
> @@ -0,0 +1,963 @@
> +=====================
> +snmp counter tutorial
> +=====================
> +
> +This document explains the meaning of snmp counters. To make their
> +meanings easier to understand, this document doesn't explain the
> +counters one by one, but creates a set of experiments and explains the
> +counters based on the experiments' results. The experiments are run on
> +one or two virtual machines. Except for the test commands we use in the
> +experiments, the virtual machines have no other network traffic. We use
> +the 'nstat' command to get the values of snmp counters. Before every
> +test, we run 'nstat -n' to update the history, so the 'nstat' output
> +only shows the changes of the snmp counters.
> +For more information about nstat, please refer to:
> +
> +http://man7.org/linux/man-pages/man8/nstat.8.html
> +
> +icmp ping
> +=========
> +
> +Run the ping command against the public dns server 8.8.8.8::
> +
> +  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
> +  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
> +  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms
> +
> +  --- 8.8.8.8 ping statistics ---
> +  1 packets transmitted, 1 received, 0% packet loss, time 0ms
> +  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms
> +
> +The nstat result::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  IcmpInMsgs                      1                  0.0
> +  IcmpInEchoReps                  1                  0.0
> +  IcmpOutMsgs                     1                  0.0
> +  IcmpOutEchos                    1                  0.0
> +  IcmpMsgInType0                  1                  0.0
> +  IcmpMsgOutType8                 1                  0.0
> +  IpExtInOctets                   84                 0.0
> +  IpExtOutOctets                  84                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +The nstat output could be divided into two parts: one with the 'Ext'
> +keyword, another without the 'Ext' keyword. If the counter name
> +doesn't have 'Ext', it is defined by one of the snmp RFCs; if it has
> +'Ext', it is a kernel extension counter. Below we explain them one by
> +one.
> +
> +The rfc defined counters
> +------------------------
> +
> +* IpInReceives
> +The total number of input datagrams received from interfaces,
> +including those received in error.
> +
> +https://tools.ietf.org/html/rfc1213#page-26
> +
> +* IpInDelivers
> +The total number of input datagrams successfully delivered to IP
> +user-protocols (including ICMP).
> +
> +https://tools.ietf.org/html/rfc1213#page-28
> +
> +* IpOutRequests
> +The total number of IP datagrams which local IP user-protocols
> +(including ICMP) supplied to IP in requests for transmission. Note
> +that this counter does not include any datagrams counted in
> +ipForwDatagrams.
> +
> +https://tools.ietf.org/html/rfc1213#page-28
> +
> +* IcmpInMsgs
> +The total number of ICMP messages which the entity received.
> +Note that this counter includes all those counted by icmpInErrors.
> +
> +https://tools.ietf.org/html/rfc1213#page-41
> +
> +* IcmpInEchoReps
> +The number of ICMP Echo Reply messages received.
> +
> +https://tools.ietf.org/html/rfc1213#page-42
> +
> +* IcmpOutMsgs
> +The total number of ICMP messages which this entity attempted to send.
> +Note that this counter includes all those counted by icmpOutErrors.
> +
> +https://tools.ietf.org/html/rfc1213#page-43
> +
> +* IcmpOutEchos
> +The number of ICMP Echo (request) messages sent.
> +
> +https://tools.ietf.org/html/rfc1213#page-45
> +
> +IcmpMsgInType0 and IcmpMsgOutType8 are not defined by any snmp related
> +RFCs, but their meanings are quite straightforward: they count the
> +number of packets of a specific icmp type. We could find the icmp
> +types here:
> +
> +https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml
> +
> +Type 8 is echo, type 0 is echo reply.
> +
> +Now we can easily explain these items of the nstat output: We sent an
> +icmp echo request, so IpOutRequests, IcmpOutMsgs, IcmpOutEchos and
> +IcmpMsgOutType8 were increased by 1. We got an icmp echo reply from
> +8.8.8.8, so IpInReceives, IcmpInMsgs, IcmpInEchoReps and IcmpMsgInType0
> +were increased by 1. The icmp echo reply was passed to the icmp layer
> +via the ip layer, so IpInDelivers was increased by 1.
> +
> +Please note, these metrics are not aware of LRO/GRO, e.g., IpOutRequests
> +might count 1 packet, but hardware splits it to 2, and sends them
> +separately.
> +
> +IpExtInOctets and IpExtOutOctets
> +--------------------------------
> +They are linux kernel extensions, no rfc definitions. Please note,
> +rfc1213 indeed defines ifInOctets and ifOutOctets, but they
> +are different things. The ifInOctets and ifOutOctets are packet
> +sizes which include the mac layer. But IpExtInOctets and IpExtOutOctets
> +are only ip layer sizes.
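
The difference between the interface-level and ip-level octet counters can
be checked by hand for the ping in this example. The sketch below is only
illustrative: the constant and function names are made up for this note,
the 64-byte ICMP message size is taken from the ping output above ("64
bytes from 8.8.8.8"), and the 14-byte mac header assumes plain Ethernet
without VLAN tags.

```python
# Illustrative arithmetic only; the names are hypothetical, not kernel symbols.
MAC_HEADER = 14      # Ethernet header, assumed untagged
IP_HEADER = 20       # IPv4 header without options
ICMP_MESSAGE = 64    # ICMP header plus payload, as reported by ping


def ip_layer_octets(l4_bytes=ICMP_MESSAGE):
    """Bytes an ip-layer counter such as IpExtInOctets would account for."""
    return IP_HEADER + l4_bytes


def link_layer_octets(l4_bytes=ICMP_MESSAGE):
    """Bytes an interface-level counter such as ifInOctets would account for."""
    return MAC_HEADER + ip_layer_octets(l4_bytes)


if __name__ == "__main__":
    print(ip_layer_octets())    # 84
    print(link_layer_octets())  # 98
```

84 is the value nstat shows for IpExtInOctets and IpExtOutOctets, and 98 is
the frame size tcpreplay reports for the captured ping further down.
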
> +
> +In our example, an ICMP echo request has four parts:
> +
> +* 14 bytes mac header
> +* 20 bytes ip header
> +* 8 bytes icmp header
> +* 56 bytes data (default value of the ping command)
> +
> +So IpExtInOctets value is 20+8+56=84. The IpExtOutOctets is similar.
> +
> +IpExtInNoECTPkts
> +----------------
> +We could find IpExtInNoECTPkts in the nstat output, but the kernel
> +provides four similar counters, so we explain them together. They are:
> +
> +* IpExtInNoECTPkts
> +* IpExtInECT1Pkts
> +* IpExtInECT0Pkts
> +* IpExtInCEPkts
> +
> +They indicate four kinds of ECN IP packets, which are defined here:
> +
> +https://tools.ietf.org/html/rfc3168#page-6
> +
> +These 4 counters count how many packets were received per ECN
> +status. They count the real frame number regardless of LRO/GRO. So
> +for the same packet, you might find that IpInReceives counts 1, but
> +IpExtInNoECTPkts counts 2 or more.
> +
> +additional explanation
> +----------------------
> +The ip layer counters are recorded by the ip layer code in the kernel.
> +That is, if you send a packet to a lower layer directly, the Linux
> +kernel won't record it. For example, tcpreplay will open an
> +AF_PACKET socket, and send the packet to layer 2. Although it could send
> +an IP packet, you can't find it in the nstat output. Here is an
> +example:
> +
> +We capture the ping packet by tcpdump::
> +
> +  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/ping.pcap dst 8.8.8.8
> +
> +Then run ping command::
> +
> +  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
> +
> +Terminate tcpdump by Ctrl-C, and run 'nstat -n' to update the nstat
> +history.
> +Then run tcpreplay::
> +
> +  nstatuser@nstat-a:~$ sudo tcpreplay --intf1=ens3 /tmp/ping.pcap
> +  Actual: 1 packets (98 bytes) sent in 0.000278 seconds
> +  Rated: 352517.9 Bps, 2.82 Mbps, 3597.12 pps
> +  Flows: 1 flows, 3597.12 fps, 1 flow packets, 0 non-flow
> +  Statistics for network device: ens3
> +          Successful packets:        1
> +          Failed packets:            0
> +          Truncated packets:         0
> +          Retried packets (ENOBUFS): 0
> +          Retried packets (EAGAIN):  0
> +
> +Check the nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IcmpInMsgs                      1                  0.0
> +  IcmpInEchoReps                  1                  0.0
> +  IcmpMsgInType0                  1                  0.0
> +  IpExtInOctets                   84                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +We can see that nstat only shows the received packet, because the IP
> +layer of the kernel only knows about the reply from 8.8.8.8; it doesn't
> +know what tcpreplay sent.
> +
> +At the same time, when you use an AF_INET socket, even if you use the
> +SOCK_RAW option, the IP layer will still try to verify whether the
> +packet is an ICMP packet. If it is, the kernel will still count it in
> +its counters and you can find it in the output of nstat.
> +
> +tcp 3 way handshake
> +===================
> +
> +On server side, we run::
> +
> +  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On client side, we run::
> +
> +  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
> +  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
> +
> +The server listened on tcp port 9000, the client connected to it, and
> +they completed the 3-way handshake.
> +
> +On server side, we can find below nstat output::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpPassiveOpens                 1                  0.0
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +
> +On client side, we can find below nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i tcp
> +  TcpActiveOpens                  1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      2                  0.0
> +
> +Except for TcpExtTCPPureAcks, all other counters are defined by rfc1213.
> +
> +* TcpActiveOpens
> +The number of times TCP connections have made a direct transition to
> +the SYN-SENT state from the CLOSED state.
> +
> +https://tools.ietf.org/html/rfc1213#page-47
> +
> +* TcpPassiveOpens
> +The number of times TCP connections have made a direct transition to
> +the SYN-RCVD state from the LISTEN state.
> +
> +https://tools.ietf.org/html/rfc1213#page-47
> +
> +* TcpInSegs
> +The total number of segments received, including those received in
> +error. This count includes segments received on currently established
> +connections.
> +
> +https://tools.ietf.org/html/rfc1213#page-48
> +
> +* TcpOutSegs
> +The total number of segments sent, including those on current
> +connections but excluding those containing only retransmitted octets.
> +
> +https://tools.ietf.org/html/rfc1213#page-48
> +
> +
> +The TcpExtTCPPureAcks is an extension in linux kernel. When the kernel
> +receives a TCP packet which has the ACK flag set and carries no data,
> +either TcpExtTCPPureAcks or TcpExtTCPHPAcks will increase by 1. We will
> +discuss it in a later section.
> +
> +Now we can easily explain the nstat outputs on the server side and client
> +side.
> +
> +When the server received the first syn, it replied with a syn+ack, and
> +came into the SYN-RCVD state, so TcpPassiveOpens increased by 1. The
> +server received syn, sent syn+ack, received ack, so the server sent 1
> +packet, received 2 packets, TcpInSegs increased by 2, TcpOutSegs
> +increased by 1. The last ack of the 3-way handshake is a pure ack
> +without data, so TcpExtTCPPureAcks increased by 1.
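
The bookkeeping behind the two nstat outputs above can be sketched as a toy
tally over the three handshake segments. This is only a model written for
this note, not how the kernel is structured: the function name, the tuple
layout and the flag strings are all hypothetical, and a pure ack is always
booked as TcpExtTCPPureAcks here although the text says it could also land
in TcpExtTCPHPAcks.

```python
from collections import Counter

# (sender, flags, payload length) for the three handshake segments
HANDSHAKE = [
    ("client", "SYN", 0),
    ("server", "SYN+ACK", 0),
    ("client", "ACK", 0),
]


def tally(segments):
    """Return per-host counters for a list of exchanged segments."""
    stats = {"client": Counter(), "server": Counter()}
    for sender, flags, payload in segments:
        receiver = "server" if sender == "client" else "client"
        stats[sender]["TcpOutSegs"] += 1
        stats[receiver]["TcpInSegs"] += 1
        if flags == "SYN":
            stats[sender]["TcpActiveOpens"] += 1      # CLOSED -> SYN-SENT
            stats[receiver]["TcpPassiveOpens"] += 1   # LISTEN -> SYN-RCVD
        elif flags == "ACK" and payload == 0:
            stats[receiver]["TcpExtTCPPureAcks"] += 1  # ack without data
    return stats


if __name__ == "__main__":
    for host, counters in tally(HANDSHAKE).items():
        print(host, dict(counters))
```

With the handshake as input, the server ends up with one passive open, two
segments in, one out and one pure ack, and the client with one active open,
one segment in and two out, which is what the nstat outputs show.
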
> +
> +When the client sent syn, the client came into the SYN-SENT state, so
> +TcpActiveOpens increased by 1. The client sent syn, received syn+ack,
> +sent ack, so the client sent 2 packets, received 1 packet, TcpInSegs
> +increased by 1, TcpOutSegs increased by 2.
> +
> +Note: about TcpInSegs and TcpOutSegs, rfc1213 doesn't define the
> +behaviors when gso/gro/tso are enabled on the NIC (network interface
> +card). On the current linux implementation, TcpOutSegs is aware of
> +gso/tso, but TcpInSegs isn't aware of gro. So TcpOutSegs will count the
> +actual packet number even if only 1 packet is sent via the tcp layer. If
> +multiple packets arrived at a NIC, and they are merged into 1 packet,
> +TcpInSegs will only count 1.
> +
> +tcp disconnect
> +==============
> +
> +Continuing our previous example, on the server side, we have run::
> +
> +  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On client side, we have run::
> +
> +  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
> +  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
> +
> +Now we type Ctrl-C on the client side, to stop the tcp connection
> +between the two nc commands. Then we check the nstat output.
> +
> +On server side::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +
> +On client side::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +
> +Wait for more than 1 minute, then run nstat on the client again::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i tcp
> +  TcpExtTW                        1                  0.0
> +
> +Most of the counters are explained in the previous section except
> +two: TcpExtTCPOrigDataSent and TcpExtTW. Both of them are linux kernel
> +extensions.
> +
> +TcpExtTW means a tcp connection is closed normally via the
> +time wait stage, not via the tcp reuse process.
> +
> +About TcpExtTCPOrigDataSent, the kernel patch below has a good
> +explanation:
> +
> +https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
> +
> +I pasted it here::
> +
> +  TCPOrigDataSent: number of outgoing packets with original data
> +  (excluding retransmission but including data-in-SYN). This counter is
> +  different from TcpOutSegs because TcpOutSegs also tracks pure
> +  ACKs. TCPOrigDataSent is more useful to track the TCP retransmission rate.
> +
> +the effect of gso and gro
> +=========================
> +
> +The Generic Segmentation Offload (GSO) and Generic Receive Offload
> +(GRO) would affect the metrics of the packet in/out on both ip and tcp
> +layer. Here is an iperf example. Before the test, run below command to
> +make sure both gso and gro are enabled on the NIC::
> +
> +  $ sudo ethtool -k ens3 | egrep '(generic-segmentation-offload|generic-receive-offload)'
> +  generic-segmentation-offload: on
> +  generic-receive-offload: on
> +
> +On server side, run::
> +
> +  iperf3 -s -p 9000
> +
> +On client side, run::
> +
> +  iperf3 -c 192.168.122.251 -p 9000 -t 5 -P 10
> +
> +The server listened on tcp port 9000, the client connected to the server,
> +created 10 parallel streams, and ran for 5 seconds. After iperf3 stopped,
> +we ran nstat on both the server and client.
> +
> +On server side::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    36346              0.0
> +  IpInDelivers                    36346              0.0
> +  IpOutRequests                   33836              0.0
> +  TcpPassiveOpens                 11                 0.0
> +  TcpEstabResets                  2                  0.0
> +  TcpInSegs                       36346              0.0
> +  TcpOutSegs                      33836              0.0
> +  TcpOutRsts                      20                 0.0
> +  TcpExtDelayedACKs               26                 0.0
> +  TcpExtTCPHPHits                 32120              0.0
> +  TcpExtTCPPureAcks               16                 0.0
> +  TcpExtTCPHPAcks                 5                  0.0
> +  TcpExtTCPAbortOnData            5                  0.0
> +  TcpExtTCPAbortOnClose           2                  0.0
> +  TcpExtTCPRcvCoalesce            7306               0.0
> +  TcpExtTCPOFOQueue               1354               0.0
> +  TcpExtTCPOrigDataSent           15                 0.0
> +  IpExtInOctets                   311732432          0.0
> +  IpExtOutOctets                  1785119            0.0
> +  IpExtInNoECTPkts                214032             0.0
> +
> +Client side::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    33836              0.0
> +  IpInDelivers                    33836              0.0
> +  IpOutRequests                   43786              0.0
> +  TcpActiveOpens                  11                 0.0
> +  TcpEstabResets                  10                 0.0
> +  TcpInSegs                       33836              0.0
> +  TcpOutSegs                      214072             0.0
> +  TcpRetransSegs                  3876               0.0
> +  TcpExtDelayedACKs               7                  0.0
> +  TcpExtTCPHPHits                 5                  0.0
> +  TcpExtTCPPureAcks               2719               0.0
> +  TcpExtTCPHPAcks                 31071              0.0
> +  TcpExtTCPSackRecovery           607                0.0
> +  TcpExtTCPSACKReorder            61                 0.0
> +  TcpExtTCPLostRetransmit         90                 0.0
> +  TcpExtTCPFastRetrans            3806               0.0
> +  TcpExtTCPSlowStartRetrans       62                 0.0
> +  TcpExtTCPLossProbes             38                 0.0
> +  TcpExtTCPSackRecoveryFail       8                  0.0
> +  TcpExtTCPSackShifted            203                0.0
> +  TcpExtTCPSackMerged             778                0.0
> +  TcpExtTCPSackShiftFallback      700                0.0
> +  TcpExtTCPSpuriousRtxHostQueues  4                  0.0
> +  TcpExtTCPAutoCorking            14                 0.0
> +  TcpExtTCPOrigDataSent           214038             0.0
> +  TcpExtTCPHystartTrainDetect     8                  0.0
> +  TcpExtTCPHystartTrainCwnd       172                0.0
> +  IpExtInOctets                   1785227            0.0
> +  IpExtOutOctets                  317789680          0.0
> +  IpExtInNoECTPkts                33836              0.0
> +
> +The TcpOutSegs and IpOutRequests on the server are 33836, exactly the
> +same as IpExtInNoECTPkts, IpInReceives, IpInDelivers and TcpInSegs on
> +the client side. During the iperf3 test, the server only replies with
> +very short packets, so gso and gro have no effect on the server's reply.
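
A few ratios make the gso/gro effect in the two outputs above easier to
see. The sketch below only redoes arithmetic on the numbers already shown;
the dictionary and function names are invented for this note, and reading
TcpRetransSegs / TcpExtTCPOrigDataSent as "the retransmission rate" is one
plausible interpretation of the commit message quoted earlier, not a
definition taken from the kernel.

```python
# Counter values copied from the nstat outputs above.
CLIENT = {
    "IpOutRequests": 43786,
    "TcpOutSegs": 214072,
    "TcpRetransSegs": 3876,
    "TcpExtTCPOrigDataSent": 214038,
}
SERVER = {
    "IpInReceives": 36346,
    "IpExtInNoECTPkts": 214032,
}


def segments_per_ip_request(client):
    """Average number of tcp segments carried by one ip-layer send (gso)."""
    return client["TcpOutSegs"] / client["IpOutRequests"]


def frames_per_received_packet(server):
    """Average number of wire frames merged into one received packet (gro)."""
    return server["IpExtInNoECTPkts"] / server["IpInReceives"]


def retransmission_rate(client):
    """Retransmitted segments relative to segments carrying original data."""
    return client["TcpRetransSegs"] / client["TcpExtTCPOrigDataSent"]


if __name__ == "__main__":
    print(round(segments_per_ip_request(CLIENT), 2))
    print(round(frames_per_received_packet(SERVER), 2))
    print(round(100 * retransmission_rate(CLIENT), 2), "%")
    print(CLIENT["TcpOutSegs"] - SERVER["IpExtInNoECTPkts"])
```

In this run each ip-layer send on the client carried close to five
segments, each packet delivered to the server's ip layer stood for close to
six frames, and the client's TcpOutSegs exceeds the server's
IpExtInNoECTPkts by 40.
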
> +
> +On the client side, TcpOutSegs is 214072, IpOutRequests is 43786, the
> +tcp layer packet out is larger than ip layer packet out, because
> +TcpOutSegs counts the packet number after gso, but IpOutRequests
> +doesn't. On the server side, IpExtInNoECTPkts is 214032, this number
> +is a little smaller than the TcpOutSegs on the client side (214072),
> +which might be caused by packet loss. The IpInReceives, IpInDelivers and
> +TcpInSegs are obviously smaller than the TcpOutSegs on the client side,
> +because these counters count the packets after gro.
> +
> +tcp counters in established state
> +=================================
> +
> +Run nc on server::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +Run nc on client::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +
> +Input a string in the nc client ('hello' in our example)::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +  hello
> +
> +The client side nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +  IpExtInOctets                   52                 0.0
> +  IpExtOutOctets                  58                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +The server side nstat output::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  IpExtInOctets                   58                 0.0
> +  IpExtOutOctets                  52                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +Input a string in nc client side again ('world' in our example)::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +  hello
> +  world
> +
> +Client side nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPHPAcks                 1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +  IpExtInOctets                   52                 0.0
> +  IpExtOutOctets                  58                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +
> +Server side nstat output::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPHPHits                 1                  0.0
> +  IpExtInOctets                   58                 0.0
> +  IpExtOutOctets                  52                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +Comparing the first client side output and the second client side
> +output, we could find one difference: the first one had a
> +'TcpExtTCPPureAcks', but the second one had a
> +'TcpExtTCPHPAcks'. The first server side output and the second server
> +side output had a difference too: the second server side output had a
> +TcpExtTCPHPHits, but the first server side output didn't have it. The
> +network traffic patterns were exactly the same: the client sent a packet
> +to the server, the server replied with an ack. But the kernel handled
> +them in different ways. When the kernel receives a tcp packet in the
> +established state, it has two paths to handle the packet, one is the
> +fast path, another is the slow path. The comment in the kernel code
> +provides a good explanation of them, I paste it below::
> +
> +  It is split into a fast path and a slow path. The fast path is
> +  disabled when:
> +  - A zero window was announced from us - zero window probing
> +    is only handled properly on the slow path.
> +  - Out of order segments arrived.
> +  - Urgent data is expected.
> +  - There is no buffer space left
> +  - Unexpected TCP flags/window values/header lengths are received
> +    (detected by checking the TCP header against pred_flags)
> +  - Data is sent in both directions.
> +    The fast path only supports pure senders
> +    or pure receivers (this means either the sequence number or the ack
> +    value must stay constant)
> +  - Unexpected TCP option.
> +
> +The kernel will try to use the fast path unless any of the above
> +conditions are satisfied. If the packets are out of order, the kernel
> +will handle them in the slow path, which means the performance might not
> +be very good. The kernel would also come into the slow path if the
> +"Delayed ack" is used, because when using "Delayed ack", the data is
> +sent in both directions. When the tcp window scale option is not used,
> +the kernel will try to enable the fast path immediately when the
> +connection comes into the established state, but if the tcp window scale
> +option is used, the kernel will disable the fast path at first, and try
> +to enable it after the kernel receives packets. We could use the 'ss'
> +command to verify whether the window scale option is used, e.g. run the
> +command below on either server or client::
> +
> +  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
> +  Netid    Recv-Q   Send-Q    Local Address:Port    Peer Address:Port
> +  tcp      0        0         192.168.122.250:40654 192.168.122.251:9000
> +       ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98
> +
> +The 'wscale:7,7' means both server and client set the window scale
> +option to 7. Now we could explain the nstat output in our test:
> +
> +In the first nstat output of client side, the client sent a packet, the
> +server replied with an ack. When the kernel handled this ack, the fast
> +path was not enabled, so the ack was counted into 'TcpExtTCPPureAcks'.
> +In the second nstat output of client side, the client sent a packet again,
> +and received another ack from the server. This time, the fast path was
> +enabled, and the ack qualified for the fast path, so it was handled by
> +the fast path and counted into TcpExtTCPHPAcks.
> +In the first nstat output of server side, the fast path was not enabled,
> +so there was no 'TcpExtTCPHPHits'.
> +In the second nstat output of server side, the fast path was enabled,
> +and the packet received from the client qualified for the fast path, so
> +it was counted into 'TcpExtTCPHPHits'.
> +
> +tcp abort
> +=========
> +
> +Some counters indicate the reasons why the tcp layer wants to send a
> +rst, they are:
> +
> +* TcpExtTCPAbortOnData
> +* TcpExtTCPAbortOnClose
> +* TcpExtTCPAbortOnMemory
> +* TcpExtTCPAbortOnTimeout
> +* TcpExtTCPAbortOnLinger
> +* TcpExtTCPAbortFailed
> +
> +TcpExtTCPAbortOnData
> +--------------------
> +
> +It means the tcp layer has data in flight, but needs to close the
> +connection. So the tcp layer sends a rst to the other side, indicating
> +the connection is not closed gracefully. An easy way to increase this
> +counter is using the SO_LINGER option. Please refer to the SO_LINGER
> +section of the socket man page:
> +
> +http://man7.org/linux/man-pages/man7/socket.7.html
> +
> +By default, when an application closes a connection, the close function
> +will return immediately and the kernel will try to send the in-flight
> +data asynchronously. If you use the SO_LINGER option, set l_onoff to 1,
> +and l_linger to a positive number, the close function won't return
> +immediately, but waits until the in-flight data is acked by the other
> +side; the max wait time is l_linger seconds. If you set l_onoff to 1 and
> +l_linger to 0, when the application closes a connection, the kernel will
> +send a rst immediately, and increase the TcpExtTCPAbortOnData counter.
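
The three SO_LINGER cases described above map onto the two fields of
struct linger. As a small sketch, assuming the usual Linux layout of two C
ints (l_onoff, l_linger), the option value can be built and classified as
follows. The helper names and the short labels they return are made up for
this note.

```python
import socket
import struct


def linger_value(onoff, seconds):
    """Pack a struct linger: two C ints, l_onoff followed by l_linger."""
    return struct.pack("ii", 1 if onoff else 0, seconds)


def close_behaviour(onoff, seconds):
    """Label what close() does for a given setting, following the text above."""
    if not onoff:
        return "background"   # close() returns at once, kernel sends the rest
    if seconds > 0:
        return "wait"         # close() blocks up to `seconds` for the acks
    return "rst"              # kernel resets the connection immediately


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, linger_value(True, 0))
    # Read the option back to confirm what the kernel stored.
    raw = s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                       struct.calcsize("ii"))
    print(struct.unpack("ii", raw), close_behaviour(True, 0))
    s.close()
```

linger_value(True, 0) produces the same bytes as the struct.pack('ii', 1, 0)
call in the client script of the example that follows.
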
> +
> +We run nc on the server side::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +Run below python code on the client side::
> +
> +  import socket
> +  import struct
> +
> +  server = 'nstat-b' # server address
> +  port = 9000
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
> +  s.connect((server, port))
> +  s.close()
> +
> +On client side, we could see TcpExtTCPAbortOnData increased::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnData            1                  0.0
> +
> +If we capture packets by tcpdump, we could see the client sends a rst
> +instead of a fin.
> +
> +
> +TcpExtTCPAbortOnClose
> +---------------------
> +
> +This counter means the tcp layer has unread data when an application
> +wants to close a connection.
> +
> +On the server side, we run below python script::
> +
> +  import socket
> +  import time
> +
> +  port = 9000
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(1)
> +  sock, addr = s.accept()
> +  while True:
> +      time.sleep(9999999)
> +
> +This python script listens on port 9000, but doesn't read anything from
> +the connection.
> +
> +On the client side, we send the string "hello" by nc::
> +
> +  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000
> +
> +Then, we come back to the server side. The server has received the
> +"hello" packet, and the tcp layer has acked this packet, but the
> +application didn't read it yet. We type Ctrl-C to terminate the server
> +script. Then we could find TcpExtTCPAbortOnClose increased by 1 on the
> +server side::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnClose           1                  0.0
> +
> +If we run tcpdump on the server side, we could find the server sent a
> +rst after we type Ctrl-C.
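
The close-with-unread-data case can also be tried on a single machine over
loopback. The following is a rough sketch rather than part of the test
setup above: it assumes a Linux host where loopback tcp connections are
allowed, the function name is invented, and it reports only what the peer
observes (a reset or a normal close), not the counter itself.

```python
import socket


def close_with_unread_data():
    """Close an accepted socket that still holds unread data.

    Returns 'reset' if the peer sees a rst, 'fin' if it sees a normal close.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))     # port 0: let the kernel pick a port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=5)
    conn, _ = listener.accept()
    conn.settimeout(5)
    try:
        client.sendall(b"hello")
        # MSG_PEEK waits until the data has arrived but leaves it queued,
        # so the application still has not read it when close() is called.
        conn.recv(1, socket.MSG_PEEK)
        conn.close()
        try:
            data = client.recv(16)
        except ConnectionResetError:
            return "reset"
        return "fin" if data == b"" else "data"
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(close_with_unread_data())
```

Running 'nstat | grep -i abort' before and after the script should show
whether TcpExtTCPAbortOnClose moved on the host.
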
> +
> +TcpExtTCPAbortOnMemory
> +----------------------
> +
> +When an application closes a tcp connection, the kernel still needs to
> +track the connection and let it complete the tcp disconnect process.
> +E.g. an app calls the close method of a socket, the kernel sends a fin
> +to the other side of the connection, then the app has no relationship
> +with the socket any more, but the kernel needs to keep the socket. This
> +socket becomes an orphan socket, the kernel waits for the reply of the
> +other side, and would come to the TIME_WAIT state finally. When the
> +kernel doesn't have enough memory to keep the orphan socket, it would
> +send a rst to the other side and delete the socket. In such a situation,
> +the kernel will increase TcpExtTCPAbortOnMemory by 1. Two conditions
> +would trigger TcpExtTCPAbortOnMemory:
> +
> +* the memory used by the tcp protocol is higher than the third value of
> +the tcp_mem. Please refer to the tcp_mem section in the tcp man page:
> +
> +http://man7.org/linux/man-pages/man7/tcp.7.html
> +
> +* the orphan socket count is higher than net.ipv4.tcp_max_orphans
> +
> +Below is an example which lets the orphan socket count be higher than
> +net.ipv4.tcp_max_orphans.
> +
> +Change tcp_max_orphans to a smaller value on client::
> +
> +  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"
> +
> +Client code (create 64 connections to server)::
> +
> +  nstatuser@nstat-a:~$ cat client_orphan.py
> +  import socket
> +  import time
> +
> +  server = 'nstat-b' # server address
> +  port = 9000
> +
> +  count = 64
> +
> +  connection_list = []
> +
> +  for i in range(count):
> +      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +      s.connect((server, port))
> +      connection_list.append(s)
> +      print("connection_count: %d" % len(connection_list))
> +
> +  while True:
> +      time.sleep(99999)
> +
> +Server code (accept 64 connections from client)::
> +
> +  nstatuser@nstat-b:~$ cat server_orphan.py
> +  import socket
> +  import time
> +
> +  port = 9000
> +  count = 64
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(count)
> +  connection_list = []
> +  while True:
> +      sock, addr = s.accept()
> +      connection_list.append((sock, addr))
> +      print("connection_count: %d" % len(connection_list))
> +
> +Run the python scripts on server and client.
> +
> +On server::
> +
> +  python3 server_orphan.py
> +
> +On client::
> +
> +  python3 client_orphan.py
> +
> +Run iptables on server::
> +
> +  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP
> +
> +Type Ctrl-C on client, stop client_orphan.py.
> +
> +Check TcpExtTCPAbortOnMemory on client::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnMemory          54                 0.0
> +
> +Check orphan socket count on client::
> +
> +  nstatuser@nstat-a:~$ ss -s
> +  Total: 131 (kernel 0)
> +  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0
> +
> +  Transport Total     IP        IPv6
> +  *         0         -         -
> +  RAW       1         0         1
> +  UDP       1         1         0
> +  TCP       14        13        1
> +  INET      16        14        2
> +  FRAG      0         0         0
> +
> +The explanation of the test: after running server_orphan.py and
> +client_orphan.py, we set up 64 connections between server and
> +client.
> +After running the iptables command, the server drops all packets from
> +the client. Typing Ctrl-C on client_orphan.py makes the client system
> +try to close these connections, and before they are closed
> +gracefully, these connections become orphan sockets. As the iptables
> +of the server blocked packets from the client, the server won't receive
> +fin from the client, so all connections on the client would be stuck in
> +the FIN_WAIT_1 stage, and they will stay orphan sockets until timeout.
> +We have echoed 10 to /proc/sys/net/ipv4/tcp_max_orphans, so the client
> +system would only keep 10 orphan sockets; for all other orphan sockets,
> +the client system sent rst for them and deleted them. We have 64
> +connections, so the 'ss -s' command shows the system has 10 orphan
> +sockets, and the value of TcpExtTCPAbortOnMemory was 54.
> +
> +An additional explanation about orphan socket count: You could find the
> +exact orphan socket count by the 'ss -s' command, but when the kernel
> +decides whether to increase TcpExtTCPAbortOnMemory and send a rst, it
> +doesn't always check the exact orphan socket count. To improve
> +performance, the kernel checks an approximate count first; if the
> +approximate count is more than tcp_max_orphans, the kernel checks the
> +exact count again. So if the approximate count is less than
> +tcp_max_orphans, but the exact count is more than tcp_max_orphans, you
> +would find TcpExtTCPAbortOnMemory is not increased at all. If
> +tcp_max_orphans is large enough, it won't occur, but if you decrease
> +tcp_max_orphans to a small value like our test, you might find this
> +issue. So in our test, the client set up 64 connections although the
> +tcp_max_orphans is 10. If the client only set up 11 connections, we
> +can't find the change of TcpExtTCPAbortOnMemory.
> +
> +TcpExtTCPAbortOnTimeout
> +-----------------------
> +This counter will increase when any of the tcp timers expire.
In this > +situation, kernel won't send rst, just give up the connection. > +Continue the previous test, we wait for several minutes, because the > +iptables on the server blocked the traffic, the server wouldn't receive > +fin, and all the client's orphan sockets would timeout on the > +FIN_WAIT_1 state finally. So we wait for a few minutes, we could find > +10 timeout on the client:: > + > + nstatuser@nstat-a:~$ nstat | grep -i abort > + TcpExtTCPAbortOnTimeout 10 0.0 > + > +TcpExtTCPAbortOnLinger > +--------------------- > +When a tcp connection comes into FIN_WAIT_2 state, instead of waiting > +for the fin packet from the other side, kernel could send a rst and > +delete the socket immediately. This is not the default behavior of > +linux kernel tcp stack, but after configuring socket option, you could > +let kernel follow this behavior. Below is an example. > + > +The server side code:: > + > + nstatuser@nstat-b:~$ cat server_linger.py > + import socket > + import time > + > + port = 9000 > + > + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) > + s.bind(('0.0.0.0', port)) > + s.listen(1) > + sock, addr = s.accept() > + while True: > + time.sleep(9999999) > + > +The client side code:: > + > + nstatuser@nstat-a:~$ cat client_linger.py > + import socket > + import struct > + > + server = 'nstat-b' # server address > + port = 9000 > + > + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) > + s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10)) > + s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1)) > + s.connect((server, port)) > + s.close() > + > +Run server_linger.py on server:: > + > + nstatuser@nstat-b:~$ python3 server_linger.py > + > +Run client_linger.py on client:: > + > + nstatuser@nstat-a:~$ python3 client_linger.py > + > +After run client_linger.py, check the output of nstat:: > + > + nstatuser@nstat-a:~$ nstat | grep -i abort > + TcpExtTCPAbortOnLinger 1 0.0 > + > +TcpExtTCPAbortFailed > 
> +--------------------
> +The kernel tcp layer sends a rst if the condition in RFC 2525
> +section 2.17 is satisfied:
> +
> +https://tools.ietf.org/html/rfc2525#page-50
> +
> +If an internal error occurs during this process, TcpExtTCPAbortFailed
> +is increased.
> +
> +TcpExtListenOverflows and TcpExtListenDrops
> +===========================================
> +When the kernel receives a syn from a client and the tcp accept queue
> +is full, the kernel drops the syn and adds 1 to TcpExtListenOverflows.
> +At the same time, the kernel also adds 1 to TcpExtListenDrops.
> +Whenever a tcp socket is in the LISTEN state and the kernel needs to
> +drop a packet, the kernel adds 1 to TcpExtListenDrops. So an increase
> +of TcpExtListenOverflows always increases TcpExtListenDrops as well,
> +but TcpExtListenDrops can also increase without TcpExtListenOverflows
> +increasing, e.g. a memory allocation failure also increases
> +TcpExtListenDrops.
> +
> +Note: The above explanation is based on kernel 4.15 or later. On an
> +older kernel, the tcp stack behaves differently when the tcp accept
> +queue is full: it doesn't drop the syn but completes the 3-way
> +handshake, and because the accept queue is full, it keeps the socket
> +in the tcp half-open queue. While the socket is in the half-open
> +queue, the tcp stack resends syn+ack on an exponential backoff timer.
> +After the client replies with an ack, the tcp stack checks whether
> +the accept queue is still full. If it is not full, the socket is
> +moved to the accept queue; if it is full, the socket stays in the
> +half-open queue, and the next time the client replies with an ack,
> +the socket gets another chance to move to the accept queue.
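The relationship between the two counters (every overflow is also counted as a drop, while other drop reasons only increase TcpExtListenDrops) could be checked with a short script. The following is only a minimal sketch: the helper names `parse_netstat` and `listen_drop_breakdown` are hypothetical, the sample text is an abbreviated, made-up excerpt in the layout of /proc/net/netstat, and joining the group prefix with the field name is assumed to reproduce the names that nstat prints.

```python
# Abbreviated, made-up samples in the layout of /proc/net/netstat:
# each group is a header line of field names followed by a line of values.
SAMPLE_BEFORE = """\
TcpExt: SyncookiesSent ListenOverflows ListenDrops
TcpExt: 0 0 0
IpExt: InNoRoutes InOctets
IpExt: 0 1000
"""

SAMPLE_AFTER = """\
TcpExt: SyncookiesSent ListenOverflows ListenDrops
TcpExt: 0 4 5
IpExt: InNoRoutes InOctets
IpExt: 0 1240
"""


def parse_netstat(text):
    """Return {'TcpExtListenDrops': int, ...} from netstat-style text."""
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, nums = values.split(":", 1)
        if prefix != value_prefix:
            raise ValueError("header/value lines do not match: %r" % prefix)
        for name, num in zip(names.split(), nums.split()):
            # Assumption: prefix + field name mirrors the nstat counter name.
            counters[prefix + name] = int(num)
    return counters


def listen_drop_breakdown(before, after):
    """Split the ListenDrops delta into overflow drops and other drops."""
    overflows = (after.get("TcpExtListenOverflows", 0)
                 - before.get("TcpExtListenOverflows", 0))
    drops = (after.get("TcpExtListenDrops", 0)
             - before.get("TcpExtListenDrops", 0))
    # Every overflow also counts as a drop, so the remainder is drops
    # for other reasons (e.g. a memory allocation failure).
    return {"overflows": overflows, "drops": drops,
            "other_drops": drops - overflows}


before = parse_netstat(SAMPLE_BEFORE)
after = parse_netstat(SAMPLE_AFTER)
print(listen_drop_breakdown(before, after))
```

On a real Linux machine, the two text snapshots could instead be read from /proc/net/netstat before and after the test, which gives the same kind of delta that 'nstat -n' followed by 'nstat' reports.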
> +
> +Here is an example:
> +
> +On the server, run the nc command, listening on port 9000::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On the client, run 3 nc commands in different terminals::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +
> +The nc command accepts only 1 connection, and its accept queue length
> +is 1. In the current linux implementation, setting the queue length
> +to n means the actual queue length is n+1. Now we have created 3
> +connections: 1 is accepted by nc and 2 are in the accept queue, so
> +the accept queue is full.
> +
> +Before running the 4th nc, we clean the nstat history on the server::
> +
> +  nstatuser@nstat-b:~$ nstat -n
> +
> +Run the 4th nc on the client::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +
> +If the nc server is running on kernel 4.15 or later, you won't see
> +the "Connection to ... succeeded!" string, because the kernel drops
> +the syn if the accept queue is full. If the nc server is running on
> +an old kernel, you will see that the connection succeeds, because
> +the kernel completes the 3-way handshake and keeps the socket in the
> +half-open queue.
> +
> +Our test is on kernel 4.15. Run nstat on the server::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    4                  0.0
> +  IpInDelivers                    4                  0.0
> +  TcpInSegs                       4                  0.0
> +  TcpExtListenOverflows           4                  0.0
> +  TcpExtListenDrops               4                  0.0
> +  IpExtInOctets                   240                0.0
> +  IpExtInNoECTPkts                4                  0.0
> +
> +We can see that both TcpExtListenOverflows and TcpExtListenDrops are
> +4. If the time between the 4th nc and the nstat is longer, the values
> +of TcpExtListenOverflows and TcpExtListenDrops will be larger, because
> +the syn of the 4th nc is dropped and the client keeps retrying.
> +
> --
> 2.17.1
>