Re: [PATCH] add an initial version of snmp_counter.rst

Cong Wang Fri, 09 Nov 2018 18:21:03 -0800

(Cc Randy)

On Fri, Nov 9, 2018 at 10:13 AM yupeng <yupeng0...@gmail.com> wrote:
>
> The snmp_counter.rst run a set of simple experiments, explains the
> meaning of snmp counters depend on the experiments' results. This is
> an initial version, only covers a small part of the snmp counters.



I don't look into much details, so just a few high-level reviews:

1. Please try to group those counters by protocol, it would be easier
to search.

2. For many counters you provide a link to RFC, do you just copy
and paste them? Please try to expand.

3. _I think_ you don't need to show, for example, how to run a ping
command. It's safe to assume readers already know this. Therefore,
just explaining those counters is okay.


Thanks.

>
> Signed-off-by: yupeng <yupeng0...@gmail.com>
> ---
>  Documentation/networking/index.rst        |   1 +
>  Documentation/networking/snmp_counter.rst | 963 ++++++++++++++++++++++
>  2 files changed, 964 insertions(+)
>  create mode 100644 Documentation/networking/snmp_counter.rst
>
> diff --git a/Documentation/networking/index.rst 
> b/Documentation/networking/index.rst
> index bd89dae8d578..6a47629ef8ed 100644
> --- a/Documentation/networking/index.rst
> +++ b/Documentation/networking/index.rst
> @@ -31,6 +31,7 @@ Contents:
>     net_failover
>     alias
>     bridge
> +   snmp_counter
>
>  .. only::  subproject
>
> diff --git a/Documentation/networking/snmp_counter.rst 
> b/Documentation/networking/snmp_counter.rst
> new file mode 100644
> index 000000000000..2939c5acf675
> --- /dev/null
> +++ b/Documentation/networking/snmp_counter.rst
> @@ -0,0 +1,963 @@
> +====================
> +snmp counter tutorial
> +====================
> +
> +This document explains the meaning of snmp counters. For understanding
> +their meanings better, this document doesn't explain the counters one
> +by one, but creates a set of experiments, and explains the counters
> +depend on the experiments' results. The experiments are on one or two
> +virtual machines. Except for the test commands we use in the experiments,
> +the virtual machines have no other network traffic. We use the 'nstat'
> +command to get the values of snmp counters, before every test, we run
> +'nstat -n' to update the history, so the 'nstat' output would only
> +show the changes of the snmp counters. For more information about
> +nstat, please refer:
> +
> +http://man7.org/linux/man-pages/man8/nstat.8.html
> +
> +icmp ping
> +========
> +
> +Run the ping command against the public dns server 8.8.8.8::
> +
> +  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
> +  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
> +  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms
> +
> +  --- 8.8.8.8 ping statistics ---
> +  1 packets transmitted, 1 received, 0% packet loss, time 0ms
> +  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms
> +
> +The nstayt result::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  IcmpInMsgs                      1                  0.0
> +  IcmpInEchoReps                  1                  0.0
> +  IcmpOutMsgs                     1                  0.0
> +  IcmpOutEchos                    1                  0.0
> +  IcmpMsgInType0                  1                  0.0
> +  IcmpMsgOutType8                 1                  0.0
> +  IpExtInOctets                   84                 0.0
> +  IpExtOutOctets                  84                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +The nstat output could be divided into two part: one with the 'Ext'
> +keyword, another without the 'Ext' keyword. If the counter name
> +doesn't have 'Ext', it is defined by one of snmp rfc, if it has 'Ext',
> +it is a kernel extent counter. Below we explain them one by one.
> +
> +The rfc defined counters
> +----------------------
> +
> +* IpInReceives
> +The total number of input datagrams received from interfaces,
> +including those received in error.
> +
> +https://tools.ietf.org/html/rfc1213#page-26
> +
> +* IpInDelivers
> +The total number of input datagrams successfully delivered to IP
> +user-protocols (including ICMP).
> +
> +https://tools.ietf.org/html/rfc1213#page-28
> +
> +* IpOutRequests
> +The total number of IP datagrams which local IP user-protocols
> +(including ICMP) supplied to IP in requests for transmission.  Note
> +that this counter does not include any datagrams counted in
> +ipForwDatagrams.
> +
> +https://tools.ietf.org/html/rfc1213#page-28
> +
> +* IcmpInMsgs
> +The total number of ICMP messages which the entity received.  Note
> +that this counter includes all those counted by icmpInErrors.
> +
> +https://tools.ietf.org/html/rfc1213#page-41
> +
> +* IcmpInEchoReps
> +The number of ICMP Echo Reply messages received.
> +
> +https://tools.ietf.org/html/rfc1213#page-42
> +
> +* IcmpOutMsgs
> +The total number of ICMP messages which this entity attempted to send.
> +Note that this counter includes all those counted by icmpOutErrors.
> +
> +https://tools.ietf.org/html/rfc1213#page-43
> +
> +* IcmpOutEchos
> +The number of ICMP Echo (request) messages sent.
> +
> +https://tools.ietf.org/html/rfc1213#page-45
> +
> +IcmpMsgInType0 and IcmpMsgOutType8 are not defined by any snmp related
> +RFCs, but their meaning are quite straightforward, they count the
> +packet number of specific icmp packet types. We could find the icmp
> +types here:
> +
> +https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml
> +
> +Type 8 is echo, type 0 is echo reply.
> +
> +Until now, we can easily explain these items of the nstat: We sent an
> +icmp echo request, so IpOutRequests, IcmpOutMsgs, IcmpOutEchos and
> +IcmpMsgOutType8 were increased 1. We got icmp echo reply from 8.8.8.8,
> +so IpInReceives, IcmpInMsgs, IcmpInEchoReps, IcmpMsgInType0 were
> +increased 1. The icmp echo reply was passed to icmp layer via ip
> +layer, so IpInDelivers was increased 1.
> +
> +Please note, these metrics don't aware LRO/GRO, e.g., IpOutRequests
> +might count 1 packet, but hardware splits it to 2, and sends them
> +separately.
> +
> +IpExtInOctets and IpExtOutOctets
> +------------------------------
> +They are linux kernel extensions, no rfc definitions. Please note,
> +rfc1213 indeed defines ifInOctets  and ifOutOctets, but they
> +are different things. The ifInOctets and ifOutOctets are packets
> +size which includes the mac layer. But IpExtInOctets and IpExtOutOctets
> +are only ip layer sizes.
> +
> +In our example, an ICMP echo request has four parts:
> +* 14 bytes mac header
> +* 20 bytes ip header
> +* 16 bytes icmp header
> +* 48 bytes data (default value of the ping command)
> +
> +So IpExtInOctets value is 20+16+48=84. The IpExtOutOctets is similar.
> +
> +IpExtInNoECTPkts
> +---------------
> +We could find IpExtInNoECTPkts in the nstat output, but kernel provide
> +four similar counters, we explain them together, they are:
> +* IpExtInNoECTPkts
> +* IpExtInECT1Pkts
> +* IpExtInECT0Pkts
> +* IpExtInCEPkts
> +
> +They indicate four kinds of ECN IP packets, they are defined here:
> +
> +https://tools.ietf.org/html/rfc3168#page-6
> +
> +These 4 counters calculate how many packets received per ECN
> +status. They count the real frame number regardless the LRO/GRO. So
> +for the same packet, you might find that IpInReceives count 1, but
> +IpExtInNoECTPkts counts 2 or more.
> +
> +additional explain
> +-----------------
> +The ip layer counters are recorded by the ip layer code in the kernel. I 
> mean, if you send a packet to a lower layer directly, Linux
> +kernel won't record it. For example, tcpreplay will open an
> +AF_PACKET socket, and send the packet to layer 2, although it could send
> +an IP packet, you can't find it from the nstat output. Here is an
> +example:
> +
> +We capture the ping packet by tcpdump::
> +
> +  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/ping.pcap dst 8.8.8.8
> +
> +Then run ping command::
> +
> +  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
> +
> +Terminate tcpdump by Ctrl-C, and run 'nstat -n' to update the nstat
> +history. Then run tcpreplay::
> +
> +  nstatuser@nstat-a:~$ sudo tcpreplay --intf1=ens3 /tmp/ping.pcap
> +  Actual: 1 packets (98 bytes) sent in 0.000278 seconds
> +  Rated: 352517.9 Bps, 2.82 Mbps, 3597.12 pps
> +  Flows: 1 flows, 3597.12 fps, 1 flow packets, 0 non-flow
> +  Statistics for network device: ens3
> +          Successful packets:        1
> +          Failed packets:            0
> +          Truncated packets:         0
> +          Retried packets (ENOBUFS): 0
> +          Retried packets (EAGAIN):  0
> +
> +Check the nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IcmpInMsgs                      1                  0.0
> +  IcmpInEchoReps                  1                  0.0
> +  IcmpMsgInType0                  1                  0.0
> +  IpExtInOctets                   84                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +We can see, nstat only show the received packet, because the IP layer
> +of kernel only know the reply of 8.8.8.8, it doesn't know what
> +tcpreplay sent.
> +
> +At the same time, when you use AF_INET socket, even you use the
> +SOCK_RAW option, the IP layer will still try to verify whether the
> +packet is an ICMP packet, if it is, kernel will still count it to its
> +counters and you can find it in the output of nstat.
> +
> +tcp 3 way handshake
> +==================
> +
> +On server side, we run::
> +
> +  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On client side, we run::
> +
> +  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
> +  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
> +
> +The server listened on tcp 9000 port, the client connected to it, they
> +completed the 3-way handshake.
> +
> +On server side, we can find below nstat output::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpPassiveOpens                 1                  0.0
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +
> +On client side, we can find below nstat output::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i tcp
> +  TcpActiveOpens                  1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      2                  0.0
> +
> +Except for TcpExtTCPPureAcks, all other counters are defined by rfc1213
> +
> +* TcpActiveOpens
> +The number of times TCP connections have made a direct transition to
> +the SYN-SENT state from the CLOSED state.
> +
> +https://tools.ietf.org/html/rfc1213#page-47
> +
> +* TcpPassiveOpens
> +The number of times TCP connections have made a direct transition to
> +the SYN-RCVD state from the LISTEN state.
> +
> +https://tools.ietf.org/html/rfc1213#page-47
> +
> +* TcpInSegs
> +The total number of segments received, including those received in
> +error.  This count includes segments received on currently established
> +connections.
> +
> +https://tools.ietf.org/html/rfc1213#page-48
> +
> +* TcpOutSegs
> +The total number of segments sent, including those on current
> +connections but excluding those containing only retransmitted octets.
> +
> +https://tools.ietf.org/html/rfc1213#page-48
> +
> +
> +The TcpExtTCPPureAcks is an extension in linux kernel. When kernel
> +receives a TCP packet which set ACK flag and with no data, either
> +TcpExtTCPPureAcks or TcpExtTCPHPAcks will increase 1. We will discuss
> +it in a later section.
> +
> +Now we can easily explain the nstat outputs on the server side and client
> +side.
> +
> +When the server received the first syn, it replied a syn+ack, and came into
> +SYN-RCVD state, so TcpPassiveOpens increased 1. The server received
> +syn, sent syn+ack, received ack, so server sent 1 packet, received 2
> +packets, TcpInSegs increased 2, TcpOutSegs increased 1. The last ack
> +of the 3-way handshake is a pure ack without data, so
> +TcpExtTCPPureAcks increased 1.
> +
> +When the client sent syn, the client came into the SYN-SENT state, so
> +TcpActiveOpens increased 1, client sent syn, received syn+ack, sent
> +ack, so client sent 2 packets, received 1 packet, TcpInSegs increased
> +1, TcpOutSegs increased 2.
> +
> +Note: about TcpInSegs and TcpOutSegs, rfc1213 doesn't define the
> +behaviors when gso/gro/tso are enabled on the NIC (network interface
> +card). On current linux implementation, TcpOutSegs awares gso/tso, but
> +TcpInSegs doesn't aware gro. So TcpOutSegs will count the actual
> +packet number even only 1 packet is sent via tcp layer. If multiple
> +packets arrived at a NIC, and they are merged into 1 packet, TcpInSegs
> +will only count 1.
> +
> +tcp disconnect
> +=============
> +
> +Continue our previous example, on the server side, we have run::
> +
> +  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On client side, we have run::
> +
> +  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
> +  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
> +
> +Now we type Ctrl-C on the client side, stop the tcp connection between the
> +two nc command. Then we check the nstat output.
> +
> +On server side::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +
> +On client side::
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i tcp
> +  TcpInSegs                       2                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +
> +Wait for more than 1 minute, run nstat on client again::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i tcp
> +  TcpExtTW                        1                  0.0
> +
> +Most of the counters are explained in the previous section except
> +two: TcpExtTCPOrigDataSent and TcpExtTW. Both of them are linux kernel
> +extensions.
> +
> +TcpExtTW means a tcp connection is closed normally via
> +time wait stage, not via tcp reuse process.
> +
> +About TcpExtTCPOrigDataSent, Below kernel patch has a good explanation:
> +
> +https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
> +
> +I pasted it here::
> +
> +  TCPOrigDataSent: number of outgoing packets with original data
> +  (excluding retransmission but including data-in-SYN). This counter is
> +  different from TcpOutSegs because TcpOutSegs also tracks pure
> +  ACKs. TCPOrigDataSent is more useful to track the TCP retransmission rate.
> +
> +the effect of gso and gro
> +=======================
> +
> +The Generic Segmentation Offload (GSO) and Generic Receive Offload
> +would affect the metrics of the packet in/out on both ip and tcp
> +layer. Here is an iperf example. Before the test, run below command to
> +make sure both gso and gro are enabled on the NIC::
> +
> +  $ sudo ethtool -k ens3 | egrep 
> '(generic-segmentation-offload|generic-receive-offload)'
> +  generic-segmentation-offload: on
> +  generic-receive-offload: on
> +
> +On server side, run::
> +
> +  iperf3 -s -p 9000
> +
> +On client side, run::
> +
> +  iperf3 -c 192.168.122.251 -p 9000 -t 5 -P 10
> +
> +The server listened on tcp port 9000, the client connected to the server,
> +created 10 threads parallel, run 5 seconds. After the pierf3 stopped, we
> +run nstat on both the server and client.
> +
> +On server side::
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    36346              0.0
> +  IpInDelivers                    36346              0.0
> +  IpOutRequests                   33836              0.0
> +  TcpPassiveOpens                 11                 0.0
> +  TcpEstabResets                  2                  0.0
> +  TcpInSegs                       36346              0.0
> +  TcpOutSegs                      33836              0.0
> +  TcpOutRsts                      20                 0.0
> +  TcpExtDelayedACKs               26                 0.0
> +  TcpExtTCPHPHits                 32120              0.0
> +  TcpExtTCPPureAcks               16                 0.0
> +  TcpExtTCPHPAcks                 5                  0.0
> +  TcpExtTCPAbortOnData            5                  0.0
> +  TcpExtTCPAbortOnClose           2                  0.0
> +  TcpExtTCPRcvCoalesce            7306               0.0
> +  TcpExtTCPOFOQueue               1354               0.0
> +  TcpExtTCPOrigDataSent           15                 0.0
> +  IpExtInOctets                   311732432          0.0
> +  IpExtOutOctets                  1785119            0.0
> +  IpExtInNoECTPkts                214032             0.0
> +
> +Client side::
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    33836              0.0
> +  IpInDelivers                    33836              0.0
> +  IpOutRequests                   43786              0.0
> +  TcpActiveOpens                  11                 0.0
> +  TcpEstabResets                  10                 0.0
> +  TcpInSegs                       33836              0.0
> +  TcpOutSegs                      214072             0.0
> +  TcpRetransSegs                  3876               0.0
> +  TcpExtDelayedACKs               7                  0.0
> +  TcpExtTCPHPHits                 5                  0.0
> +  TcpExtTCPPureAcks               2719               0.0
> +  TcpExtTCPHPAcks                 31071              0.0
> +  TcpExtTCPSackRecovery           607                0.0
> +  TcpExtTCPSACKReorder            61                 0.0
> +  TcpExtTCPLostRetransmit         90                 0.0
> +  TcpExtTCPFastRetrans            3806               0.0
> +  TcpExtTCPSlowStartRetrans       62                 0.0
> +  TcpExtTCPLossProbes             38                 0.0
> +  TcpExtTCPSackRecoveryFail       8                  0.0
> +  TcpExtTCPSackShifted            203                0.0
> +  TcpExtTCPSackMerged             778                0.0
> +  TcpExtTCPSackShiftFallback      700                0.0
> +  TcpExtTCPSpuriousRtxHostQueues  4                  0.0
> +  TcpExtTCPAutoCorking            14                 0.0
> +  TcpExtTCPOrigDataSent           214038             0.0
> +  TcpExtTCPHystartTrainDetect     8                  0.0
> +  TcpExtTCPHystartTrainCwnd       172                0.0
> +  IpExtInOctets                   1785227            0.0
> +  IpExtOutOctets                  317789680          0.0
> +  IpExtInNoECTPkts                33836              0.0
> +
> +The TcpOutSegs and IpOutRequests on the server are 33836, exactly the
> +same as IpExtInNoECTPkts, IpInReceives, IpInDelivers and TcpInSegs on
> +the client side. During iperf3 test, the server only reply very short
> +packets, so gso and gro has no effect on the server's reply.
> +
> +On the client side, TcpOutSegs is 214072, IpOutRequests is 43786, the
> +tcp layer packet out is larger than ip layer packet out, because
> +TcpOutSegs count the packet number after gso, but IpOutRequests
> +doesn't. On the server side, IpExtInNoECTPkts is 214032, this number
> +is smaller a little than the TcpOutSegs on the client side (214072), it
> +might cause by the packet loss. The IpInReceives, IpInDelivers and
> +TcpInSegs are obviously smaller than the TcpOutSegs on the client side,
> +because these counters calculate the packet after gro.
> +
> +tcp counters in established state
> +================================
> +
> +Run nc on server::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +Run nc on client:
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +
> +Input a string in the nc client ('hello' in our example):
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +  hello
> +
> +The client side nstat output:
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPPureAcks               1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +  IpExtInOctets                   52                 0.0
> +  IpExtOutOctets                  58                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +The server side nstat output:
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  IpExtInOctets                   58                 0.0
> +  IpExtOutOctets                  52                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +Input a string in nc client side again ('world' in our exmaple):
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +  hello
> +  world
> +
> +Client side nstat output:
> +
> +  nstatuser@nstat-a:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPHPAcks                 1                  0.0
> +  TcpExtTCPOrigDataSent           1                  0.0
> +  IpExtInOctets                   52                 0.0
> +  IpExtOutOctets                  58                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +
> +Server side nstat output:
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    1                  0.0
> +  IpInDelivers                    1                  0.0
> +  IpOutRequests                   1                  0.0
> +  TcpInSegs                       1                  0.0
> +  TcpOutSegs                      1                  0.0
> +  TcpExtTCPHPHits                 1                  0.0
> +  IpExtInOctets                   58                 0.0
> +  IpExtOutOctets                  52                 0.0
> +  IpExtInNoECTPkts                1                  0.0
> +
> +Compare the first client side output and the second client side
> +output, we could find one difference: the first one had a
> +'TcpExtTCPPureAcks', but the second one had a
> +'TcpExtTCPHPAcks'. The first server side output and the second server
> +side output had a difference too: the second server side output had a
> +TcpExtTCPHPHits, but the first server side output didn't have it. The
> +network traffic patterns were exactly the same: the client sent a packet to 
> the server, the server replied an ack. But kernel handled them in different
> +ways. When kernel receives a tpc packet in the established status,
> +kernel has two paths to handle the packet, one is fast path, another
> +is slow path. The comment in kernel code provides a good explanation of
> +them, I paste them below:
> +
> +  It is split into a fast path and a slow path. The fast path is
> +  disabled when:
> +  - A zero window was announced from us - zero window probing
> +    is only handled properly on the slow path.
> +  - Out of order segments arrived.
> +  - Urgent data is expected.
> +  - There is no buffer space left
> +  - Unexpected TCP flags/window values/header lengths are received
> +    (detected by checking the TCP header against pred_flags)
> +  - Data is sent in both directions. The fast path only supports pure senders
> +    or pure receivers (this means either the sequence number or the ack
> +    value must stay constant)
> +  - Unexpected TCP option.
> +
> +Kernel will try to use fast path unless any of the above conditions
> +are satisfied. If the packets are out of order, kernel will handle
> +them in slow path, which means the performance might be not very
> +good. Kernel would also come into slow path if the "Delayed ack" is
> +used, because when using "Delayed ack", the data is sent in both
> +directions. When the tcp window scale option is not used, kernel will
> +try to enable fast path immediately when the connection comes into the 
> established
> +state, but if the tcp window scale option is used, kernel will disable
> +the fast path at first, and try to enable it after kerenl receives
> +packets. We could use the 'ss' command to verify whether the window
> +scale option is used. e.g. run below command on either server or
> +client:
> +
> +  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport 
> = :9000 )
> +  Netid    Recv-Q     Send-Q            Local Address:Port             Peer 
> Address:Port
> +  tcp      0          0               192.168.122.250:40654         
> 192.168.122.251:9000
> +             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 
> pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 
> send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 
> 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98
> +
> +The 'wscale:7,7' means both server and client set the window scale
> +option to 7. Now we could explain the nstat output in our test:
> +
> +In the first nstat output of client side, the client sent a packet, server
> +reply an ack, when kernel handled this ack, the fast path was not
> +enabled, so the ack was counted into 'TcpExtTCPPureAcks'.
> +In the second nstat output of client side, the client sent a packet again,
> +and received another ack from the server, this time, the fast path is
> +enabled, and the ack was qualified for fast path, so it was handled by
> +the fast path, so this ack was counted into TcpExtTCPHPAcks.
> +In the first nstat output of server side, the fast path was not enabled,
> +so there was no 'TcpExtTCPHPHits'.
> +In the second nstat output of server side, the fast path was enabled,
> +and the packet received from client qualified for fast path, so it
> +was counted into 'TcpExtTCPHPHits'.
> +
> +tcp abort
> +========
> +
> +Some counters indicate the reaons why tcp layer want to send a rst,
> +they are:
> +* TcpExtTCPAbortOnData
> +* TcpExtTCPAbortOnClose
> +* TcpExtTCPAbortOnMemory
> +* TcpExtTCPAbortOnTimeout
> +* TcpExtTCPAbortOnLinger
> +* TcpExtTCPAbortFailed
> +
> +TcpExtTCPAbortOnData
> +-------------------
> +
> +It means tcp layer has data in flight, but need to close the
> +connection. So tcp layer sends a rst to the other side, indicate the
> +connection is not closed very graceful. An easy way to increase this
> +counter is using the SO_LINGER option. Please refer to the SO_LINGER
> +section of the socket man page:
> +
> +http://man7.org/linux/man-pages/man7/socket.7.html).
> +
> +By default, when an application closes a connection, the close function
> +will return immediately and kernel will try to send the in-flight data
> +async. If you use the SO_LINGER option, set l_onoff to 1, and l_linger
> +to a positive number, the close function won't return immediately, but
> +wait for the in-flight data are acked by the other side, the max wait
> +time is l_linger seconds. If set l_onoff to 1 and set l_linger to 0,
> +when the application closes a connection, kernel will send an rst
> +immediately, and increase the TcpExtTCPAbortOnData counter.
> +
> +We run nc on the server side::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +Run below python code on the client side::
> +
> +  import socket
> +  import struct
> +
> +  server = 'nstat-b' # server address
> +  port = 9000
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
> +  s.connect((server, port))
> +  s.close()
> +
> +On client side, we could see TcpExtTCPAbortOnData increased::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnData            1                  0.0
> +
> +If we capture packet by tcpdump, we could see the client send rst
> +instead of fin.
> +
> +
> +TcpExtTCPAbortOnClose
> +--------------------
> +
> +This counter means the tcp layer has unread data when an application
> +want to close a connection.
> +
> +On the server side, we run below python script:
> +
> +  import socket
> +  import time
> +
> +  port = 9000
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(1)
> +  sock, addr = s.accept()
> +  while True:
> +      time.sleep(9999999)
> +
> +This python script listen on 9000 port, but doesn't read anything from
> +the connection.
> +
> +On the client side, we send the string "hello" by nc:
> +
> +  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000
> +
> +Then, we come back to the server side, the server has received the "hello"
> +packet, and tcp layer has acked this packet, but the application didn't
> +read it yet. We type Ctrl-C to terminate the server script. Then we
> +could find TcpExtTCPAbortOnClose increased 1 on the server side:
> +
> +  nstatuser@nstat-b:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnClose           1                  0.0
> +
> +If we run tcpdump on the server side, we could find the server sent a
> +rst after we type Ctrl-C.
> +
> +TcpExtTCPAbortOnMemory
> +--------------------
> +
> +When an application closes a tcp connection, kernel still need to track
> +the connection, let it complete the tcp disconnect process. E.g. an
> +app calls the close method of a socket, kernel sends fin to the other
> +side of the connection, then the app has no relationship with the
> +socket any more, but kernel need to keep the socket, this socket
> +becomes an orphan socket, kernel waits for the reply of the other side,
> +and would come to the TIME_WAIT state finally. When kernel has no
> +enough memory to keep the orphan socket, kernel would send an rst to
> +the other side, and delete the socket, in such situation, kernel will
> +increase 1 to the TcpExtTCPAbortOnMemory. Two conditions would trigger
> +TcpExtTCPAbortOnMemory:
> +
> +* the memory used by tcp protocol is higher than the third value of
> +the tcp_mem. Please refer the tcp_mem section in the tcp man page:
> +
> +http://man7.org/linux/man-pages/man7/tcp.7.html
> +
> +* the orphan socket count is higher than net.ipv4.tcp_max_orphans
> +
> +Below is an example which let the orphan socket count be higher than
> +net.ipv4.tcp_max_orphans.
> +
> +Change tcp_max_orphans to a smaller value on client::
> +
> +  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"
> +
> +Client code (create 64 connection to server)::
> +
> +  nstatuser@nstat-a:~$ cat client_orphan.py
> +  import socket
> +  import time
> +
> +  server = 'nstat-b' # server address
> +  port = 9000
> +
> +  count = 64
> +
> +  connection_list = []
> +
> +  for i in range(64):
> +      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +      s.connect((server, port))
> +      connection_list.append(s)
> +      print("connection_count: %d" % len(connection_list))
> +
> +  while True:
> +      time.sleep(99999)
> +
> +Server code (accept 64 connection from client)::
> +
> +  nstatuser@nstat-b:~$ cat server_orphan.py
> +  import socket
> +  import time
> +
> +  port = 9000
> +  count = 64
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(count)
> +  connection_list = []
> +  while True:
> +      sock, addr = s.accept()
> +      connection_list.append((sock, addr))
> +      print("connection_count: %d" % len(connection_list))
> +
> +Run the python scripts on server and client.
> +
> +On server::
> +
> +  python3 server_orphan.py
> +
> +On client::
> +
> +  python3 client_orphan.py
> +
> +Run iptables on server::
> +
> +  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP
> +
> +Type Ctrl-C on client, stop client_orphan.py.
> +
> +Check TcpExtTCPAbortOnMemory on client::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnMemory          54                 0.0
> +
> +Check orphane socket count on client::
> +
> +  nstatuser@nstat-a:~$ ss -s
> +  Total: 131 (kernel 0)
> +  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports > 0
> +
> +  Transport Total     IP        IPv6
> +  *         0         -         -
> +  RAW       1         0         1
> +  UDP       1         1         0
> +  TCP       14        13        1
> +  INET      16        14        2
> +  FRAG      0         0         0
> +
> +The explanation of the test: after run server_orphan.py and
> +client_orphan.py, we set up 64 connections between server and
> +client. Run the iptables command, the server will drop all packets from
> +the client, type Ctrl-C on client_orphan.py, the system of the client
> +would try to close these connections, and before they are closed
> +gracefully, these connections became orphan sockets. As the iptables
> +of the server blocked packets from the client, the server won't receive fin
> +from the client, so all connection on clients would be stuck on FIN_WAIT_1
> +stage, so they will keep as orphan sockets until timeout. We have echo
> +10 to /proc/sys/net/ipv4/tcp_max_orphans, so the client system would
> +only keep 10 orphan sockets, for all other orphan sockets, the client
> +system sent rst for them and delete them. We have 64 connections, so
> +the 'ss -s' command shows the system has 10 orphan sockets, and the
> +value of TcpExtTCPAbortOnMemory was 54.
> +
> +An additional explanation about orphan socket count: You could find the
> +exactly orphan socket count by the 'ss -s' command, but when kernel
> +decide whither increases TcpExtTCPAbortOnMemory and sends rst, kernel
> +doesn't always check the exactly orphan socket count. For increasing
> +performance, kernel checks an approximate count firstly, if the
> +approximate count is more than tcp_max_orphans, kernel checks the
> +exact count again. So if the approximate count is less than
> +tcp_max_orphans, but exactly count is more than tcp_max_orphans, you
> +would find TcpExtTCPAbortOnMemory is not increased at all. If
> +tcp_max_orphans is large enough, it won't occur, but if you decrease
> +tcp_max_orphans to a small value like our test, you might find this
> +issue. So in our test, the client set up 64 connections although the
> +tcp_max_orphans is 10. If the client only set up 11 connections, we
> +can't find the change of TcpExtTCPAbortOnMemory.
> +
> +TcpExtTCPAbortOnTimeout
> +----------------------
> +This counter will increase when any of the tcp timers expire. In this
> +situation, kernel won't send rst, just give up the connection.
> +Continue the previous test, we wait for several minutes, because the
> +iptables on the server blocked the traffic, the server wouldn't receive
> +fin, and all the client's orphan sockets would timeout on the
> +FIN_WAIT_1 state finally. So we wait for a few minutes, we could find
> +10 timeout on the client::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnTimeout         10                 0.0
> +
> +TcpExtTCPAbortOnLinger
> +---------------------
> +When a tcp connection comes into FIN_WAIT_2 state, instead of waiting
> +for the fin packet from the other side, kernel could send a rst and
> +delete the socket immediately. This is not the default behavior of
> +linux kernel tcp stack, but after configuring socket option, you could
> +let kernel follow this behavior. Below is an example.
> +
> +The server side code::
> +
> +  nstatuser@nstat-b:~$ cat server_linger.py
> +  import socket
> +  import time
> +
> +  port = 9000
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.bind(('0.0.0.0', port))
> +  s.listen(1)
> +  sock, addr = s.accept()
> +  while True:
> +      time.sleep(9999999)
> +
> +The client side code::
> +
> +  nstatuser@nstat-a:~$ cat client_linger.py
> +  import socket
> +  import struct
> +
> +  server = 'nstat-b' # server address
> +  port = 9000
> +
> +  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
> +  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
> +  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
> +  s.connect((server, port))
> +  s.close()
> +
> +Run server_linger.py on server::
> +
> +  nstatuser@nstat-b:~$ python3 server_linger.py
> +
> +Run client_linger.py on client::
> +
> +  nstatuser@nstat-a:~$ python3 client_linger.py
> +
> +After run client_linger.py, check the output of nstat::
> +
> +  nstatuser@nstat-a:~$ nstat | grep -i abort
> +  TcpExtTCPAbortOnLinger          1                  0.0
> +
> +TcpExtTCPAbortFailed
> +-------------------
> +The kernel tcp layer will send rst if the RFC 2525 2.17 section is satisfied:
> +
> +https://tools.ietf.org/html/rfc2525#page-50
> +
> +If an internal error occurs during this process, TcpExtTCPAbortFailed
> +will be increased.
> +
> +TcpExtListenOverflows and TcpExtListenDrops
> +========================================
> +When kernel receive a syn from a client, and if the tcp accept queue
> +is full, kernel will drop the syn and add 1 to TcpExtListenOverflows.
> +At the same time kernel will also add 1 to TcpExtListenDrops. When
> +a tcp socket is in LISTEN state, and kernel need to drop a packet,
> +kernel would always add 1 to TcpExtListenDrops. So increase
> +TcpExtListenOverflows would let TcpExtListenDrops increasing at the
> +same time, but TcpExtListenDrops would also increase without
> +TcpExtListenOverflows increasing, e.g. a memory allocation fail would
> +also let TcpExtListenDrops increase.
> +
> +Note: The above explain bases on kernel 4.15 or above version, on an
> +old kernel, the tcp stack has different behavior when tcp accept queue
> +is full. On the old kernel, tcp stack won't drop the syn, it would
> +complete the 3-way handshake, but as the accept queue is full, tcp
> +stack will keep the socket in the tcp half-open queue. As it is in the
> +half open queue, tcp stack will send syn+ack on an exponential backoff
> +timer, after client replies ack, tcp stack checks whether the accept
> +queue is still full, if it is not full, move the socket to accept
> +queue, if it is full, keeps the socket in the half-open queue, at next
> +time client replies ack, this socket will get another chance to move
> +to the accept queue.
> +
> +Here is an example:
> +
> +On server, run the nc command, listen on port 9000::
> +
> +  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
> +  Listening on [0.0.0.0] (family 0, port 9000)
> +
> +On client, run 3 nc commands in different terminals::
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +  Connection to nstat-b 9000 port [tcp/*] succeeded!
> +
> +The nc command only accepts 1 connection, and the accept queue length
> +is 1. On current linux implementation, set queue length to n means the
> +actual queue length is n+1. Now we create 3 connections, 1 is accepted
> +by nc, 2 in accepted queue, so the accept queue is full.
> +
> +Before running the 4th nc, we clean the nstat history on the server:
> +
> +  nstatuser@nstat-b:~$ nstat -n
> +
> +Run the 4th nc on the client:
> +
> +  nstatuser@nstat-a:~$ nc -v nstat-b 9000
> +
> +If the nc server is running on kernel 4.15 or higher version, you
> +won't see the "Connection to ... succeeded!" string, because kernel
> +will drop the syn if the accept queue is full. If the nc client is running
> +on an old kernel, you could see that the connection is succeeded,
> +because kernel would complete the 3-way handshake and keep the socket
> +on the half-open queue.
> +
> +Our test is on kernel 4.15, run nstat on the server:
> +
> +  nstatuser@nstat-b:~$ nstat
> +  #kernel
> +  IpInReceives                    4                  0.0
> +  IpInDelivers                    4                  0.0
> +  TcpInSegs                       4                  0.0
> +  TcpExtListenOverflows           4                  0.0
> +  TcpExtListenDrops               4                  0.0
> +  IpExtInOctets                   240                0.0
> +  IpExtInNoECTPkts                4                  0.0
> +
> +We can see both TcpExtListenOverflows and TcpExtListenDrops are 4. If
> +the time between the 4th nc and the nstat is longer, the value of
> +TcpExtListenOverflows and TcpExtListenDrops will be larger, because
> +the syn of the 4th nc is dropped, it keeps retrying.
> +
> --
> 2.17.1
>

Re: [PATCH] add an initial version of snmp_counter.rst

Reply via email to