On 03.11.2018 at 01:18, Paweł Staszewski wrote:
On 01.11.2018 at 21:37, Saeed Mahameed wrote:
On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:
On 01.11.2018 at 10:50, Saeed Mahameed wrote:
On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:
Hi
So maybe someone will be interested in how the Linux kernel handles
normal traffic (not pktgen :) )
Server HW configuration:
CPU: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
NICs: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
Server software:
FRR - as routing daemon
enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to the local NUMA node)
enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to the local NUMA node)
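(The RSS-to-NUMA pinning is presumably the usual smp_affinity setup -
a sketch, assuming the mlx5 completion IRQs carry the standard
"mlx5_comp" names and that node 1 holds CPUs 14-27 and 42-55, as the
mpstat output further down suggests:

for irq in $(awk '/mlx5_comp/ {sub(":","",$1); print $1}' /proc/interrupts); do
    echo fffc00,0fffc000 > /proc/irq/$irq/smp_affinity   # hex mask of CPUs 14-27,42-55
done

or simply via the set_irq_affinity.sh helper shipped with the
Mellanox driver tools.)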
Maximum traffic that the server can handle:

Bandwidth:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
  iface           Rx           Tx           Total
==============================================================================
  enp175s0f1:   28.51 Gb/s   37.24 Gb/s    65.74 Gb/s
  enp175s0f0:   38.07 Gb/s   28.44 Gb/s    66.51 Gb/s
------------------------------------------------------------------------------
  total:        66.58 Gb/s   65.67 Gb/s   132.25 Gb/s
Packets per second:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
  iface             Rx                 Tx                 Total
==============================================================================
  enp175s0f1:   5248589.00 P/s     3486617.75 P/s     8735207.00 P/s
  enp175s0f0:   3557944.25 P/s     5232516.00 P/s     8790460.00 P/s
------------------------------------------------------------------------------
  total:        8806533.00 P/s     8719134.00 P/s    17525668.00 P/s
After reaching those limits, the NICs on the upstream side (more RX
traffic) start to drop packets.
I just don't understand why the server can't handle more bandwidth
(~40Gbit/s is the limit where all cpus are at 100% util) while pps on
the RX side keep increasing.
Where do you see 40 Gb/s? You showed that both ports on the same NIC
(same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25
Gb/s, which aligns with your pcie link limit. What am I missing?
Hmm yes, that was my concern also - I can't find any information on
whether that bandwidth figure is uni- or bidirectional - so if
126Gbit for x16 8GT is unidir, then bidir would be 126/2 ≈ 63Gbit,
which would fit the total bw on both ports.
I think it is bidir.
So yes - we are hitting some other problem there, I think. pcie is
most probably bidirectional (full duplex): max bw 126Gbit, so RX
126Gbit and at the same time TX should be 126Gbit.
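For reference, the raw arithmetic (ignoring TLP/DLLP protocol
overhead, which lowers the usable figure further):

8 GT/s x 16 lanes = 128 Gbit/s raw per direction
128 Gbit/s x 128/130 (Gen3 encoding) = ~126 Gbit/s per direction

PCIe Gen3 is full duplex, so the ~126 Gbit/s should be available for
RX and TX simultaneously rather than shared between them.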
So the one 2-port 100G ConnectX-4 card was replaced with two separate
ConnectX-5 cards, placed in two different pcie x16 gen 3.0 slots:
lspci -vvv -s af:00.0
af:00.0 Ethernet controller: Mellanox Technologies MT27800 Family
[ConnectX-5]
Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 90
NUMA node: 1
Region 0: Memory at 39bffe000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at ee600000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+
AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not
supported, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+,
LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-,
LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance-
SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+,
LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: CX515A - ConnectX-5 QSFP28
Read-only fields:
[PN] Part number: MCX515A-CCAT
[EC] Engineering changes: A6
[V2] Vendor specific: MCX515A-CCAT
[SN] Serial number: MT1831J00221
[V3] Vendor specific:
14a5c73bee92e811800098039b1ee5f0
[VA] Vendor specific:
MLX:MODL=CX515A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0
[V0] Vendor specific: PCIeGen3 x16
[RV] Reserved: checksum good, 2 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
AERCap: First Error Pointer: 04, GenCap+ CGenEn-
ChkCap+ ChkEn-
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 0, Total VFs: 0, Number of VFs: 0,
Function Dependency Link: 00
VF offset: 1, stride: 1, Device ID: 1018
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1c0 v1] #19
Kernel driver in use: mlx5_core
d8:00.0 Ethernet controller: Mellanox Technologies MT27800 Family
[ConnectX-5]
Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 159
NUMA node: 1
Region 0: Memory at 39fffe000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at fbe00000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+
AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not
supported, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+,
LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-,
LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance-
SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+,
LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: CX515A - ConnectX-5 QSFP28
Read-only fields:
[PN] Part number: MCX515A-CCAT
[EC] Engineering changes: A6
[V2] Vendor specific: MCX515A-CCAT
[SN] Serial number: MT1831J00169
[V3] Vendor specific:
c06757e6e092e811800098039b1ee520
[VA] Vendor specific:
MLX:MODL=CX515A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0
[V0] Vendor specific: PCIeGen3 x16
[RV] Reserved: checksum good, 2 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
AERCap: First Error Pointer: 04, GenCap+ CGenEn-
ChkCap+ ChkEn-
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 0, Total VFs: 0, Number of VFs: 0,
Function Dependency Link: 00
VF offset: 1, stride: 1, Device ID: 1018
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1c0 v1] #19
Kernel driver in use: mlx5_core
CPU load is lower than with the ConnectX-4 - but it looks like the
bandwidth limit is the same :)
And again after reaching 60Gbit/60Gbit:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
  iface           Rx           Tx           Total
==============================================================================
  enp175s0:     45.09 Gb/s   15.09 Gb/s    60.18 Gb/s
  enp216s0:     15.14 Gb/s   45.19 Gb/s    60.33 Gb/s
------------------------------------------------------------------------------
  total:        60.45 Gb/s   60.48 Gb/s   120.93 Gb/s
the NICs start to drop packets (discards on the NIC that receives
more rx traffic):
ethtool -S enp175s0 |grep 'disc'
      rx_discards_phy: 47265611
after 20 secs:
ethtool -S enp175s0 |grep 'disc'
      rx_discards_phy: 49434472
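That is (49434472 - 47265611) / 20 ≈ 108k packets discarded per
second.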
Current coalescing params:
ethtool -c enp175s0
Coalesce parameters for enp175s0:
Adaptive RX: off TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32651
rx-usecs: 128
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
And perf top:
   PerfTop:   86898 irqs/sec  kernel:99.5%  exact: 0.0% [4000Hz cycles],  (all, 56 CPUs)
-------------------------------------------------------------------------------
12.76% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
8.68% [kernel] [k] mlx5e_sq_xmit
6.47% [kernel] [k] build_skb
4.78% [kernel] [k] fib_table_lookup
4.58% [kernel] [k] memcpy_erms
3.47% [kernel] [k] mlx5e_poll_rx_cq
2.59% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
2.37% [kernel] [k] mlx5e_post_rx_mpwqes
2.33% [kernel] [k] vlan_do_receive
1.94% [kernel] [k] __dev_queue_xmit
1.89% [kernel] [k] mlx5e_poll_tx_cq
1.74% [kernel] [k] ip_finish_output2
1.67% [kernel] [k] dev_gro_receive
1.64% [kernel] [k] ipt_do_table
1.58% [kernel] [k] tcp_gro_receive
1.49% [kernel] [k] pfifo_fast_dequeue
1.28% [kernel] [k] mlx5_eq_int
1.26% [kernel] [k] inet_gro_receive
1.26% [kernel] [k] _raw_spin_lock
1.20% [kernel] [k] __netif_receive_skb_core
1.19% [kernel] [k] irq_entries_start
1.17% [kernel] [k] swiotlb_map_page
1.13% [kernel] [k] vlan_dev_hard_start_xmit
1.12% [kernel] [k] ip_route_input_rcu
0.97% [kernel] [k] __build_skb
0.84% [kernel] [k] _raw_spin_lock_irqsave
0.78% [kernel] [k] kmem_cache_alloc
0.77% [kernel] [k] mlx5e_xmit
0.77% [kernel] [k] dev_hard_start_xmit
0.76% [kernel] [k] ip_forward
0.73% [kernel] [k] netif_skb_features
0.70% [kernel] [k] tasklet_action_common.isra.21
0.58% [kernel] [k] validate_xmit_skb.isra.142
0.55% [kernel] [k] ip_rcv_core.isra.20.constprop.25
0.55% [kernel] [k] mlx5e_page_release
0.55% [kernel] [k] __qdisc_run
0.51% [kernel] [k] __memcpy
0.48% [kernel] [k] kmem_cache_free_bulk
0.48% [kernel] [k] page_frag_free
0.47% [kernel] [k] inet_lookup_ifaddr_rcu
0.47% [kernel] [k] queued_spin_lock_slowpath
0.46% [kernel] [k] pfifo_fast_enqueue
0.43% [kernel] [k] tcp4_gro_receive
0.40% [kernel] [k] skb_gro_receive
0.39% [kernel] [k] skb_release_data
0.38% [kernel] [k] find_busiest_group
0.36% [kernel] [k] _raw_spin_trylock
0.36% [kernel] [k] skb_segment
0.33% [kernel] [k] eth_type_trans
0.32% [kernel] [k] __sched_text_start
0.32% [kernel] [k] __netif_schedule
0.32% [kernel] [k] try_to_wake_up
0.31% [kernel] [k] _raw_spin_lock_irq
0.31% [kernel] [k] __local_bh_enable_ip
Also mpstat:
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    0.06    0.00    1.00    0.02    0.00   21.61    0.00    0.00    0.00   77.32
Average:       0    0.00    0.00    0.60    0.00    0.00    0.00    0.00    0.00    0.00   99.40
Average:       1    0.10    0.00    1.30    0.00    0.00    0.00    0.00    0.00    0.00   98.60
Average:       2    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.80
Average:       3    0.00    0.00    1.60    0.00    0.00    0.00    0.00    0.00    0.00   98.40
Average:       4    0.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
Average:       5    0.20    0.00    4.60    0.00    0.00    0.00    0.00    0.00    0.00   95.20
Average:       6    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.80
Average:       7    0.60    0.00    3.00    0.00    0.00    0.00    0.00    0.00    0.00   96.40
Average:       8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       9    0.70    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.00
Average:      10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      11    0.00    0.00    2.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
Average:      12    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      14    0.00    0.00    1.00    0.00    0.00   50.40    0.00    0.00    0.00   48.60
Average:      15    0.00    0.00    1.30    0.00    0.00   47.90    0.00    0.00    0.00   50.80
Average:      16    0.00    0.00    2.00    0.00    0.00   47.80    0.00    0.00    0.00   50.20
Average:      17    0.00    0.00    1.30    0.00    0.00   50.20    0.00    0.00    0.00   48.50
Average:      18    0.10    0.00    1.10    0.00    0.00   42.40    0.00    0.00    0.00   56.40
Average:      19    0.00    0.00    1.50    0.00    0.00   44.40    0.00    0.00    0.00   54.10
Average:      20    0.00    0.00    1.40    0.00    0.00   45.90    0.00    0.00    0.00   52.70
Average:      21    0.00    0.00    0.70    0.00    0.00   44.50    0.00    0.00    0.00   54.80
Average:      22    0.10    0.00    1.40    0.00    0.00   47.00    0.00    0.00    0.00   51.50
Average:      23    0.00    0.00    0.30    0.00    0.00   45.50    0.00    0.00    0.00   54.20
Average:      24    0.00    0.00    1.60    0.00    0.00   50.00    0.00    0.00    0.00   48.40
Average:      25    0.10    0.00    0.70    0.00    0.00   47.00    0.00    0.00    0.00   52.20
Average:      26    0.00    0.00    1.80    0.00    0.00   48.70    0.00    0.00    0.00   49.50
Average:      27    0.00    0.00    1.10    0.00    0.00   44.80    0.00    0.00    0.00   54.10
Average:      28    0.30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.70
Average:      29    0.10    0.00    0.60    0.00    0.00    0.00    0.00    0.00    0.00   99.30
Average:      30    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.80
Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      32    0.00    0.00    1.20    0.00    0.00    0.00    0.00    0.00    0.00   98.80
Average:      33    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      36    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      37    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      38    0.20    0.00    0.80    0.00    0.00    0.00    0.00    0.00    0.00   99.00
Average:      39    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      40    0.00    0.00    3.30    0.00    0.00    0.00    0.00    0.00    0.00   96.70
Average:      41    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      42    0.00    0.00    0.80    0.00    0.00   45.00    0.00    0.00    0.00   54.20
Average:      43    0.00    0.00    1.60    0.00    0.00   48.30    0.00    0.00    0.00   50.10
Average:      44    0.00    0.00    1.60    0.00    0.00   37.90    0.00    0.00    0.00   60.50
Average:      45    0.30    0.00    1.40    0.00    0.00   32.90    0.00    0.00    0.00   65.40
Average:      46    0.00    0.00    1.50    0.90    0.00   37.60    0.00    0.00    0.00   60.00
Average:      47    0.10    0.00    0.40    0.00    0.00   41.40    0.00    0.00    0.00   58.10
Average:      48    0.20    0.00    1.70    0.00    0.00   38.20    0.00    0.00    0.00   59.90
Average:      49    0.00    0.00    1.40    0.00    0.00   37.20    0.00    0.00    0.00   61.40
Average:      50    0.00    0.00    1.30    0.00    0.00   38.10    0.00    0.00    0.00   60.60
Average:      51    0.00    0.00    0.80    0.00    0.00   39.40    0.00    0.00    0.00   59.80
Average:      52    0.00    0.00    1.70    0.00    0.00   39.50    0.00    0.00    0.00   58.80
Average:      53    0.10    0.00    0.90    0.00    0.00   38.20    0.00    0.00    0.00   60.80
Average:      54    0.00    0.00    1.30    0.00    0.00   42.10    0.00    0.00    0.00   56.60
Average:      55    0.00    0.00    1.60    0.00    0.00   37.70    0.00    0.00    0.00   60.70
So it looks like previously there was also no problem with the pcie
x16.
This can maybe also explain why cpu load rises rapidly from 120Gbit/s
in total to 132Gbit (the bwm-ng counters come from /proc/net, so
there can be some error in reading them when offloading (gro/gso/tso)
is enabled on the nics).
I was thinking that maybe I reached some pcie x16 limit - but x16 8GT
is 126Gbit - and also when testing with pktgen I can reach more bw
and pps (like 4x more compared to normal internet traffic).
Are you forwarding when using pktgen as well, or are you just testing
the RX side pps?
Yes, pktgen was tested on a single port, RX only.
I can also check forwarding, to eliminate pcie limits.
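(For a single-port RX-only test like that, the sender is typically
driven by a pktgen script along these lines - a minimal sketch; the
interface name, destination IP and MAC below are hypothetical:

modprobe pktgen
echo "rem_device_all" > /proc/net/pktgen/kpktgend_0
echo "add_device enp175s0" > /proc/net/pktgen/kpktgend_0
echo "count 0" > /proc/net/pktgen/enp175s0        # 0 = run until stopped
echo "pkt_size 64" > /proc/net/pktgen/enp175s0
echo "dst 10.0.0.2" > /proc/net/pktgen/enp175s0
echo "dst_mac aa:bb:cc:dd:ee:01" > /proc/net/pktgen/enp175s0
echo "start" > /proc/net/pktgen/pgctrl
)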
So this explains why you have more RX pps, since tx is idle and the
pcie will be free to do only rx.
[...]
ethtool -S enp175s0f1
NIC statistics:
rx_packets: 173730800927
rx_bytes: 99827422751332
tx_packets: 142532009512
tx_bytes: 184633045911222
tx_tso_packets: 25989113891
tx_tso_bytes: 132933363384458
tx_tso_inner_packets: 0
tx_tso_inner_bytes: 0
tx_added_vlan_packets: 74630239613
tx_nop: 2029817748
rx_lro_packets: 0
rx_lro_bytes: 0
rx_ecn_mark: 0
rx_removed_vlan_packets: 173730800927
rx_csum_unnecessary: 0
rx_csum_none: 434357
rx_csum_complete: 173730366570
rx_csum_unnecessary_inner: 0
rx_xdp_drop: 0
rx_xdp_redirect: 0
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_csum_none: 38260960853
tx_csum_partial: 36369278774
tx_csum_partial_inner: 0
tx_queue_stopped: 1
tx_queue_dropped: 0
tx_xmit_more: 748638099
tx_recover: 0
tx_cqes: 73881645031
tx_queue_wake: 1
tx_udp_seg_rem: 0
tx_cqe_err: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
rx_wqe_err: 0
rx_mpwqe_filler_cqes: 0
rx_mpwqe_filler_strides: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0
If this is a pcie bottleneck, it might be useful to enable CQE
compression (to reduce PCIe completion descriptor transactions).
You should see the above rx_cqe_compress_pkts increasing when
enabled.
$ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on
$ ethtool --show-priv-flags enp175s0f1
Private flags for p6p1:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : on
...
try this on both interfaces.
Done
ethtool --show-priv-flags enp175s0f1
Private flags for enp175s0f1:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : on
rx_striding_rq : off
rx_no_csum_complete: off
ethtool --show-priv-flags enp175s0f0
Private flags for enp175s0f0:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : on
rx_striding_rq : off
rx_no_csum_complete: off
Did it help reduce the load on the pcie? Do you see more pps?
What is the ratio between rx_cqe_compress_pkts and overall rx
packets?
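(A quick way to eyeball that ratio - a sketch using the counters
shown earlier:

ethtool -S enp175s0f1 | egrep 'rx_packets:|rx_cqe_compress_pkts:'

and divide the two values.)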
[...]
ethtool -S enp175s0f0
NIC statistics:
rx_packets: 141574897253
rx_bytes: 184445040406258
tx_packets: 172569543894
tx_bytes: 99486882076365
tx_tso_packets: 9367664195
tx_tso_bytes: 56435233992948
tx_tso_inner_packets: 0
tx_tso_inner_bytes: 0
tx_added_vlan_packets: 141297671626
tx_nop: 2102916272
rx_lro_packets: 0
rx_lro_bytes: 0
rx_ecn_mark: 0
rx_removed_vlan_packets: 141574897252
rx_csum_unnecessary: 0
rx_csum_none: 23135854
rx_csum_complete: 141551761398
rx_csum_unnecessary_inner: 0
rx_xdp_drop: 0
rx_xdp_redirect: 0
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_csum_none: 127934791664
It is a good idea to look into this: tx is not requesting hw tx
csumming for a lot of packets, so maybe you are wasting a lot of cpu
on calculating csums - or maybe this is just the rx csum complete..
tx_csum_partial: 13362879974
tx_csum_partial_inner: 0
tx_queue_stopped: 232561
TX queues are stalling; this could be an indication of the pcie
bottleneck.
tx_queue_dropped: 0
tx_xmit_more: 1266021946
tx_recover: 0
tx_cqes: 140031716469
tx_queue_wake: 232561
tx_udp_seg_rem: 0
tx_cqe_err: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
rx_wqe_err: 0
rx_mpwqe_filler_cqes: 0
rx_mpwqe_filler_strides: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0
rx_page_reuse: 0
rx_cache_reuse: 16625975793
rx_cache_full: 54161465914
rx_cache_empty: 258048
rx_cache_busy: 54161472735
rx_cache_waive: 0
rx_congst_umr: 0
rx_arfs_err: 0
ch_events: 40572621887
ch_poll: 40885650979
ch_arm: 40429276692
ch_aff_change: 0
ch_eq_rearm: 0
rx_out_of_buffer: 2791690
rx_if_down_packets: 74
rx_vport_unicast_packets: 141843476308
rx_vport_unicast_bytes: 185421265403318
tx_vport_unicast_packets: 172569484005
tx_vport_unicast_bytes: 100019940094298
rx_vport_multicast_packets: 85122935
rx_vport_multicast_bytes: 5761316431
tx_vport_multicast_packets: 6452
tx_vport_multicast_bytes: 643540
rx_vport_broadcast_packets: 22423624
rx_vport_broadcast_bytes: 1390127090
tx_vport_broadcast_packets: 22024
tx_vport_broadcast_bytes: 1321440
rx_vport_rdma_unicast_packets: 0
rx_vport_rdma_unicast_bytes: 0
tx_vport_rdma_unicast_packets: 0
tx_vport_rdma_unicast_bytes: 0
rx_vport_rdma_multicast_packets: 0
rx_vport_rdma_multicast_bytes: 0
tx_vport_rdma_multicast_packets: 0
tx_vport_rdma_multicast_bytes: 0
tx_packets_phy: 172569501577
rx_packets_phy: 142871314588
rx_crc_errors_phy: 0
tx_bytes_phy: 100710212814151
rx_bytes_phy: 187209224289564
tx_multicast_phy: 6452
tx_broadcast_phy: 22024
rx_multicast_phy: 85122933
rx_broadcast_phy: 22423623
rx_in_range_len_errors_phy: 2
rx_out_of_range_len_phy: 0
rx_oversize_pkts_phy: 0
rx_symbol_err_phy: 0
tx_mac_control_phy: 0
rx_mac_control_phy: 0
rx_unsupported_op_phy: 0
rx_pause_ctrl_phy: 0
tx_pause_ctrl_phy: 0
rx_discards_phy: 920161423
Ok, this port seems to be suffering more; RX is congested, maybe due
to the pcie bottleneck.
Yes, this side is receiving more traffic - the second port is doing
+10G more tx.
[...]
Average:      17    0.00    0.00   16.60    0.00    0.00   52.10    0.00    0.00    0.00   31.30
Average:      18    0.00    0.00   13.90    0.00    0.00   61.20    0.00    0.00    0.00   24.90
Average:      19    0.00    0.00    9.99    0.00    0.00   70.33    0.00    0.00    0.00   19.68
Average:      20    0.00    0.00    9.00    0.00    0.00   73.00    0.00    0.00    0.00   18.00
Average:      21    0.00    0.00    8.70    0.00    0.00   73.90    0.00    0.00    0.00   17.40
Average:      22    0.00    0.00   15.42    0.00    0.00   58.56    0.00    0.00    0.00   26.03
Average:      23    0.00    0.00   10.81    0.00    0.00   71.67    0.00    0.00    0.00   17.52
Average:      24    0.00    0.00   10.00    0.00    0.00   71.80    0.00    0.00    0.00   18.20
Average:      25    0.00    0.00   11.19    0.00    0.00   71.13    0.00    0.00    0.00   17.68
Average:      26    0.00    0.00   11.00    0.00    0.00   70.80    0.00    0.00    0.00   18.20
Average:      27    0.00    0.00   10.01    0.00    0.00   69.57    0.00    0.00    0.00   20.42
The numa cores are not at 100% util; you have around 20% idle on
each one.
Yes - not 100% cpu - but the difference between 80% and 100% is like
pushing an additional 1-2Gbit/s.
Yes, but it doesn't look like the bottleneck is the cpu, although it
is close to being :)..
Average:      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      32    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      33    0.00    0.00    3.90    0.00    0.00    0.00    0.00    0.00    0.00   96.10
Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      36    0.10    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.70
Average:      37    0.20    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.50
Average:      38    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      39    0.00    0.00    2.60    0.00    0.00    0.00    0.00    0.00    0.00   97.40
Average:      40    0.00    0.00    0.90    0.00    0.00    0.00    0.00    0.00    0.00   99.10
Average:      41    0.10    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.40
Average:      42    0.00    0.00    9.91    0.00    0.00   70.67    0.00    0.00    0.00   19.42
Average:      43    0.00    0.00   15.90    0.00    0.00   57.50    0.00    0.00    0.00   26.60
Average:      44    0.00    0.00   12.20    0.00    0.00   66.20    0.00    0.00    0.00   21.60
Average:      45    0.00    0.00   12.00    0.00    0.00   67.50    0.00    0.00    0.00   20.50
Average:      46    0.00    0.00   12.90    0.00    0.00   65.50    0.00    0.00    0.00   21.60
Average:      47    0.00    0.00   14.59    0.00    0.00   60.84    0.00    0.00    0.00   24.58
Average:      48    0.00    0.00   13.59    0.00    0.00   61.74    0.00    0.00    0.00   24.68
Average:      49    0.00    0.00   18.36    0.00    0.00   53.29    0.00    0.00    0.00   28.34
Average:      50    0.00    0.00   15.32    0.00    0.00   58.86    0.00    0.00    0.00   25.83
Average:      51    0.00    0.00   17.60    0.00    0.00   55.20    0.00    0.00    0.00   27.20
Average:      52    0.00    0.00   15.92    0.00    0.00   56.06    0.00    0.00    0.00   28.03
Average:      53    0.00    0.00   13.00    0.00    0.00   62.30    0.00    0.00    0.00   24.70
Average:      54    0.00    0.00   13.20    0.00    0.00   61.50    0.00    0.00    0.00   25.30
Average:      55    0.00    0.00   14.59    0.00    0.00   58.64    0.00    0.00    0.00   26.77
ethtool -k enp175s0f0
Features for enp175s0f0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
ethtool -c enp175s0f0
Coalesce parameters for enp175s0f0:
Adaptive RX: off TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32703
rx-usecs: 256
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
ethtool -g enp175s0f0
Ring parameters for enp175s0f0:
Pre-set maximums:
RX: 8192
RX Mini: 0
RX Jumbo: 0
TX: 8192
Current hardware settings:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
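(If RX ring exhaustion were suspected, the rings could also be raised
to the pre-set maximum - a possible knob to try, not something tested
here:

ethtool -G enp175s0f0 rx 8192 tx 8192
)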
I also changed the coalesce params a little - the best for this
config are:
ethtool -c enp175s0f0
Coalesce parameters for enp175s0f0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32573
rx-usecs: 40
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 8
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
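(Presumably set with something along these lines - the exact
invocation wasn't posted:

ethtool -C enp175s0f0 adaptive-rx off adaptive-tx off rx-usecs 40 rx-frames 128 tx-usecs 8 tx-frames 8
)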
Fewer drops on the RX side - and more pps forwarded overall.
How much improvement? Maybe we can improve our adaptive rx coalescing
to be efficient for this workload.