On 03.11.2018 at 01:18, Paweł Staszewski wrote:
On 01.11.2018 at 21:37, Saeed Mahameed wrote:
On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:
On 01.11.2018 at 10:50, Saeed Mahameed wrote:
On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:
Hi
So maybe someone will be interested in how the Linux kernel handles
normal traffic (not pktgen :) )
Server HW configuration:
CPU: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
NICs: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
Server software:
FRR - as routing daemon
enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS queues bound to the local NUMA node)
enp175s0f1 (100G) - 343 vlans to clients (28 RSS queues bound to the local NUMA node)
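(The RSS-to-NUMA pinning is presumably the usual smp_affinity setup -
a sketch, assuming the mlx5 completion IRQs carry the standard
"mlx5_comp" names and that node 1 holds CPUs 14-27 and 42-55, as the
mpstat output further down suggests:

for irq in $(awk '/mlx5_comp/ {sub(":","",$1); print $1}' /proc/interrupts); do
    echo fffc00,0fffc000 > /proc/irq/$irq/smp_affinity   # hex mask of CPUs 14-27,42-55
done

or simply via the set_irq_affinity.sh helper shipped with the
Mellanox driver tools.)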
Maximum traffic that the server can handle:

Bandwidth:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
  iface           Rx           Tx           Total
==============================================================================
  enp175s0f1:   28.51 Gb/s   37.24 Gb/s    65.74 Gb/s
  enp175s0f0:   38.07 Gb/s   28.44 Gb/s    66.51 Gb/s
------------------------------------------------------------------------------
  total:        66.58 Gb/s   65.67 Gb/s   132.25 Gb/s
Packets per second:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
  iface             Rx                 Tx                 Total
==============================================================================
  enp175s0f1:   5248589.00 P/s     3486617.75 P/s     8735207.00 P/s
  enp175s0f0:   3557944.25 P/s     5232516.00 P/s     8790460.00 P/s
------------------------------------------------------------------------------
  total:        8806533.00 P/s     8719134.00 P/s    17525668.00 P/s
After reaching those limits, the NICs on the upstream side (more RX
traffic) start to drop packets.
I just don't understand why the server can't handle more bandwidth
(~40Gbit/s is the limit where all cpus are at 100% util) while pps on
the RX side keep increasing.
Where do you see 40 Gb/s? You showed that both ports on the same NIC
(same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25
Gb/s, which aligns with your pcie link limit. What am I missing?
Hmm yes, that was my concern also - I can't find any information on
whether that bandwidth figure is uni- or bidirectional - so if
126Gbit for x16 8GT is unidir, then bidir would be 126/2 ≈ 63Gbit,
which would fit the total bw on both ports.
I think it is bidir.
So yes - we are hitting some other problem there, I think. pcie is
most probably bidirectional (full duplex): max bw 126Gbit, so RX
126Gbit and at the same time TX should be 126Gbit.
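For reference, the raw arithmetic (ignoring TLP/DLLP protocol
overhead, which lowers the usable figure further):

8 GT/s x 16 lanes = 128 Gbit/s raw per direction
128 Gbit/s x 128/130 (Gen3 encoding) = ~126 Gbit/s per direction

PCIe Gen3 is full duplex, so the ~126 Gbit/s should be available for
RX and TX simultaneously rather than shared between them.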
So the one 2-port 100G ConnectX-4 card was replaced with two separate
ConnectX-5 cards, placed in two different pcie x16 gen 3.0 slots:
lspci -vvv -s af:00.0
af:00.0 Ethernet controller: Mellanox Technologies MT27800 Family
[ConnectX-5]
Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 90
NUMA node: 1
Region 0: Memory at 39bffe000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at ee600000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+
AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not
supported, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+,
LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-,
LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance-
SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+,
LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: CX515A - ConnectX-5 QSFP28
Read-only fields:
[PN] Part number: MCX515A-CCAT
[EC] Engineering changes: A6
[V2] Vendor specific: MCX515A-CCAT
[SN] Serial number: MT1831J00221
[V3] Vendor specific:
14a5c73bee92e811800098039b1ee5f0
[VA] Vendor specific:
MLX:MODL=CX515A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0
[V0] Vendor specific: PCIeGen3 x16
[RV] Reserved: checksum good, 2 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
AERCap: First Error Pointer: 04, GenCap+ CGenEn-
ChkCap+ ChkEn-
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 0, Total VFs: 0, Number of VFs: 0,
Function Dependency Link: 00
VF offset: 1, stride: 1, Device ID: 1018
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1c0 v1] #19
Kernel driver in use: mlx5_core
d8:00.0 Ethernet controller: Mellanox Technologies MT27800 Family
[ConnectX-5]
Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 32 bytes
Interrupt: pin A routed to IRQ 159
NUMA node: 1
Region 0: Memory at 39fffe000000 (64-bit, prefetchable) [size=32M]
Expansion ROM at fbe00000 [disabled] [size=1M]
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
unlimited, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable- Non-Fatal- Fatal-
Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+
AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not
supported, Exit Latency L0s unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+,
LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-,
LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance-
SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+,
LinkEqualizationRequest-
Capabilities: [48] Vital Product Data
Product Name: CX515A - ConnectX-5 QSFP28
Read-only fields:
[PN] Part number: MCX515A-CCAT
[EC] Engineering changes: A6
[V2] Vendor specific: MCX515A-CCAT
[SN] Serial number: MT1831J00169
[V3] Vendor specific:
c06757e6e092e811800098039b1ee520
[VA] Vendor specific:
MLX:MODL=CX515A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0
[V0] Vendor specific: PCIeGen3 x16
[RV] Reserved: checksum good, 2 byte(s) reserved
End
Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
Vector table: BAR=0 offset=00002000
PBA: BAR=0 offset=00003000
Capabilities: [c0] Vendor Specific Information: Len=18 <?>
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA
PME(D0-,D1-,D2-,D3hot-,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt-
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt-
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
AERCap: First Error Pointer: 04, GenCap+ CGenEn-
ChkCap+ ChkEn-
Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
IOVCap: Migration-, Interrupt Message Number: 000
IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
IOVSta: Migration-
Initial VFs: 0, Total VFs: 0, Number of VFs: 0,
Function Dependency Link: 00
VF offset: 1, stride: 1, Device ID: 1018
Supported Page Size: 000007ff, System Page Size: 00000001
Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
VF Migration: offset: 00000000, BIR: 0
Capabilities: [1c0 v1] #19
Kernel driver in use: mlx5_core
CPU load is lower than with the ConnectX-4 - but it looks like the
bandwidth limit is the same :)
And again after reaching 60Gbit/60Gbit:
bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
input: /proc/net/dev type: rate
  iface           Rx           Tx           Total
==============================================================================
  enp175s0:     45.09 Gb/s   15.09 Gb/s    60.18 Gb/s
  enp216s0:     15.14 Gb/s   45.19 Gb/s    60.33 Gb/s
------------------------------------------------------------------------------
  total:        60.45 Gb/s   60.48 Gb/s   120.93 Gb/s
the NICs start to drop packets (discards on the NIC that receives
more rx traffic):
ethtool -S enp175s0 |grep 'disc'
      rx_discards_phy: 47265611
after 20 secs:
ethtool -S enp175s0 |grep 'disc'
      rx_discards_phy: 49434472
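That is (49434472 - 47265611) / 20 ≈ 108k packets discarded per
second.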
Current coalescing params:
ethtool -c enp175s0
Coalesce parameters for enp175s0:
Adaptive RX: off TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32651
rx-usecs: 128
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
And perf top:
   PerfTop:   86898 irqs/sec  kernel:99.5%  exact: 0.0% [4000Hz cycles],  (all, 56 CPUs)
-------------------------------------------------------------------------------
12.76% [kernel] [k] mlx5e_skb_from_cqe_mpwrq_linear
8.68% [kernel] [k] mlx5e_sq_xmit
6.47% [kernel] [k] build_skb
4.78% [kernel] [k] fib_table_lookup
4.58% [kernel] [k] memcpy_erms
3.47% [kernel] [k] mlx5e_poll_rx_cq
2.59% [kernel] [k] mlx5e_handle_rx_cqe_mpwrq
2.37% [kernel] [k] mlx5e_post_rx_mpwqes
2.33% [kernel] [k] vlan_do_receive
1.94% [kernel] [k] __dev_queue_xmit
1.89% [kernel] [k] mlx5e_poll_tx_cq
1.74% [kernel] [k] ip_finish_output2
1.67% [kernel] [k] dev_gro_receive
1.64% [kernel] [k] ipt_do_table
1.58% [kernel] [k] tcp_gro_receive
1.49% [kernel] [k] pfifo_fast_dequeue
1.28% [kernel] [k] mlx5_eq_int
1.26% [kernel] [k] inet_gro_receive
1.26% [kernel] [k] _raw_spin_lock
1.20% [kernel] [k] __netif_receive_skb_core
1.19% [kernel] [k] irq_entries_start
1.17% [kernel] [k] swiotlb_map_page
1.13% [kernel] [k] vlan_dev_hard_start_xmit
1.12% [kernel] [k] ip_route_input_rcu
0.97% [kernel] [k] __build_skb
0.84% [kernel] [k] _raw_spin_lock_irqsave
0.78% [kernel] [k] kmem_cache_alloc
0.77% [kernel] [k] mlx5e_xmit
0.77% [kernel] [k] dev_hard_start_xmit
0.76% [kernel] [k] ip_forward
0.73% [kernel] [k] netif_skb_features
0.70% [kernel] [k] tasklet_action_common.isra.21
0.58% [kernel] [k] validate_xmit_skb.isra.142
0.55% [kernel] [k] ip_rcv_core.isra.20.constprop.25
0.55% [kernel] [k] mlx5e_page_release
0.55% [kernel] [k] __qdisc_run
0.51% [kernel] [k] __memcpy
0.48% [kernel] [k] kmem_cache_free_bulk
0.48% [kernel] [k] page_frag_free
0.47% [kernel] [k] inet_lookup_ifaddr_rcu
0.47% [kernel] [k] queued_spin_lock_slowpath
0.46% [kernel] [k] pfifo_fast_enqueue
0.43% [kernel] [k] tcp4_gro_receive
0.40% [kernel] [k] skb_gro_receive
0.39% [kernel] [k] skb_release_data
0.38% [kernel] [k] find_busiest_group
0.36% [kernel] [k] _raw_spin_trylock
0.36% [kernel] [k] skb_segment
0.33% [kernel] [k] eth_type_trans
0.32% [kernel] [k] __sched_text_start
0.32% [kernel] [k] __netif_schedule
0.32% [kernel] [k] try_to_wake_up
0.31% [kernel] [k] _raw_spin_lock_irq
0.31% [kernel] [k] __local_bh_enable_ip
Also mpstat:
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    0.06    0.00    1.00    0.02    0.00   21.61    0.00    0.00    0.00   77.32
Average:       0    0.00    0.00    0.60    0.00    0.00    0.00    0.00    0.00    0.00   99.40
Average:       1    0.10    0.00    1.30    0.00    0.00    0.00    0.00    0.00    0.00   98.60
Average:       2    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.80
Average:       3    0.00    0.00    1.60    0.00    0.00    0.00    0.00    0.00    0.00   98.40
Average:       4    0.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
Average:       5    0.20    0.00    4.60    0.00    0.00    0.00    0.00    0.00    0.00   95.20
Average:       6    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.80
Average:       7    0.60    0.00    3.00    0.00    0.00    0.00    0.00    0.00    0.00   96.40
Average:       8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       9    0.70    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.00
Average:      10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      11    0.00    0.00    2.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
Average:      12    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      14    0.00    0.00    1.00    0.00    0.00   50.40    0.00    0.00    0.00   48.60
Average:      15    0.00    0.00    1.30    0.00    0.00   47.90    0.00    0.00    0.00   50.80
Average:      16    0.00    0.00    2.00    0.00    0.00   47.80    0.00    0.00    0.00   50.20
Average:      17    0.00    0.00    1.30    0.00    0.00   50.20    0.00    0.00    0.00   48.50
Average:      18    0.10    0.00    1.10    0.00    0.00   42.40    0.00    0.00    0.00   56.40
Average:      19    0.00    0.00    1.50    0.00    0.00   44.40    0.00    0.00    0.00   54.10
Average:      20    0.00    0.00    1.40    0.00    0.00   45.90    0.00    0.00    0.00   52.70
Average:      21    0.00    0.00    0.70    0.00    0.00   44.50    0.00    0.00    0.00   54.80
Average:      22    0.10    0.00    1.40    0.00    0.00   47.00    0.00    0.00    0.00   51.50
Average:      23    0.00    0.00    0.30    0.00    0.00   45.50    0.00    0.00    0.00   54.20
Average:      24    0.00    0.00    1.60    0.00    0.00   50.00    0.00    0.00    0.00   48.40
Average:      25    0.10    0.00    0.70    0.00    0.00   47.00    0.00    0.00    0.00   52.20
Average:      26    0.00    0.00    1.80    0.00    0.00   48.70    0.00    0.00    0.00   49.50
Average:      27    0.00    0.00    1.10    0.00    0.00   44.80    0.00    0.00    0.00   54.10
Average:      28    0.30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.70
Average:      29    0.10    0.00    0.60    0.00    0.00    0.00    0.00    0.00    0.00   99.30
Average:      30    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.80
Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      32    0.00    0.00    1.20    0.00    0.00    0.00    0.00    0.00    0.00   98.80
Average:      33    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      36    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      37    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      38    0.20    0.00    0.80    0.00    0.00    0.00    0.00    0.00    0.00   99.00
Average:      39    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      40    0.00    0.00    3.30    0.00    0.00    0.00    0.00    0.00    0.00   96.70
Average:      41    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      42    0.00    0.00    0.80    0.00    0.00   45.00    0.00    0.00    0.00   54.20
Average:      43    0.00    0.00    1.60    0.00    0.00   48.30    0.00    0.00    0.00   50.10
Average:      44    0.00    0.00    1.60    0.00    0.00   37.90    0.00    0.00    0.00   60.50
Average:      45    0.30    0.00    1.40    0.00    0.00   32.90    0.00    0.00    0.00   65.40
Average:      46    0.00    0.00    1.50    0.90    0.00   37.60    0.00    0.00    0.00   60.00
Average:      47    0.10    0.00    0.40    0.00    0.00   41.40    0.00    0.00    0.00   58.10
Average:      48    0.20    0.00    1.70    0.00    0.00   38.20    0.00    0.00    0.00   59.90
Average:      49    0.00    0.00    1.40    0.00    0.00   37.20    0.00    0.00    0.00   61.40
Average:      50    0.00    0.00    1.30    0.00    0.00   38.10    0.00    0.00    0.00   60.60
Average:      51    0.00    0.00    0.80    0.00    0.00   39.40    0.00    0.00    0.00   59.80
Average:      52    0.00    0.00    1.70    0.00    0.00   39.50    0.00    0.00    0.00   58.80
Average:      53    0.10    0.00    0.90    0.00    0.00   38.20    0.00    0.00    0.00   60.80
Average:      54    0.00    0.00    1.30    0.00    0.00   42.10    0.00    0.00    0.00   56.60
Average:      55    0.00    0.00    1.60    0.00    0.00   37.70    0.00    0.00    0.00   60.70
So it looks like previously there was also no problem with the pcie
x16.
This can maybe also explain why cpu load rises rapidly from 120Gbit/s
in total to 132Gbit (the bwm-ng counters come from /proc/net, so
there can be some error in reading them when offloading (gro/gso/tso)
is enabled on the nics).
I was thinking that maybe I reached some pcie x16 limit - but x16 8GT
is 126Gbit - and also when testing with pktgen I can reach more bw
and pps (like 4x more compared to normal internet traffic).
Are you forwarding when using pktgen as well, or are you just testing
the RX side pps?
Yes, pktgen was tested on a single port, RX only.
I can also check forwarding, to eliminate pcie limits.
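(For a single-port RX-only test like that, the sender is typically
driven by a pktgen script along these lines - a minimal sketch; the
interface name, destination IP and MAC below are hypothetical:

modprobe pktgen
echo "rem_device_all" > /proc/net/pktgen/kpktgend_0
echo "add_device enp175s0" > /proc/net/pktgen/kpktgend_0
echo "count 0" > /proc/net/pktgen/enp175s0        # 0 = run until stopped
echo "pkt_size 64" > /proc/net/pktgen/enp175s0
echo "dst 10.0.0.2" > /proc/net/pktgen/enp175s0
echo "dst_mac aa:bb:cc:dd:ee:01" > /proc/net/pktgen/enp175s0
echo "start" > /proc/net/pktgen/pgctrl
)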
So this explains why you have more RX pps, since tx is idle and the
pcie will be free to do only rx.
[...]
ethtool -S enp175s0f1
NIC statistics:
rx_packets: 173730800927
rx_bytes: 99827422751332
tx_packets: 142532009512
tx_bytes: 184633045911222
tx_tso_packets: 25989113891
tx_tso_bytes: 132933363384458
tx_tso_inner_packets: 0
tx_tso_inner_bytes: 0
tx_added_vlan_packets: 74630239613
tx_nop: 2029817748
rx_lro_packets: 0
rx_lro_bytes: 0
rx_ecn_mark: 0
rx_removed_vlan_packets: 173730800927
rx_csum_unnecessary: 0
rx_csum_none: 434357
rx_csum_complete: 173730366570
rx_csum_unnecessary_inner: 0
rx_xdp_drop: 0
rx_xdp_redirect: 0
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_csum_none: 38260960853
tx_csum_partial: 36369278774
tx_csum_partial_inner: 0
tx_queue_stopped: 1
tx_queue_dropped: 0
tx_xmit_more: 748638099
tx_recover: 0
tx_cqes: 73881645031
tx_queue_wake: 1
tx_udp_seg_rem: 0
tx_cqe_err: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
rx_wqe_err: 0
rx_mpwqe_filler_cqes: 0
rx_mpwqe_filler_strides: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0
If this is a pcie bottleneck, it might be useful to enable CQE
compression (to reduce PCIe completion descriptor transactions).
You should see the above rx_cqe_compress_pkts increasing when
enabled.
$ ethtool --set-priv-flags enp175s0f1 rx_cqe_compress on
$ ethtool --show-priv-flags enp175s0f1
Private flags for p6p1:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : on
...
try this on both interfaces.
Done
ethtool --show-priv-flags enp175s0f1
Private flags for enp175s0f1:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : on
rx_striding_rq : off
rx_no_csum_complete: off
ethtool --show-priv-flags enp175s0f0
Private flags for enp175s0f0:
rx_cqe_moder : on
tx_cqe_moder : off
rx_cqe_compress : on
rx_striding_rq : off
rx_no_csum_complete: off
Did it help reduce the load on the pcie? Do you see more pps?
What is the ratio between rx_cqe_compress_pkts and overall rx
packets?
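(A quick way to eyeball that ratio - a sketch using the counters
shown earlier:

ethtool -S enp175s0f1 | egrep 'rx_packets:|rx_cqe_compress_pkts:'

and divide the two values.)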
[...]
ethtool -S enp175s0f0
NIC statistics:
rx_packets: 141574897253
rx_bytes: 184445040406258
tx_packets: 172569543894
tx_bytes: 99486882076365
tx_tso_packets: 9367664195
tx_tso_bytes: 56435233992948
tx_tso_inner_packets: 0
tx_tso_inner_bytes: 0
tx_added_vlan_packets: 141297671626
tx_nop: 2102916272
rx_lro_packets: 0
rx_lro_bytes: 0
rx_ecn_mark: 0
rx_removed_vlan_packets: 141574897252
rx_csum_unnecessary: 0
rx_csum_none: 23135854
rx_csum_complete: 141551761398
rx_csum_unnecessary_inner: 0
rx_xdp_drop: 0
rx_xdp_redirect: 0
rx_xdp_tx_xmit: 0
rx_xdp_tx_full: 0
rx_xdp_tx_err: 0
rx_xdp_tx_cqe: 0
tx_csum_none: 127934791664
It is a good idea to look into this: tx is not requesting hw tx
csumming for a lot of packets, so maybe you are wasting a lot of cpu
on calculating csums - or maybe this is just the rx csum complete..
tx_csum_partial: 13362879974
tx_csum_partial_inner: 0
tx_queue_stopped: 232561
TX queues are stalling; this could be an indication of the pcie
bottleneck.
tx_queue_dropped: 0
tx_xmit_more: 1266021946
tx_recover: 0
tx_cqes: 140031716469
tx_queue_wake: 232561
tx_udp_seg_rem: 0
tx_cqe_err: 0
tx_xdp_xmit: 0
tx_xdp_full: 0
tx_xdp_err: 0
tx_xdp_cqes: 0
rx_wqe_err: 0
rx_mpwqe_filler_cqes: 0
rx_mpwqe_filler_strides: 0
rx_buff_alloc_err: 0
rx_cqe_compress_blks: 0
rx_cqe_compress_pkts: 0
rx_page_reuse: 0
rx_cache_reuse: 16625975793
rx_cache_full: 54161465914
rx_cache_empty: 258048
rx_cache_busy: 54161472735
rx_cache_waive: 0
rx_congst_umr: 0
rx_arfs_err: 0
ch_events: 40572621887
ch_poll: 40885650979
ch_arm: 40429276692
ch_aff_change: 0
ch_eq_rearm: 0
rx_out_of_buffer: 2791690
rx_if_down_packets: 74
rx_vport_unicast_packets: 141843476308
rx_vport_unicast_bytes: 185421265403318
tx_vport_unicast_packets: 172569484005
tx_vport_unicast_bytes: 100019940094298
rx_vport_multicast_packets: 85122935
rx_vport_multicast_bytes: 5761316431
tx_vport_multicast_packets: 6452
tx_vport_multicast_bytes: 643540
rx_vport_broadcast_packets: 22423624
rx_vport_broadcast_bytes: 1390127090
tx_vport_broadcast_packets: 22024
tx_vport_broadcast_bytes: 1321440
rx_vport_rdma_unicast_packets: 0
rx_vport_rdma_unicast_bytes: 0
tx_vport_rdma_unicast_packets: 0
tx_vport_rdma_unicast_bytes: 0
rx_vport_rdma_multicast_packets: 0
rx_vport_rdma_multicast_bytes: 0
tx_vport_rdma_multicast_packets: 0
tx_vport_rdma_multicast_bytes: 0
tx_packets_phy: 172569501577
rx_packets_phy: 142871314588
rx_crc_errors_phy: 0
tx_bytes_phy: 100710212814151
rx_bytes_phy: 187209224289564
tx_multicast_phy: 6452
tx_broadcast_phy: 22024
rx_multicast_phy: 85122933
rx_broadcast_phy: 22423623
rx_in_range_len_errors_phy: 2
rx_out_of_range_len_phy: 0
rx_oversize_pkts_phy: 0
rx_symbol_err_phy: 0
tx_mac_control_phy: 0
rx_mac_control_phy: 0
rx_unsupported_op_phy: 0
rx_pause_ctrl_phy: 0
tx_pause_ctrl_phy: 0
rx_discards_phy: 920161423
Ok, this port seems to be suffering more; RX is congested, maybe due
to the pcie bottleneck.
Yes, this side is receiving more traffic - the second port is doing
+10G more tx.
[...]
Average:      17    0.00    0.00   16.60    0.00    0.00   52.10    0.00    0.00    0.00   31.30
Average:      18    0.00    0.00   13.90    0.00    0.00   61.20    0.00    0.00    0.00   24.90
Average:      19    0.00    0.00    9.99    0.00    0.00   70.33    0.00    0.00    0.00   19.68
Average:      20    0.00    0.00    9.00    0.00    0.00   73.00    0.00    0.00    0.00   18.00
Average:      21    0.00    0.00    8.70    0.00    0.00   73.90    0.00    0.00    0.00   17.40
Average:      22    0.00    0.00   15.42    0.00    0.00   58.56    0.00    0.00    0.00   26.03
Average:      23    0.00    0.00   10.81    0.00    0.00   71.67    0.00    0.00    0.00   17.52
Average:      24    0.00    0.00   10.00    0.00    0.00   71.80    0.00    0.00    0.00   18.20
Average:      25    0.00    0.00   11.19    0.00    0.00   71.13    0.00    0.00    0.00   17.68
Average:      26    0.00    0.00   11.00    0.00    0.00   70.80    0.00    0.00    0.00   18.20
Average:      27    0.00    0.00   10.01    0.00    0.00   69.57    0.00    0.00    0.00   20.42
The numa cores are not at 100% util; you have around 20% idle on
each one.
Yes - not 100% cpu - but the difference between 80% and 100% is like
pushing an additional 1-2Gbit/s.
Yes, but it doesn't look like the bottleneck is the cpu, although it
is close to being :)..
Average:      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      32    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      33    0.00    0.00    3.90    0.00    0.00    0.00    0.00    0.00    0.00   96.10
Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      36    0.10    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.70
Average:      37    0.20    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.50
Average:      38    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      39    0.00    0.00    2.60    0.00    0.00    0.00    0.00    0.00    0.00   97.40
Average:      40    0.00    0.00    0.90    0.00    0.00    0.00    0.00    0.00    0.00   99.10
Average:      41    0.10    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.40
Average:      42    0.00    0.00    9.91    0.00    0.00   70.67    0.00    0.00    0.00   19.42
Average:      43    0.00    0.00   15.90    0.00    0.00   57.50    0.00    0.00    0.00   26.60
Average:      44    0.00    0.00   12.20    0.00    0.00   66.20    0.00    0.00    0.00   21.60
Average:      45    0.00    0.00   12.00    0.00    0.00   67.50    0.00    0.00    0.00   20.50
Average:      46    0.00    0.00   12.90    0.00    0.00   65.50    0.00    0.00    0.00   21.60
Average:      47    0.00    0.00   14.59    0.00    0.00   60.84    0.00    0.00    0.00   24.58
Average:      48    0.00    0.00   13.59    0.00    0.00   61.74    0.00    0.00    0.00   24.68
Average:      49    0.00    0.00   18.36    0.00    0.00   53.29    0.00    0.00    0.00   28.34
Average:      50    0.00    0.00   15.32    0.00    0.00   58.86    0.00    0.00    0.00   25.83
Average:      51    0.00    0.00   17.60    0.00    0.00   55.20    0.00    0.00    0.00   27.20
Average:      52    0.00    0.00   15.92    0.00    0.00   56.06    0.00    0.00    0.00   28.03
Average:      53    0.00    0.00   13.00    0.00    0.00   62.30    0.00    0.00    0.00   24.70
Average:      54    0.00    0.00   13.20    0.00    0.00   61.50    0.00    0.00    0.00   25.30
Average:      55    0.00    0.00   14.59    0.00    0.00   58.64    0.00    0.00    0.00   26.77
ethtool -k enp175s0f0
Features for enp175s0f0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
ethtool -c enp175s0f0
Coalesce parameters for enp175s0f0:
Adaptive RX: off TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32703
rx-usecs: 256
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
ethtool -g enp175s0f0
Ring parameters for enp175s0f0:
Pre-set maximums:
RX: 8192
RX Mini: 0
RX Jumbo: 0
TX: 8192
Current hardware settings:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
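(If RX ring exhaustion were suspected, the rings could also be raised
to the pre-set maximum - a possible knob to try, not something tested
here:

ethtool -G enp175s0f0 rx 8192 tx 8192
)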
I also changed the coalesce params a little - the best for this
config are:
ethtool -c enp175s0f0
Coalesce parameters for enp175s0f0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32573
rx-usecs: 40
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 8
tx-frames: 8
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
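(Presumably set with something along these lines - the exact
invocation wasn't posted:

ethtool -C enp175s0f0 adaptive-rx off adaptive-tx off rx-usecs 40 rx-frames 128 tx-usecs 8 tx-frames 8
)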
Fewer drops on the RX side - and more pps forwarded overall.
How much improvement? Maybe we can improve our adaptive rx coalescing
to be efficient for this workload.