On Jan 31, 2019 at 19:28:20 +0100, Heiner Kallweit wrote:
> Thanks for testing, Peter!
> So we have an ASPM-related issue indeed. I'm aware that there are certain
> incompatibilities between board chipsets and network chip versions
> (although it's not known which combinations are affected).
> And we don't know whether it's a hardware or BIOS issue.
>
> Older driver versions dealt with this by simply disabling ASPM in general.
> As a result all systems with a supported Realtek chip didn't reach higher
> package power-saving states, resulting in significantly reduced battery
> lifetime on notebooks.
> The network driver has no stake in dealing with the ASPM policies, this
> is handled by lower PCI layers.
>
> Unfortunately we can't detect ASPM incompatibilities at runtime. Maybe
> we could build some heuristics based on rx_missed percentage, but it's
> not clear that ASPM issues always show the same symptoms.
>
> So for now people with affected systems have to set a proper
> pcie_aspm.policy parameter.
> Just what is not clear to me is why pcie_aspm=off doesn't help.
>
> @David:
> I assume you'll check with the affected user to test the ASPM policy
> parameter.
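[Editorial sketch, not part of the original mail.] The rx_missed heuristic mentioned above could look roughly like the following. This is only an illustration: the function names are hypothetical, and treating rx_packets + rx_missed as the denominator is an assumption about how the counters relate. It is a user-space calculation on `ethtool -S` text, not driver code. As noted in the quoted mail, a clean counter does not rule out an ASPM problem.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S <if>` output into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip the "NIC statistics:" header and any non-numeric lines.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def rx_missed_percent(stats):
    """Return rx_missed as a percentage of received plus missed frames.

    Assumption: rx_packets does not already include missed frames.
    """
    missed = stats.get("rx_missed", 0)
    total = stats.get("rx_packets", 0) + missed
    return 100.0 * missed / total if total else 0.0


# Counter values taken from the ethtool output posted in this thread.
SAMPLE = """\
NIC statistics:
     tx_packets: 2749
     rx_packets: 4089
     rx_missed: 0
"""

stats = parse_ethtool_stats(SAMPLE)
print(rx_missed_percent(stats))
```

With the counters Peter posted, rx_missed is 0 even while the problem is visible, which illustrates why such a heuristic would miss at least some affected systems.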
Unfortunately, we did not have any performance improvement when using
both kernel parameters.

@Peter, thanks for the information.

regards,
David

>
> Heiner
>
>
> On 31.01.2019 13:09, Peter Ceiley wrote:
> > Hi Heiner,
> >
> > A quick update on my testing with different pcie_aspm settings:
> >
> > pcie_aspm=off                    | no change
> > pcie_aspm.policy=default         | no change
> > pcie_aspm.policy=performance     | issue resolved
> > pcie_aspm.policy=powersave       | issue resolved
> > pcie_aspm.policy=powersupersave  | issue resolved
> >
> > It seems the new driver does not play nicely with the default ASPM policy.
> >
> > As requested, I've included an output of ethtool below when experiencing
> > the issue - note that no errors are recorded.
> >
> > # ethtool -S enp3s0
> > NIC statistics:
> >      tx_packets: 2749
> >      rx_packets: 4089
> >      tx_errors: 0
> >      rx_errors: 0
> >      rx_missed: 0
> >      align_errors: 0
> >      tx_single_collisions: 0
> >      tx_multi_collisions: 0
> >      unicast: 4078
> >      broadcast: 9
> >      multicast: 2
> >      tx_aborted: 0
> >      tx_underrun: 0
> >
> > David, I hope this helps for your user as well. I appreciate you sharing
> > the bug ticket - thanks.
> >
> > Heiner, thanks very much for your help to date.
> >
> > Regards,
> >
> > Peter.
> >
> > On Thu, 31 Jan 2019 at 18:23, David Chang <dch...@suse.com> wrote:
> >>
> >> Hi Heiner,
> >>
> >> On Jan 31, 2019 at 07:35:30 +0100, Heiner Kallweit wrote:
> >>> Hi David, two more things:
> >>>
> >>> 1. Could you please test a recent linux-next kernel?
> >>> 2. Please get a register dump (ethtool -d <if>) from 4.18 and 4.19
> >>>    and compare them.
> >>
> >> I'm sorry that I do not have the issue machine handy. I would ask
> >> our user to do the test. Thanks!
> >>
> >> Regards,
> >> David
> >>
> >>>
> >>> Heiner
> >>>
> >>>
> >>> On 31.01.2019 07:21, Heiner Kallweit wrote:
> >>>> David, thanks for the link to the bug ticket.
> >>>> I think only a proper bisect can help to find the offending commit.
> >>>>
> >>>> Heiner
> >>>>
> >>>>
> >>>> On 31.01.2019 03:32, David Chang wrote:
> >>>>> Hi,
> >>>>>
> >>>>> We had a similar case here.
> >>>>> - Realtek r8169 receive performance regression in kernel 4.19
> >>>>>   https://bugzilla.suse.com/show_bug.cgi?id=1119649
> >>>>>
> >>>>> kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, XID 54100880
> >>>>> The major symptom is a high rx_missed count.
> >>>>>
> >>>>>
> >>>>> On Jan 30, 2019 at 20:15:45 +0100, Heiner Kallweit wrote:
> >>>>>> Hi Peter,
> >>>>>>
> >>>>>> recently I had somebody where pcie_aspm=off for whatever reason didn't
> >>>>>> do the trick, can you also check with pcie_aspm.policy=performance.
> >>>>>
> >>>>> We will give it a try later.
> >>>>>
> >>>>>> And please check with "ethtool -S <if>" whether the chip statistics
> >>>>>> show a significant number of errors.
> >>>>>>
> >>>>>> If this doesn't help you may have to bisect to find the offending
> >>>>>> commit.
> >>>>>
> >>>>> We tried falling back the driver to a few previous commits, as
> >>>>> follows, but with no luck.
> >>>>>
> >>>>> 9675931e6b65 r8169: re-enable MSI-X on RTL8168g (v4.19)
> >>>>> 098b01ad9837 r8169: don't include asm headers directly (v4.19-rc1)
> >>>>> a2965f12fde6 r8169: remove rtl8169_set_speed_xmii (v4.19-rc1)
> >>>>> 6fcf9b1d4d6c r8169: fix runtime suspend (v4.19-rc1)
> >>>>> e397286b8e89 r8169: remove TBI 1000BaseX support (v4.19-rc1)
> >>>>>
> >>>>> Thanks,
> >>>>> David Chang
> >>>>>
> >>>>>>
> >>>>>> Heiner
> >>>>>>
> >>>>>>
> >>>>>> On 30.01.2019 10:59, Peter Ceiley wrote:
> >>>>>>> Hi Heiner,
> >>>>>>>
> >>>>>>> I tried disabling the ASPM using the pcie_aspm=off kernel parameter
> >>>>>>> and this made no difference.
> >>>>>>>
> >>>>>>> I tried compiling the 4.18.16 r8169.c with the 4.19.18 source and
> >>>>>>> subsequently loaded the module in the running 4.19.18 kernel. I can
> >>>>>>> confirm that this immediately resolved the issue and access to the NFS
> >>>>>>> shares operated as expected.
> >>>>>>>
> >>>>>>> I presume this means it is an issue with the r8169 driver included in
> >>>>>>> 4.19 onwards?
> >>>>>>>
> >>>>>>> To answer your last questions:
> >>>>>>>
> >>>>>>> Base Board Information
> >>>>>>>     Manufacturer: Alienware
> >>>>>>>     Product Name: 0PGRP5
> >>>>>>>     Version: A02
> >>>>>>>
> >>>>>>> ... and yes, the RTL8168 is the onboard network chip.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>>
> >>>>>>> Peter.
> >>>>>>>
> >>>>>>> On Tue, 29 Jan 2019 at 17:44, Heiner Kallweit <hkallwe...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Peter,
> >>>>>>>>
> >>>>>>>> I think the vendor driver doesn't enable ASPM per default.
> >>>>>>>> So it's worth a try to disable ASPM in the BIOS or via sysfs.
> >>>>>>>> Few older systems seem to have issues with ASPM, what kind of
> >>>>>>>> system / mainboard are you using? The RTL8168 is the onboard
> >>>>>>>> network chip?
> >>>>>>>>
> >>>>>>>> Rgds, Heiner
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 29.01.2019 07:20, Peter Ceiley wrote:
> >>>>>>>>> Hi Heiner,
> >>>>>>>>>
> >>>>>>>>> Thanks, I'll do some more testing. It might not be the driver - I
> >>>>>>>>> assumed it was due to the fact that using the r8168 driver
> >>>>>>>>> 'resolves' the issue. I'll see if I can test the r8169.c on top
> >>>>>>>>> of 4.19 - this is a good idea.
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>>
> >>>>>>>>> Peter.
> >>>>>>>>>
> >>>>>>>>> On Tue, 29 Jan 2019 at 17:16, Heiner Kallweit <hkallwe...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Peter,
> >>>>>>>>>>
> >>>>>>>>>> at a first glance it doesn't look like a typical driver issue.
> >>>>>>>>>> What you could do:
> >>>>>>>>>>
> >>>>>>>>>> - Test the r8169.c from 4.18 on top of 4.19.
> >>>>>>>>>>
> >>>>>>>>>> - Check whether disabling ASPM (/sys/module/pcie_aspm) has an
> >>>>>>>>>>   effect.
> >>>>>>>>>>
> >>>>>>>>>> - Bisect between 4.18 and 4.19 to find the offending commit.
> >>>>>>>>>>
> >>>>>>>>>> Any specific reason why you think root cause is in the driver and
> >>>>>>>>>> not elsewhere in the network subsystem?
> >>>>>>>>>>
> >>>>>>>>>> Heiner
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 28.01.2019 23:10, Peter Ceiley wrote:
> >>>>>>>>>>> Hi Heiner,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for getting back to me.
> >>>>>>>>>>>
> >>>>>>>>>>> No, I don't use jumbo packets.
> >>>>>>>>>>>
> >>>>>>>>>>> Bandwidth is *generally* good, and iperf results to my NAS provide
> >>>>>>>>>>> over 900 Mbits/s in both circumstances. The issue seems to appear
> >>>>>>>>>>> when establishing a connection and is most notable, for example,
> >>>>>>>>>>> on my mounted NFS shares where it takes seconds (up to tens of
> >>>>>>>>>>> seconds on larger directories) to list the contents of each
> >>>>>>>>>>> directory. Once a transfer begins on a file, I appear to get good
> >>>>>>>>>>> bandwidth.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm unsure of the best scientific data to provide you in order to
> >>>>>>>>>>> troubleshoot this issue.
> >>>>>>>>>>> Running the following
> >>>>>>>>>>>
> >>>>>>>>>>> netstat -s |grep retransmitted
> >>>>>>>>>>>
> >>>>>>>>>>> shows a steady increase in retransmitted segments each time I
> >>>>>>>>>>> list the contents of a remote directory, for example, running
> >>>>>>>>>>> 'ls' on a directory containing 345 media files did the following
> >>>>>>>>>>> using kernel 4.19.18:
> >>>>>>>>>>>
> >>>>>>>>>>> increased retransmitted segments by 21 and the 'time' command
> >>>>>>>>>>> showed the following:
> >>>>>>>>>>> real 0m19.867s
> >>>>>>>>>>> user 0m0.012s
> >>>>>>>>>>> sys 0m0.036s
> >>>>>>>>>>>
> >>>>>>>>>>> The same command shows no retransmitted segments running kernel
> >>>>>>>>>>> 4.18.16 and 'time' showed:
> >>>>>>>>>>> real 0m0.300s
> >>>>>>>>>>> user 0m0.004s
> >>>>>>>>>>> sys 0m0.007s
> >>>>>>>>>>>
> >>>>>>>>>>> ifconfig does not show any RX/TX errors nor dropped packets in
> >>>>>>>>>>> either case.
> >>>>>>>>>>>
> >>>>>>>>>>> dmesg XID:
> >>>>>>>>>>> [ 2.979984] r8169 0000:03:00.0 eth0: RTL8168g/8111g, f8:b1:56:fe:67:e0, XID 4c000800, IRQ 32
> >>>>>>>>>>>
> >>>>>>>>>>> # lspci -vv
> >>>>>>>>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
> >>>>>>>>>>>     Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> >>>>>>>>>>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
> >>>>>>>>>>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >>>>>>>>>>>     Latency: 0, Cache Line Size: 64 bytes
> >>>>>>>>>>>     Interrupt: pin A routed to IRQ 19
> >>>>>>>>>>>     Region 0: I/O ports at d000 [size=256]
> >>>>>>>>>>>     Region 2: Memory at f7b00000 (64-bit, non-prefetchable) [size=4K]
> >>>>>>>>>>>     Region 4: Memory at f2100000 (64-bit, prefetchable) [size=16K]
> >>>>>>>>>>>     Capabilities: [40] Power Management version 3
> >>>>>>>>>>>         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
> >>>>>>>>>>>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> >>>>>>>>>>>     Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>>>>>>>>>>         Address: 0000000000000000 Data: 0000
> >>>>>>>>>>>     Capabilities: [70] Express (v2) Endpoint, MSI 01
> >>>>>>>>>>>         DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
> >>>>>>>>>>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 10.000W
> >>>>>>>>>>>         DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
> >>>>>>>>>>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> >>>>>>>>>>>             MaxPayload 128 bytes, MaxReadReq 4096 bytes
> >>>>>>>>>>>         DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> >>>>>>>>>>>         LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
> >>>>>>>>>>>             ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
> >>>>>>>>>>>         LnkCtl: ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
> >>>>>>>>>>>             ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
> >>>>>>>>>>>         LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
> >>>>>>>>>>>             TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> >>>>>>>>>>>         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Via message/WAKE#
> >>>>>>>>>>>             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> >>>>>>>>>>>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
> >>>>>>>>>>>             AtomicOpsCtl: ReqEn-
> >>>>>>>>>>>         LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> >>>>>>>>>>>             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> >>>>>>>>>>>             Compliance De-emphasis: -6dB
> >>>>>>>>>>>         LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
> >>>>>>>>>>>             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> >>>>>>>>>>>     Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
> >>>>>>>>>>>         Vector table: BAR=4 offset=00000000
> >>>>>>>>>>>         PBA: BAR=4 offset=00000800
> >>>>>>>>>>>     Capabilities: [d0] Vital Product Data
> >>>>>>>>>>> pcilib: sysfs_read_vpd: read failed: Input/output error
> >>>>>>>>>>>         Not readable
> >>>>>>>>>>>     Capabilities: [100 v1] Advanced Error Reporting
> >>>>>>>>>>>         UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >>>>>>>>>>>         UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >>>>>>>>>>>         UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> >>>>>>>>>>>         CESta: RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
> >>>>>>>>>>>         CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> >>>>>>>>>>>         AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
> >>>>>>>>>>>             MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> >>>>>>>>>>>         HeaderLog: 00000000 00000000 00000000 00000000
> >>>>>>>>>>>     Capabilities: [140 v1] Virtual Channel
> >>>>>>>>>>>         Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
> >>>>>>>>>>>         Arb: Fixed- WRR32- WRR64- WRR128-
> >>>>>>>>>>>         Ctrl: ArbSelect=Fixed
> >>>>>>>>>>>         Status: InProgress-
> >>>>>>>>>>>         VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> >>>>>>>>>>>             Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> >>>>>>>>>>>             Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> >>>>>>>>>>>             Status: NegoPending- InProgress-
> >>>>>>>>>>>     Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
> >>>>>>>>>>>     Capabilities: [170 v1] Latency Tolerance Reporting
> >>>>>>>>>>>         Max snoop latency: 71680ns
> >>>>>>>>>>>         Max no snoop latency: 71680ns
> >>>>>>>>>>>     Kernel driver in use: r8169
> >>>>>>>>>>>     Kernel modules: r8169
> >>>>>>>>>>>
> >>>>>>>>>>> Please let me know if you have any other ideas in terms of
> >>>>>>>>>>> testing.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks!
> >>>>>>>>>>>
> >>>>>>>>>>> Peter.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, 29 Jan 2019 at 05:28, Heiner Kallweit <hkallwe...@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 28.01.2019 12:13, Peter Ceiley wrote:
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I have been experiencing very poor network performance since
> >>>>>>>>>>>>> Kernel 4.19 and I'm confident it's related to the r8169 driver.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I have no issue with kernel versions 4.18 and prior. I am
> >>>>>>>>>>>>> experiencing this issue in kernels 4.19 and 4.20 (currently
> >>>>>>>>>>>>> running/testing with 4.20.4 & 4.19.18).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If someone could guide me in the right direction, I'm happy to
> >>>>>>>>>>>>> help troubleshoot this issue. Note that I have been keeping an
> >>>>>>>>>>>>> eye on one issue related to loading of the PHY driver, however,
> >>>>>>>>>>>>> my symptoms differ in that I still have a network connection.
> >>>>>>>>>>>>> I have attempted to reload the driver on a running system, but
> >>>>>>>>>>>>> this does not improve the situation.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Using the proprietary r8168 driver returns my device to proper
> >>>>>>>>>>>>> working order.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> lshw shows:
> >>>>>>>>>>>>>     description: Ethernet interface
> >>>>>>>>>>>>>     product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> >>>>>>>>>>>>>     vendor: Realtek Semiconductor Co., Ltd.
> >>>>>>>>>>>>>     physical id: 0
> >>>>>>>>>>>>>     bus info: pci@0000:03:00.0
> >>>>>>>>>>>>>     logical name: enp3s0
> >>>>>>>>>>>>>     version: 0c
> >>>>>>>>>>>>>     serial:
> >>>>>>>>>>>>>     size: 1Gbit/s
> >>>>>>>>>>>>>     capacity: 1Gbit/s
> >>>>>>>>>>>>>     width: 64 bits
> >>>>>>>>>>>>>     clock: 33MHz
> >>>>>>>>>>>>>     capabilities: pm msi pciexpress msix vpd bus_master cap_list ethernet physical tp aui bnc mii fibre 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
> >>>>>>>>>>>>>     configuration: autonegotiation=on broadcast=yes driver=r8169 duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.25 latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
> >>>>>>>>>>>>>     resources: irq:19 ioport:d000(size=256) memory:f7b00000-f7b00fff memory:f2100000-f2103fff
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Kind Regards,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Peter.
> >>>>>>>>>>>>>
> >>>>>>>>>>>> Hi Peter,
> >>>>>>>>>>>>
> >>>>>>>>>>>> the description "poor network performance" is quite vague,
> >>>>>>>>>>>> therefore:
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Can you provide any measurements?
> >>>>>>>>>>>>   - iperf results before and after
> >>>>>>>>>>>>   - statistics about dropped packets (rx and/or tx)
> >>>>>>>>>>>> - Do you use jumbo packets?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Also helpful would be a "lspci -vv" output for the network card
> >>>>>>>>>>>> and the dmesg output line with the chip XID.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Heiner
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >
> >
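[Editorial sketch, not part of the original thread.] When repeating the pcie_aspm.policy tests reported above, it can help to confirm which policy is actually active after boot. The sketch below assumes the policy is exposed at /sys/module/pcie_aspm/parameters/policy with the active entry in square brackets; the function names are hypothetical and the path may be absent on kernels built without ASPM support.

```python
from pathlib import Path

# Assumed sysfs location of the ASPM policy module parameter.
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_aspm_policy(path=POLICY_PATH):
    """Read the active policy from sysfs; None if the file is unavailable."""
    try:
        return active_aspm_policy(path.read_text())
    except OSError:
        return None


# Example with a literal string in the assumed sysfs format.
print(active_aspm_policy("[default] performance powersave powersupersave"))
```

Calling `read_aspm_policy()` on a test machine before each run makes it easier to tie a result such as "issue resolved" to the policy that was really in effect.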