On 16.03.2019 18:08, Alexander Duyck wrote: > On Sat, Mar 16, 2019 at 8:12 AM Heiner Kallweit <hkallwe...@gmail.com> wrote: >> >> On 16.03.2019 15:38, VDR User wrote: >>>> Part of the issue though is that we don't know how reliable that test >>>> was. I believe Derek he hasn't had any crashes, but he wasn't confident >>>> that it had actually resolved the issue. >>> >>> Previously I thought I could easily & consistently reproduce the crash >>> but the more testing I did, the more I realized that wasn't the case. >>> That's why my confidence was low in that reversing commit 5317d5c6d47e >>> ("r8169: use napi_consume_skb where possible") fixed it. I felt like I >>> needed to do a lot more testing over the weekend to be sure. But, I >>> can now confirm that reversing that commit did not solve the problem. >>> I didn't ifdown/ifup after the crash so the nic eventually recovered >>> on its own I guess. The `ethtool -S` output is: >>> >>> NIC statistics: >>> tx_packets: 5370650 >>> rx_packets: 57340787 >>> tx_errors: 0 >>> rx_errors: 0 >>> rx_missed: 26 >>> align_errors: 0 >>> tx_single_collisions: 0 >>> tx_multi_collisions: 0 >>> unicast: 57332905 >>> broadcast: 6409 >>> multicast: 1473 >>> tx_aborted: 0 >>> tx_underrun: 0 >>> >>> The dmesg log looks the same as before: >>> >>> [95579.984062] ------------[ cut here ]------------ >>> [95579.984142] NETDEV WATCHDOG: enp4s0 (r8169): transmit queue 0 timed out >>> [95579.984224] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:461 >>> dev_watchdog+0x1bb/0x1e0 >>> [95579.984276] Modules linked in: snd_hda_codec_hdmi >>> snd_hda_codec_realtek snd_hda_codec_generic ohci_pci snd_hda_intel >>> snd_hda_codec snd_hwdep xhci_pci ohci_hcd ehci_pci xhci_hcd ehci_hcd >>> usbcore snd_hda_core usb_common snd_pcm snd_timer snd soundcore nfsd >>> auth_rpcgss oid_registry lockd grace sunrpc ip_tables x_tables ipv6 >>> [95579.984354] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.0.0-amd #1 >>> [95579.984387] Hardware name: ECS A75F-A/A75F-A, BIOS 4.6.5 09/14/2011 >>> [95579.984422] EIP: dev_watchdog+0x1bb/0x1e0 >>> [95579.984454] Code: 8b 50 3c 89 f8 e8 3d aa 0a 00 8b 7e f4 eb a4 89 >>> f8 c6 05 e7 1c 6d c1 01 e8 72 4f fd ff 53 50 57 68 78 05 66 c1 e8 25 >>> ad ba ff <0f> 0b 83 c4 10 eb c9 eb 1c 8d b4 26 00 00 00 00 8d b4 26 00 >>> 00 00 >>> [95579.986189] EAX: 0000003b EBX: 00000000 ECX: 00000800 EDX: 00000103 >>> [95579.986224] ESI: f4cc2264 EDI: f4cc2000 EBP: f4c99f74 ESP: f4c99f4c >>> [95579.986259] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210296 >>> [95579.986292] CR0: 80050033 CR2: b7c644f0 CR3: 0dfd2000 CR4: 00000690 >>> [95579.986325] Call Trace: >>> [95579.986356] <SOFTIRQ> >>> [95579.986389] ? qdisc_put_unlocked+0x40/0x40 >>> [95579.986423] call_timer_fn+0x19/0xa0 >>> [95579.986456] run_timer_softirq+0x337/0x380 >>> [95579.986488] ? qdisc_put_unlocked+0x40/0x40 >>> [95579.986521] ? rcu_process_callbacks+0xcb/0x380 >>> [95579.986555] __do_softirq+0xd6/0x21c >>> [95579.986586] ? __irqentry_text_end+0x18/0x18 >>> [95579.986619] call_on_stack+0x10/0x60 >>> [95579.986646] </SOFTIRQ> >>> [95579.986674] ? irq_exit+0x91/0xc0 >>> [95579.986701] ? smp_apic_timer_interrupt+0x56/0xa0 >>> [95579.986731] ? apic_timer_interrupt+0xd5/0xdc >>> [95579.986761] ? acpi_idle_enter_s2idle+0x60/0x60 >>> [95579.986790] ? cpuidle_enter_state+0x122/0x360 >>> [95579.986821] ? cpuidle_enter+0xf/0x20 >>> [95579.986850] ? call_cpuidle+0x1c/0x40 >>> [95579.986878] ? do_idle+0x1e6/0x220 >>> [95579.986906] ? cpu_startup_entry+0x25/0x40 >>> [95579.986934] ? start_secondary+0x1a5/0x220 >>> [95579.986963] ? startup_32_smp+0x15f/0x164 >>> [95579.986991] ---[ end trace 2e8d77bb3c9d2d45 ]--- >>> >>> Please let me know if there's anything I can do to help. >>> Derek >>> >> The other changes between 4.20 and 5.0 don't look like they could cause >> the issue. And two critical ones have been reverted already. >> So what would need to be done is bisecting the issue. >> >> Heiner > > So I took a quick look through the git log. Have you tried taking a > look at the following commit as a possible culprit? > commit e782410ed2378a2ddac58d944c3cf0c6f96b1ff3 > Author: Heiner Kallweit <hkallwe...@gmail.com> > Date: Sat Dec 15 16:25:05 2018 +0100 > > r8169: improve spurious interrupt detection > > Improve detection of spurious interrupts by checking against the > interrupt mask as currently set in the chip. > > Signed-off-by: Heiner Kallweit <hkallwe...@gmail.com> > Signed-off-by: David S. Miller <da...@davemloft.net> > > This patch seems problematic for a few different reasons. The first > being that adding a register read to the interrupt handler will > actually increase the latency for your interrupt handler as a MMIO > read is not a fast operation compared to a memory read. You already > had to read status, and adding a read for the interrupt mask will just > double the overhead as the reads have to be performed sequentially. > Secondly, are we guaranteed that the read of the register will always > be reliable? I know I have seen plenty of hardware issues in the past > where reading a register in the middle of an event can sometimes lead > to the value read being incorrect. > > The only other patch in the set between 4.20 and 5.0 that catches my eye is: > commit 76085c9e7eaebf0caa2476f77380efc3d55b050c > Author: Heiner Kallweit <hkallwe...@gmail.com> > Date: Thu Nov 22 22:03:08 2018 +0100 > > r8169: replace macro TX_FRAGS_READY_FOR with a function > > Replace macro TX_FRAGS_READY_FOR with function rtl_tx_slots_avail > to make code cleaner and type-safe. > > Signed-off-by: Heiner Kallweit <hkallwe...@gmail.com> > Signed-off-by: David S. Miller <da...@davemloft.net> > > However I don't see anything that should be wrong there that is > causing this, but instrumenting the points where we are taking action > based on that with a trace_printk probably wouldn't hurt as it would > help us to verify that we are stopping things in xmit and not getting > around to waking them up for whatever reason. Also a minor > optimization you could do on the patch is to replace the call to > netif_wake_queue in rtl8169_start_xmit with netif_start_queue as you > don't need to actually wake up the queue since you are still in mid > transmit so you just need to reset the bit so it doesn't prevent the > next transmit. > > Hope that helps. > Definitely. Thanks a lot for the valuable feedback.
> - Alex > Heiner