Hi,

I'd like to get some feedback on an issue that has popped up on newer systems (under increased load).
The system has an older CPU (Atom) with an integrated MAC. When flooding the NIC with multicast traffic (and multiple listeners), we get the following:

-----
Aug 16 01:21:55 dss kernel: [ 1357.210634] NETDEV WATCHDOG: eth0 (pch_gbe): transmit queue 0 timed out
Aug 16 01:21:55 dss kernel: [ 1357.210680] WARNING: CPU: 1 PID: 1187 at net/sched/sch_generic.c:466 dev_watchdog+0x1b6/0x1c0
Aug 16 01:21:55 dss kernel: [ 1357.210683] Modules linked in: 8021q garp stp mrp llc rfkill nft_chain_nat_ipv4 nf_nat_ipv4 xt_REDIRECT nf_nat nf_log_ipv4 nf_log_common nft_counter xt_LOG i2c_dev ie6xx_wdt lpc_sch xt_multiport i2c_i801 xt_pkttype xt_recent xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c xt_tcpudp nft_compat nf_tables nfnetlink coretemp kvm irqbypass serio_raw pcspkr gma500_gfx pch_can can_dev drm_kms_helper drm pch_uart sg pch_dma pch_udc i2c_algo_bit udc_core fb_sys_fops syscopyarea pch_phub sysfillrect evdev sysimgblt video pcc_cpufreq button acpi_cpufreq ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd aes_i586 aufs(OE) sd_mod i2c_isch psmouse mfd_core e1000e spi_topcliff_pch ahci ohci_pci libahci ohci_hcd ehci_pci libata ehci_hcd sdhci_pci
Aug 16 01:21:55 dss kernel: [ 1357.210802] usbcore cqhci pch_gbe sdhci scsi_mod ptp_pch mmc_core mii ptp pps_core gpio_pch usb_common [last unloaded: lpc_sch]
Aug 16 01:21:55 dss kernel: [ 1357.210831] CPU: 1 PID: 1187 Comm: mysqld Tainted: G OE 4.19.0-9-686 #1 Debian 4.19.118-2+deb10u1
Aug 16 01:21:55 dss kernel: [ 1357.210835] Hardware name: EKF Elektronik GmbH PC2-LIMBO/PC2-LIMBO, BIOS 094 2017-02-01
Aug 16 01:21:55 dss kernel: [ 1357.210844] EIP: dev_watchdog+0x1b6/0x1c0
Aug 16 01:21:55 dss kernel: [ 1357.210853] Code: 8b 50 3c 89 f8 e8 ca cd 10 00 8b 7e f0 eb a3 89 f8 c6 05 eb 4e 90 d7 01 e8 b7 dc fc ff 53 50 57 68 44 f7 82 d7 e8 4e ee ae ff <0f> 0b 83 c4 10 eb c9 8d 76 00 3e 8d 74 26 00 55 89 e5 57 56 89 d6
Aug 16 01:21:55 dss kernel: [ 1357.210859] EAX: 0000003b EBX: 00000000 ECX: f473ccac EDX: 00000007
Aug 16 01:21:55 dss kernel: [ 1357.210864] ESI: f41fc2e8 EDI: f41fc000 EBP: f417df68 ESP: f417df40
Aug 16 01:21:55 dss kernel: [ 1357.210871] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010292
Aug 16 01:21:55 dss kernel: [ 1357.210876] CR0: 80050033 CR2: b78e1010 CR3: 1bbd7000 CR4: 000006d0
Aug 16 01:21:55 dss kernel: [ 1357.210880] Call Trace:
Aug 16 01:21:55 dss kernel: [ 1357.210887] <SOFTIRQ>
Aug 16 01:21:55 dss kernel: [ 1357.210903] ? pfifo_fast_enqueue+0xf0/0xf0
Aug 16 01:21:55 dss kernel: [ 1357.210913] call_timer_fn+0x2f/0x130
Aug 16 01:21:55 dss kernel: [ 1357.210921] ? pfifo_fast_enqueue+0xf0/0xf0
Aug 16 01:21:55 dss kernel: [ 1357.210930] run_timer_softirq+0x1bd/0x3f0
Aug 16 01:21:55 dss kernel: [ 1357.210944] __do_softirq+0xb2/0x275
Aug 16 01:21:55 dss kernel: [ 1357.210955] ? __softirqentry_text_start+0x8/0x8
Aug 16 01:21:55 dss kernel: [ 1357.210964] call_on_stack+0x12/0x50
Aug 16 01:21:55 dss kernel: [ 1357.210969] </SOFTIRQ>
Aug 16 01:21:55 dss kernel: [ 1357.210977] ? irq_exit+0xc5/0xd0
Aug 16 01:21:55 dss kernel: [ 1357.210986] ? smp_apic_timer_interrupt+0x6c/0x130
Aug 16 01:21:55 dss kernel: [ 1357.210996] ? apic_timer_interrupt+0xd5/0xdc
Aug 16 01:21:55 dss kernel: [ 1357.211007] ? nmi+0x8b/0x198
-----

It looks eerily similar to the issue reported on this mailing list 8 years ago:

https://www.spinics.net/lists/netdev/msg198234.html

where the locking was tweaked to compensate.

When I compare the different kernels (4.19.132, 5.8.7), the driver code has changed little. The locking, however, has changed a bit with respect to the patch that was confirmed to be a fix:

1. netif_tx_lock is used instead of spin_lock(&tx_ring->tx_lock);
2. locking has been removed in pch_gbe_xmit_frame

Is this again an issue with missing locks?

Since it has been quite some time since I did any kernel work, I thought it better to check first.

--
g. Marc