06.01.2018, 00:20, "Tobias Hommel" <netdev-l...@genoetigt.de>:
> Hi,

Hi Tobias,

> I'm running into a NULL pointer dereference after updating from Linux 4.1.6 to
> 4.14.11 (see kernel log below). I tried 4.14.3 initially which did not work
> either.
> Anyone has an idea what is happening here?
>
> The affected machine has 2 active ethernet interfaces (igb driver) and acts as
> a VPN gateway running strongswan. There are several hundreds of IPSec
> roadwarriors connecting to eth1. eth0 connects to an infrastructure running an
> HTTP server.
> During my tests these roadwarriors connect to the gateway, sometimes download a
> large file from the HTTP server, disconnect and after a random delay repeat
> these steps.
>
> Some observations I made:
> * SMP Affinity for IRQs of the NICs Rx/Tx queues (/proc/irq/$IRQ/smp_affinity)
>   * all affinities set to default ff is broken
>   * setting affinity for all queues of both interfaces to the same CPU seems to
>     work fine (running stable for more than 1 day now)
>   * setting affinity of eth0 queues to CPU 1 and affinity of eth1 queues to CPU
>     2 is broken and seems to always trigger the bug on CPU 1
> * the top 6 entries of the call trace are the same every time the system
>   crashes, the other entries differ sometimes
>
> The bug is 100% reproducible on the Intel Atom machine from the log below and
> also on a HP ProLiant Gen6 (also igb driver).
> I can, of course, provide further information (CPU, NIC, kernel config, more
> traces, etc.) if required.
> If helpful I could also run tests on HP ProLiant Gen9 which has different NICs
> (tg3).
>
> [ 7998.489094] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [ 7998.496993] IP: xfrm_lookup+0x2a/0x7e0
> [ 7998.500759] PGD 0 P4D 0
> [ 7998.503316] Oops: 0000 [#1] SMP PTI
> [ 7998.506835] Modules linked in:
> [ 7998.509929] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.14.11 #3
> [ 7998.516244] Hardware name: To be filled by O.E.M. CAR-2051/CAR, BIOS 1.01 07/11/2016
> [ 7998.524039] task: ffff8826bb118000 task.stack: ffff947ac00f0000
> [ 7998.530004] RIP: 0010:xfrm_lookup+0x2a/0x7e0
> [ 7998.534298] RSP: 0018:ffff947ac00f3b60 EFLAGS: 00010246
> [ 7998.539550] RAX: 0000000000000000 RBX: ffffffff93074040 RCX: 0000000000000000
> [ 7998.546709] RDX: ffff947ac00f3bd8 RSI: 0000000000000000 RDI: ffffffff93074040
> [ 7998.553868] RBP: ffffffff93074040 R08: 0000000000000002 R09: 0000000000000001
> [ 7998.561026] R10: 0000000000000032 R11: 0000000000000000 R12: ffff947ac00f3bd8
> [ 7998.568212] R13: 0000000000000000 R14: 0000000000000002 R15: ffff8826b69a8078
> [ 7998.575395] FS: 0000000000000000(0000) GS:ffff8826bfc80000(0000) knlGS:0000000000000000
> [ 7998.583550] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 7998.589324] CR2: 0000000000000020 CR3: 00000001781da000 CR4: 00000000001006e0
> [ 7998.596482] Call Trace:
> [ 7998.598959]  __xfrm_route_forward+0xa4/0x110
> [ 7998.603263]  ip_forward+0x3e0/0x450
> [ 7998.606778]  ? ip_rcv_finish+0x61/0x3a0
> [ 7998.610645]  ip_rcv+0x2c4/0x390
> [ 7998.613818]  ? inet_del_offload+0x30/0x30
> [ 7998.617857]  __netif_receive_skb_core+0x751/0xb00
> [ 7998.622562]  ? skb_send_sock+0x40/0x40
> [ 7998.626356]  ? netif_receive_skb_internal+0x47/0xf0
> [ 7998.631252]  netif_receive_skb_internal+0x47/0xf0
> [ 7998.635987]  napi_gro_receive+0x70/0x90
> [ 7998.639835]  gro_cell_poll+0x53/0x90
> [ 7998.643439]  net_rx_action+0x1fc/0x310
> [ 7998.647210]  ? rebalance_domains+0x101/0x2b0
> [ 7998.651500]  __do_softirq+0xd5/0x1cf
> [ 7998.655105]  run_ksoftirqd+0x14/0x30
> [ 7998.658712]  smpboot_thread_fn+0xf9/0x150
> [ 7998.662723]  kthread+0xef/0x130
> [ 7998.665893]  ? sort_range+0x20/0x20
> [ 7998.669404]  ? kthread_park+0x60/0x60
> [ 7998.673098]  ret_from_fork+0x1f/0x30
> [ 7998.676674] Code: 00 41 57 41 56 45 89 c6 41 55 41 54 49 89 f5 55 53 49 89 d4 48 89 fb 48 83 ec 40 65 48 8b 04 25 28 00 00 00 48 89 44 24 38 31 c0 <48> 8b 46 20 48 85 c9 44 0f b7 38 c7 44 24 0c 00 00 00 00 0f 84
> [ 7998.695681] RIP: xfrm_lookup+0x2a/0x7e0 RSP: ffff947ac00f3b60
> [ 7998.701479] CR2: 0000000000000020
> [ 7998.704799] ---[ end trace 0544b1946919baad ]---
> [ 7998.709442] Kernel panic - not syncing: Fatal exception in interrupt
> [ 7998.715918] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

This error doesn't look like it comes from the latest kernel version itself;
I think the problem is in the NIC driver.
Which Ethernet card model are you using, and which driver version?

> Best regards,
>
> Tobias Hommel

Ozgur
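
A note on the affinity values in the quoted report: /proc/irq/$IRQ/smp_affinity takes a hexadecimal CPU bitmask, so the default "ff" means CPUs 0 to 7. The sketch below only converts between CPU lists and mask strings; the function names are made up for illustration, and it assumes CPUs are numbered from 0 (the report does not say which numbering it uses).

```python
def cpus_to_affinity_mask(cpus):
    """Return the hex bitmask string for a collection of 0-based CPU indices."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu  # bit N set means CPU N may handle the IRQ
    return format(mask, "x")


def affinity_mask_to_cpus(mask_hex):
    """Return the sorted list of CPU indices encoded in a hex bitmask string."""
    # Large systems print the mask in comma-separated 32-bit groups.
    mask = int(mask_hex.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


if __name__ == "__main__":
    print(affinity_mask_to_cpus("ff"))   # CPUs allowed by the default mask
    print(cpus_to_affinity_mask([1]))    # mask pinning a queue to CPU 1
    print(cpus_to_affinity_mask([2]))    # mask pinning a queue to CPU 2
```

Writing the resulting string to /proc/irq/$IRQ/smp_affinity as root applies it to that IRQ; the sketch deliberately does not do this itself.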