Hi,

I run Fedora 32, and since the 5.10 kernel series I have been unable to boot without getting a panic in the sfc module. I tried 5.11.12 tonight and the crash still occurs. I have tried reporting this via Fedora channels, but the silence has been deafening and I suspect this is an upstream issue anyway.

I'm running an Asus X570-Pro with a 3700X processor, 64GB ECC RAM and various NVMe/SATA disks. I have a dual-port Solarflare SFN6122F PCIe card installed that shows up in lspci as:

0b:00.0 Ethernet controller [0200]: Solarflare Communications SFC9020 10G Ethernet Controller [1924:0803]
0b:00.1 Ethernet controller [0200]: Solarflare Communications SFC9020 10G Ethernet Controller [1924:0803]

I have attached JPEGs of the crash to the Fedora bugzilla entry https://bugzilla.redhat.com/show_bug.cgi?id=1924982, but since I figure many here won't want to download a 2.5MB attachment from a slow bugzilla, I'll try to transcribe the relevant bits here:

BUG: kernel NULL pointer dereference, address: 0000000000000104
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: 0002 [#1] SMP NOPTI
CPU: 0 PID: 1067 Comm: rngd Not tainted 5.11.12-100.fc32.x86_64 #1
Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 3405 02/01/2021
RIP: 0010:efx_farch_ev_process+0x3d2/0x910 [sfc]
Code: c0 02 39 f0 76 34 c1 fe 02 41 03 b6 28 07 00 00 83 e1 03 49 8b 84 f6 d0 00 00 00 48 8b 94 c8 80 09 00 00 b0 01 00 00 00 31 c9 <f0> 8f b1 8a 04 81 00 00 05 c0 0f 05 37 03 00 00 48 8d 74 24 20 4c
RSP: 0000:ffff9e04c0003e78 EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff89548a9b5000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8954832c0940
RBP: 000000000000001e R08: ffff9e04c0003f50 R09: ffff89636ea2b140
R10: 0000000000000000 R11: ffffffffb9a060c0 R12: 00000000000000
R13: ffff8954832c0940 R14: ffff8954832c0940 R15: ffff89548a9b5480
FS: 00007ff835b31700(0000) GS:ffff89636ea00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000000050833
CR2: 0000000000000104 CR3: 000000011c41a000 CR4: 0000000000350ef0
Call Trace:
<IRQ>
? trigger_load_balance+0x5a/0x220
efx_poll+0xcb/0x380 [sfc]
net_rx_action+0x136/0x400
__do_softirq+0xcf/0x20f
asm_call_irq_on_stack+0x12/0x20
</IRQ>
do_softirq_own_stack+0x37/0x40
__irq_exit_rcu+0xbf/0x100
common_interrupt+0x74/0x130
? asm_common_interrupt+0x8/0x40
asm_common_interrupt+0x1e/0x40
RIP: 0033:0x7ff836732b00

I won't guarantee there are no typos in that lot, since the picture is a bit fuzzy and so are my eyes after all that. You can find the original on the bugzilla entry referenced above.

No problems on 5.9.16, which is the last pre-5.10 kernel available for F32. Every kernel I've tried from 5.10 onwards goes pop.

In case it helps, this is what sfcboot reports for one of the ports (the other is identical):

enp11s0f0np0:
  Boot image                            Option ROM only
    Link speed                          Negotiated automatically
    Link-up delay time                  5 seconds
    Banner delay time                   2 seconds
    Boot skip delay time                5 seconds
    Boot type                           Disabled
  PF MSI-X interrupt limit              512
  SR-IOV                                Disabled
  Virtual Functions on each PF          127
  VF MSI-X interrupt limit              1

and sfupdate:

enp11s0f0np0 - MAC: 00-0x-xx-0x-xx-xx (intentionally obscured)
    Firmware version:   v7.6.9
    Controller type:    Solarflare SFC9000 family
    Controller version: v3.3.2.1000
    Boot ROM version:   v5.2.2.1004

Just before the crash I get a pair of messages that don't look quite right, but I get these on 5.9.16 too and that kernel survives.

[    9.027961] sfc 0000:0b:00.0 enp11s0f0np0: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0
[    9.029895] sfc 0000:0b:00.1 enp11s0f1np1: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0

I'm not subscribed to the list, so I'd be grateful for a cc on any replies. If I'm on entirely the wrong mailing list, feel free to let me know that too! I can supply any further information that would help get this fixed.

Thanks,

Trevor
