Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

Ronald Klop Mon, 07 Mar 2022 05:48:13 -0800

Dear Mark Johnston,

I did a binary search through the kernels and came to the conclusion that
https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
still works and
https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
panics.


I suspect your commit
https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a
as the cause.

Last panic:

panic: vm_fault failed: ffff00000046e708 error 1
cpuid = 1
time = 1646660058
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x44
data_abort() at data_abort+0x2e8
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0x96000004
_rm_rlock_debug() at _rm_rlock_debug+0x8c
osd_get() at osd_get+0x5c
zio_execute() at zio_execute+0xf8
taskqueue_run_locked() at taskqueue_run_locked+0x178
taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 0 tid 100129 ]
Stopped at      kdb_enter+0x44: undefined       f902011f
db>

A more recent kernel (912df91) still panics. See below.

Do you have time to look into this? What information can I provide to help?

Regards,
Ronald.


From: Ronald Klop <ronald-li...@klop.ws>
Date: Monday, 7 March 2022 11:38
To: Mark Millard <mark...@yahoo.com>
CC: bob prohaska <f...@www.zefox.net>, freebsd-current
<freebsd-current@freebsd.org>, freebsd-...@freebsd.org
Subject: Re: panic: data abort in critical section or under mutex (was: Re:
panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb
28))


Yes, I spoke too soon as well. It often crashes as soon as I start a parallel
poudriere build, but this time it got very far. When the nightly backups
kicked in, it was game over again.
I had read Bob's mail on the arm@ mailing list, but I wanted to leave the
conclusion that it is the same problem to the developers. (I have seen enough
wrong guesses about causes in my work.)

I will need to take the binary search through the kernels further.

This was: FreeBSD 14.0-CURRENT #0 912df91: Wed Mar  2 00:36:35 UTC 2022

Fatal data abort:
x0: ffff000000f1efd8 x0: ffff000000f1efd8 (mac_policy_rm + 0) (mac_policy_rm + 0)
x1: 2 x1: 2
x2: ffff00000087dcf2 x2: ffff00000087dcf2 (cam_status_table + 2f28a) (cam_status_table + 2f28a)
x3: ffff00000087dcf2 x3: ffff00000087dcf2 (cam_status_table + 2f28a) (cam_status_table + 2f28a)
x4: 102 x4: 102
x5: 7 x5: 1
x6: 0 x6: ff
x7: 0 x7: ffffa00011fc2800
x8: 1 x8: 1
x9: ffff000000f37c10 x9: ffff0000419d9090 (pcpu0 + 90) (g_ctx + 40278fe4)
x10: ffffa0017be2a600 x10: ffffa000010fa600
x11: 394aed08d0003a48 x12: 350001a8b946a108 x11: 0 x12: ffff000000f37c10 x13: badecce4 (pcpu0 + 90) x13: ffffa0001fbde6b0
x14: 0 x14: 4965ae49
x15: 1 x15: 1000193
x16: ffff0000016a4238 x16: ffff000100482d38 (__stop_set_modmetadata_set + d00) (__stop_set_modmetadata_set + 448)
x17: ffff00000044a998 x17: ffff00000058ff30 (free + 0) (if_inc_counter + 0)
x18: ffff0000b49a23c0 x18: ffff000103f11b80 (g_ctx + b3242314) (next_index + 3a228c0)
x19: 102 x19: 102
x20: ffff0000b49a2428 x20: ffff000103f11be8 (g_ctx + b324237c) (next_index + 3a22928)
x21: ffff00000087dcf2 x21: ffff00000087dcf2 (cam_status_table + 2f28a) (cam_status_table + 2f28a)
x22: ffff000000f1efd8 x22: ffff000000f1efd8 (mac_policy_rm + 0) (mac_policy_rm + 0)
x23: ffff00000086f107 x23:                0 (cam_status_table + 2069f)
x24: ffffa0001fbde6c8 x24: ffffa0008cba0d00
x25:                0
x25: ffff00000088aa11 x26:                4 (do_execve.fexecv_proc_title + 76b7)
x27:                0 x26: ffffa0017be2a600
x28: ffff00010209fcf0
x27: ffffa00025626a80 (next_index + 1bb0a30)
x28: ffff000103f11ce0 x29: ffff0000b49a23e0 (next_index + 3a22a20) (g_ctx + b3242334)
x29: ffff000103f11ba0  sp: ffff0000b49a23c0
(next_index + 3a228e0)  lr: ffff00000046ef98
sp: ffff000103f11b80
(_rm_runlock_debug + 60)  lr: ffff00000046ef98
elr: ffff00000046dc0c (_rm_runlock_debug + 60) (_rm_assert + a4)
elr: ffff00000046dc0c spsr:               45
(_rm_assert + a4) far:               10
esr:         96000004
spsr:               45

panic: data abort in critical section or under mutex
cpuid = 1
time = 1646609483
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x44
data_abort() at data_abort+0x2d4
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0x96000004
_rm_assert() at _rm_assert+0xa4
_rm_runlock_debug() at _rm_runlock_debug+0x5c
mac_inpcb_check_deliver() at mac_inpcb_check_deliver+0x74
tcp_input_with_port() at tcp_input_with_port+0xab4
tcp_input() at tcp_input+0xc
ip_input() at ip_input+0x2e8
netisr_dispatch_src() at netisr_dispatch_src+0xe4
ether_demux() at ether_demux+0x178
ether_nh_input() at ether_nh_input+0x3e8
netisr_dispatch_src() at netisr_dispatch_src+0xe4
ether_input() at ether_input+0x80
if_input() at if_input+0xc
gen_intr() at gen_intr+0x444
ithread_loop() at ithread_loop+0x2a0
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 12 tid 100063 ]
Stopped at      kdb_enter+0x44: undefined       f902011f
db>
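
For anyone reading these traces: the esr value is the AArch64 Exception Syndrome Register. Below is a minimal C sketch of splitting it into fields, assuming the ESR_ELx layout from the Arm Architecture Reference Manual (EC in bits [31:26], IL in bit [25], ISS in bits [24:0]; for a data abort, WnR in ISS bit 6 and DFSC in ISS bits [5:0]). The function names are hypothetical, not FreeBSD's, and only the exception classes that show up in this thread are named.

```c
#include <stdint.h>

/* Hypothetical helpers (not FreeBSD kernel functions); field positions
 * assume the ESR_ELx layout in the Arm Architecture Reference Manual. */

/* Exception Class: bits [31:26]. */
unsigned esr_ec(uint32_t esr) { return (esr >> 26) & 0x3fu; }

/* Instruction Length: bit [25] (1 = 32-bit instruction). */
unsigned esr_il(uint32_t esr) { return (esr >> 25) & 0x1u; }

/* Instruction Specific Syndrome: bits [24:0]. */
uint32_t esr_iss(uint32_t esr) { return esr & 0x1ffffffu; }

/* Names only for the exception classes seen in this thread. */
const char *esr_ec_name(unsigned ec)
{
    switch (ec) {
    case 0x00: return "unknown reason";
    case 0x15: return "SVC instruction (AArch64)";
    case 0x25: return "data abort, same exception level";
    case 0x3c: return "BRK instruction (AArch64)";
    default:   return "other (not decoded here)";
    }
}

/* Data abort ISS: WnR is bit 6 (1 = write, 0 = read). */
unsigned dabort_wnr(uint32_t esr) { return (esr_iss(esr) >> 6) & 0x1u; }

/* Data abort ISS: Data Fault Status Code is bits [5:0]. */
unsigned dabort_dfsc(uint32_t esr) { return esr_iss(esr) & 0x3fu; }

/* DFSC 0b0001LL is a translation fault at lookup level LL.
 * Returns 1 and stores the level if so, 0 otherwise. */
int dabort_translation_level(uint32_t esr, unsigned *level)
{
    unsigned dfsc = dabort_dfsc(esr);

    if ((dfsc & 0x3cu) != 0x04u)
        return 0;
    *level = dfsc & 0x3u;
    return 1;
}
```

Fed 0x96000004, these helpers give EC 0x25 (data abort without a change in exception level), WnR 0 (a read) and DFSC 0x04 (translation fault, level 0). Together with far: 10 in the register dump above, that suggests a read from address 0x10, i.e. something like a NULL pointer plus a small offset, inside _rm_assert(). By the same layout, the subject's esr_el1 2000000 is EC 0x00 (unknown reason), and in the backtrace quoted below 0xf2000000 is EC 0x3c (a BRK, presumably the debugger trap in kdb_enter()) and 0x56000000 is EC 0x15 (the SVC of the sysctl system call).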

NB: reboot/reset/halt at the db> prompt does not work on my RPi4. Luckily it
is plugged into a Wi-Fi-connected power switch.

Regards,
Ronald.

From: Mark Millard <mark...@yahoo.com>
Date: Monday, 7 March 2022 02:01
To: Ronald Klop <ronald-li...@klop.ws>
CC: freebsd-current <freebsd-current@freebsd.org>, bob prohaska
<f...@www.zefox.net>
Subject: Re: panic: data abort in critical section or under mutex (was: Re:
panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb
28))


From: Ronald Klop <ronald-lists_at_klop.ws> wrote on
Date: Sun, 6 Mar 2022 23:22:42 +0100 (CET) :

> Did some binary search with kernels from artifact.ci.freebsd.org.
>
> I suspect "rmlock: Micro-optimize read locking" as the cause.
>
> https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a
>
>
> And "rmlock: Add required compiler barriers to _rm_runlock()" as the solution.
>
> https://cgit.freebsd.org/src/commit/?id=89ae8eb74e87ac19aa2d7abe4ba16bcccd32bb9f
>
>
> So I probably just had a bad day.

Well, there is a report of a buildkernel crash after that pair:

https://lists.freebsd.org/archives/freebsd-arm/2022-March/001078.html

that references additional information at:

http://www.zefox.net/~fbsd/rpi3/crashes/20220304/readme

and reported:

QUOTE
The console connection dropped before the crash (unrelated), so I didn't
get the preamble; all I have is the backtrace and the buildkernel log.
Here's the backtrace:
db> bt
Tracing pid 14795 tid 100098 td 0xffffa00017815600
db_trace_self() at db_trace_self
db_stack_trace() at db_stack_trace+0x11c
db_command() at db_command+0x368
db_command_loop() at db_command_loop+0x54
db_trap() at db_trap+0xf8
kdb_trap() at kdb_trap+0x1cc
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0xf2000000
kdb_enter() at kdb_enter+0x44
vpanic() at vpanic+0x1b0
panic() at panic+0x44
data_abort() at data_abort+0x2e8
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0x96000004
_rm_rlock_debug() at _rm_rlock_debug+0x8c
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x140
sysctl_root() at sysctl_root+0x1ac
userland_sysctl() at userland_sysctl+0x140
sys___sysctl() at sys___sysctl+0x68
do_el0_sync() at do_el0_sync+0x520
handle_el0_sync() at handle_el0_sync+0x40
--- exception, esr 0x56000000
END QUOTE

The above material does reference _rm_rlock_debug. Might it be
related?

The readme reports:

main-n253603-0b25cbc79d3: Thu Mar  3 22:48:31 PST 2022

for the system doing the buildkernel. This is after
89ae8eb74e8.

(It also mentions another panic earlier in the week,
apparently not reported to the lists at the time.)

===
Mark Millard
marklmi at yahoo.com
