On 6/20/17 5:41 PM, Ben Greear wrote:
> On 06/20/2017 11:05 AM, Michal Kubecek wrote:
>> On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:
>>> On 06/14/2017 03:25 PM, David Ahern wrote:
>>>> On 6/14/17 4:23 PM, Ben Greear wrote:
>>>>> On 06/13/2017 07:27 PM, David Ahern wrote:
>>>>>
>>>>>> Let's try a targeted debug patch. See attached
>>>>>
>>>>> I had to change it to pr_err so it would go to our serial console
>>>>> since the system locked hard on crash,
>>>>> and that appears to be enough to change the timing where we can no
>>>>> longer reproduce the problem.
>>>>
>>>> ok, let's figure out which one is doing that. There are 3 debug
>>>> statements. I suspect fib6_del_route is the one setting the state to
>>>> FWS_U. Can you remove the debug prints in fib6_repair_tree and
>>>> fib6_walk_continue and try again?
>>>
>>> We cannot reproduce with just that one printf in the kernel either. It
>>> must change the timing too much to trigger the bug.
>>
>> You might try trace_printk() which should have less impact (don't forget
>> to enable /proc/sys/kernel/ftrace_dump_on_oops).
>
> We cannot reproduce with trace_printk() either.

I think that suggests the walker state is set to FWS_U in fib6_del_route,
and it is the FWS_U case in fib6_walk_continue that triggers the fault:
the NULL parent (pn = fn->parent). So those are the 2 areas of code that
are interacting.

I'm on a road trip through the end of this week with little time to focus
on this problem. I'll get back to you with another suggestion when I can.
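
To make the suspected interaction concrete, here is a rough user-space model of it. This is only a sketch, not the kernel code: struct node, struct walker, unlink_node() and simulate_race() are simplified, hypothetical stand-ins I made up for illustration, and the real fib6_del_route and fib6_walk_continue do considerably more. It assumes a delete leaves the walker parked on a node in FWS_U after that node's parent pointer has been cleared, so the next "up" step finds pn == NULL.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel's walker states and tree nodes.
 * These are NOT the real fib6_node / fib6_walker definitions. */
enum walk_state { FWS_L, FWS_R, FWS_C, FWS_U };

struct node {
	struct node *parent, *left, *right;
};

struct walker {
	struct node *root, *node;
	enum walk_state state;
};

/* Model of one FWS_U step: move from w->node up to its parent.
 * Returns 0 when the walk is finished at the root, 1 after moving up,
 * and -1 where this model detects the NULL parent that would fault
 * if it were dereferenced unchecked. */
static int walk_up(struct walker *w)
{
	struct node *fn = w->node;
	struct node *pn;

	if (fn == w->root)
		return 0;

	pn = fn->parent;
	if (pn == NULL)
		return -1;	/* pn->left below would be a NULL dereference */

	w->node = pn;
	if (pn->left == fn)
		w->state = FWS_R;	/* came up from the left, go right next */
	else if (pn->right == fn)
		w->state = FWS_C;	/* came up from the right, visit pn itself */
	return 1;
}

/* Hypothetical model of the delete side: unlink fn from the tree and,
 * if the walker is sitting on fn, leave it there in FWS_U. */
static void unlink_node(struct node *fn, struct walker *w)
{
	struct node *pn = fn->parent;

	if (pn) {
		if (pn->left == fn)
			pn->left = NULL;
		if (pn->right == fn)
			pn->right = NULL;
	}
	fn->parent = NULL;
	if (w->node == fn)
		w->state = FWS_U;
}

/* Build a two-node tree, delete the child under the walker, then step. */
static int simulate_race(void)
{
	struct node root = { NULL, NULL, NULL };
	struct node child = { &root, NULL, NULL };
	struct walker w = { &root, &child, FWS_C };

	root.left = &child;
	unlink_node(&child, &w);
	return walk_up(&w);
}
```

In this model simulate_race() hits the -1 branch, while the same walk_up() on an untouched tree moves to the parent normally. Whether the real walker is left in exactly this state is the part we still need to confirm.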
