On 8/31/2018 10:49 AM, Brian Rak wrote:
We've upgraded a few machines to a 4.18.3 kernel and we're running
into weird IPv6 neighbor discovery issues. Basically, the machines
stop responding to inbound IPv6 neighbor solicitation requests, which
very quickly breaks all IPv6 connectivity.
It seems like the routing table gets confused:
# ip -6 route get fe80::4e16:fc00:c7a0:7800 dev br0
RTNETLINK answers: Network is unreachable
# ping6 fe80::4e16:fc00:c7a0:7800 -I br0
connect: Network is unreachable
yet
# ip -6 route | grep fe80 | grep br0
fe80::/64 dev br0 proto kernel metric 256 pref medium
fe80::4e16:fc00:c7a0:7800 is the link-local IP of the server's default
gateway.
In this case, br0 has a single adapter attached to it.
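For reference, the address being looked up does fall inside the fe80::/64 prefix that the routing table lists for br0, so the lookup would be expected to match that route. A minimal sketch that checks this with Python's standard `ipaddress` module; the helper name `route_covers` is made up for illustration, and the values are copied from the output above:

```python
import ipaddress

# Values copied from the command output above.
gateway = ipaddress.ip_address("fe80::4e16:fc00:c7a0:7800")
link_local_route = ipaddress.ip_network("fe80::/64")

def route_covers(prefix, addr):
    """Return True if addr falls inside prefix (hypothetical helper)."""
    return addr in prefix

print(route_covers(link_local_route, gateway))  # True
```

This only confirms the prefix arithmetic; it says nothing about why the kernel lookup fails.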
I haven't been able to come up with any sort of reproduction steps
here; this seems to happen after a few days of uptime in our
environment. The last known good release we have here is 4.17.13.
Any suggestions for troubleshooting this? Sometimes machines recover
on their own, but we haven't been able to figure out what changes
when they do.
So, we're still seeing this on 4.19.13. I've been investigating this a
little further and have discovered a few more things:
The server also fails to respond to IPv6 neighbor discovery requests:
16:12:10.181769 IP6 fe80::629c:9fff:fe22:4b80 > ff02::1:ff00:33: ICMP6, neighbor solicitation, who has 2001:x::33, length 32
But this IP is configured properly:
# ip -6 addr show dev br0
7: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 2001:x::33/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::ec4:7aff:fe88:c48c/64 scope link
       valid_lft forever preferred_lft forever
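As a cross-check, the multicast destination in the capture (ff02::1:ff00:33) is the solicited-node group for the target address, so the solicitation is addressed to a group this host should be listening on. A small sketch of the mapping defined in RFC 4291 (the low 24 bits of the unicast address appended to ff02::1:ff00:0/104). The function name is hypothetical, and 2001:db8:: is a documentation prefix standing in for the redacted "2001:x":

```python
import ipaddress

def solicited_node(addr: str) -> str:
    """Solicited-node multicast group for a unicast IPv6 address (RFC 4291)."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(base | low24))

# 2001:db8:: is an assumed stand-in for the redacted prefix in the post.
print(solicited_node("2001:db8::33"))  # ff02::1:ff00:33
```

Only the low 24 bits matter here, so the redacted prefix does not affect the result.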
I found some instructions that suggest using `perf` to determine where
packets are getting dropped, so I tried:
# perf record -g -a -e skb:kfree_skb
# perf script
which showed this seemingly relevant drop (along with a bunch of
others):
swapper 0 [037] 161501.062542: skb:kfree_skb: skbaddr=0xffff968771988600 protocol=34525 location=0xffffffff94796c6a
ffffffff9468d50b kfree_skb+0x7b ([kernel.kallsyms])
ffffffff94796c6a ndisc_send_skb+0x2fa ([kernel.kallsyms])
ffffffff947975b4 ndisc_send_na+0x184 ([kernel.kallsyms])
ffffffff94798143 ndisc_recv_ns+0x2f3 ([kernel.kallsyms])
ffffffff94799b46 ndisc_rcv+0xe6 ([kernel.kallsyms])
ffffffff947a1fa8 icmpv6_rcv+0x428 ([kernel.kallsyms])
ffffffff9477bcd3 ip6_input_finish+0xf3 ([kernel.kallsyms])
ffffffff9477c11f ip6_input+0x3f ([kernel.kallsyms])
ffffffff9477c787 ip6_mc_input+0x97 ([kernel.kallsyms])
ffffffff9477c0cc ip6_rcv_finish+0x7c ([kernel.kallsyms])
ffffffff947d9fd2 ip_sabotage_in+0x42 ([kernel.kallsyms])
ffffffff946f3822 nf_hook_slow+0x42 ([kernel.kallsyms])
ffffffff9477c569 ipv6_rcv+0xc9 ([kernel.kallsyms])
ffffffff946a5de7 __netif_receive_skb_one_core+0x57 ([kernel.kallsyms])
ffffffff946a5e48 __netif_receive_skb+0x18 ([kernel.kallsyms])
ffffffff946a5145 netif_receive_skb_internal+0x45 ([kernel.kallsyms])
ffffffff946a520c netif_receive_skb+0x1c ([kernel.kallsyms])
ffffffff947c7d03 br_netif_receive_skb+0x43 ([kernel.kallsyms])
ffffffff947c7ded br_pass_frame_up+0xcd ([kernel.kallsyms])
ffffffff947c80ca br_handle_frame_finish+0x24a ([kernel.kallsyms])
ffffffff947dae0f br_nf_hook_thresh+0xdf ([kernel.kallsyms])
ffffffff947dbf19 br_nf_pre_routing_finish_ipv6+0x109 ([kernel.kallsyms])
ffffffff947dc39a br_nf_pre_routing_ipv6+0xfa ([kernel.kallsyms])
ffffffff947dbbe9 br_nf_pre_routing+0x1c9 ([kernel.kallsyms])
ffffffff946f3822 nf_hook_slow+0x42 ([kernel.kallsyms])
ffffffff947c850f br_handle_frame+0x1ef ([kernel.kallsyms])
ffffffff946a5471 __netif_receive_skb_core+0x211 ([kernel.kallsyms])
ffffffff946a5dcb __netif_receive_skb_one_core+0x3b ([kernel.kallsyms])
ffffffff946a5e48 __netif_receive_skb+0x18 ([kernel.kallsyms])
ffffffff946a5145 netif_receive_skb_internal+0x45 ([kernel.kallsyms])
ffffffff946a6fb0 napi_gro_receive+0xd0 ([kernel.kallsyms])
ffffffffc05c319f ixgbe_clean_rx_irq+0x46f ([kernel.kallsyms])
ffffffffc05c4610 ixgbe_poll+0x280 ([kernel.kallsyms])
ffffffff946a6729 net_rx_action+0x289 ([kernel.kallsyms])
ffffffff94c000d1 __softirqentry_text_start+0xd1 ([kernel.kallsyms])
ffffffff94075108 irq_exit+0xe8 ([kernel.kallsyms])
ffffffff94a01a69 do_IRQ+0x59 ([kernel.kallsyms])
ffffffff94a0098f ret_from_intr+0x0 ([kernel.kallsyms])
ffffffff9464e01d cpuidle_enter_state+0xbd ([kernel.kallsyms])
ffffffff9464e287 cpuidle_enter+0x17 ([kernel.kallsyms])
ffffffff940a3cd3 call_cpuidle+0x23 ([kernel.kallsyms])
ffffffff940a3f78 do_idle+0x1c8 ([kernel.kallsyms])
ffffffff940a4203 cpu_startup_entry+0x73 ([kernel.kallsyms])
ffffffff9403fade start_secondary+0x1ae ([kernel.kallsyms])
ffffffff940000d4 secondary_startup_64+0xa4 ([kernel.kallsyms])
However, I can't seem to determine why this is failing. It seems like
the only way to hit kfree_skb within ndisc_send_skb would be if
icmp6_dst_alloc fails?
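
Since `perf script` reports many unrelated drops, it may help to group the events by protocol and by the function that called kfree_skb. The following is a rough sketch, not a polished tool: the function name `summarize_drops` and the regular expressions are assumptions based on the output format shown above. In the trace above, protocol=34525 is 0x86dd, the EtherType for IPv6.

```python
import re
from collections import Counter

# A stack frame looks like: "<hex addr> <symbol>+0x<offset> ([module])"
FRAME = re.compile(r"^\s*[0-9a-f]+\s+([\w.]+)\+0x[0-9a-f]+")
PROTO = re.compile(r"protocol=(\d+)")

def summarize_drops(lines):
    """Count (protocol, caller of kfree_skb) pairs in perf-script style text."""
    counts = Counter()
    proto = None
    prev_was_kfree = False
    for line in lines:
        header = PROTO.search(line)
        if header:
            # Start of a new event: remember its protocol as a hex EtherType.
            proto = hex(int(header.group(1)))
            prev_was_kfree = False
            continue
        frame = FRAME.match(line)
        if not frame:
            continue
        if prev_was_kfree:
            # The frame right below kfree_skb is the function that freed the skb.
            counts[(proto, frame.group(1))] += 1
        prev_was_kfree = frame.group(1) == "kfree_skb"
    return counts

# First few lines of the event quoted above.
sample = [
    "swapper 0 [037] 161501.062542: skb:kfree_skb: "
    "skbaddr=0xffff968771988600 protocol=34525 location=0xffffffff94796c6a",
    "ffffffff9468d50b kfree_skb+0x7b ([kernel.kallsyms])",
    "ffffffff94796c6a ndisc_send_skb+0x2fa ([kernel.kallsyms])",
    "ffffffff947975b4 ndisc_send_na+0x184 ([kernel.kallsyms])",
]
print(summarize_drops(sample))
```

Feeding the full `perf script` output through a function like this would show how often ndisc_send_skb is the drop site compared with the other kfree_skb callers.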