On Fri, Oct 11, 2019 at 09:36:51AM -0500, Jesse Hathaway wrote: > On Thu, Oct 10, 2019 at 3:31 AM Ido Schimmel <ido...@idosch.org> wrote: > > I think it's working as expected. Here is my theory: > > > > If CPU0 is executing both the route get request and forwarding packets > > through the directly connected interface, then the following can happen: > > > > <CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop > > is found. Not yet dumped to user space > > > > <Any CPU, t1> - Routes are added / removed, therefore invalidating the > > cache by bumping 'net->ipv4.rt_genid' > > > > <CPU0, t2> - In softirq, packet is forwarded through the nexthop. The > > cached dst entry is found to be invalid. Therefore, it is replaced by a > > newer dst entry. dst_dev_put() is called on old entry which assigns the > > blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because > > it is not registered. > > > > <CPU0, t3> - After softirq finished executing, your route get request > > from t0 is resumed and the old dst entry is dumped to user space with > > ifindex of 0. > > > > I tested this on my system using your script to generate the route get > > requests. I pinned it to the same CPU forwarding packets through the > > nexthop. To constantly invalidate the cache I created another script > > that simply adds and removes IP addresses from an interface. > > > > If I stop the packet forwarding or the script that invalidates the > > cache, then I don't see any '*' answers to my route get requests. > > Thanks for the reply and analysis Ido, I tested with an additional script > which > adds and deletes a route in a loop, as you also saw this increased the > frequency of blackhole route replies from the first script. > > Questions: > > 1. We saw this behavior occurring with TCP connections traversing our routers, > though I was able to reproduce it with only local route requests on our > router. > Would you expect this same behavior for TCP traffic only in the kernel which > does not go to userspace?
Yes, the problem is in the input path where received packets need to be forwarded. > > 2. These blackhole routes occur even though our main routing table is not > changing, however a separate route table managed by bird on the Linux router > is > changing. Is this still expected behavior given that the ip-rules and main > route table used by these route requests are not changing? Yes, there is a per-netns counter that is incremented whenever cached dst entries need to be invalidated. Since it is per-netns it is incremented regardless of the routing table to which your insert the route. > > 3. We were previously rejecting these packets with an iptables rule which sent > an ICMP prohibited message to the sender, this caused TCP connections to break > with a EHOSTUNREACH, should we be silently dropping these packets instead? > > 4. If we should just be dropping these packets, why does the kernel not drop > them instead of letting them traverse the iptables rules? I actually believe the current behavior is a bug that needs to be fixed. See below. > > > BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that > > with older kernel versions you'll see 'lo' instead of '*'. > > Yes indeed! Thanks for solving that mystery as well, our routers are running > 5.1, but we upgraded to 5.4-rc2 to determine whether the issue was still > present in the latest kernel. Do you remember when you started seeing this behavior? I think it started in 4.13 with commit ffe95ecf3a2e ("Merge branch 'net-remove-dst-garbage-collector-logic'"). Let me add Wei to see if/how this can be fixed. Wei, in case you don't have the original mail with the description of the problem, it can be found here [1]. I believe that the issue Jesse is experiencing is the following: <CPU A, t0> - Received packet A is forwarded and cached dst entry is taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set() <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables from multiple ISPs"), route is added / deleted and rt_cache_flush() is called <CPU B, t2> - Received packet B tries to use the same cached dst entry from t0, but rt_cache_valid() is no longer true and it is replaced in rt_cache_route() by the newer one. This calls dst_dev_put() on the original dst entry which assigns the blackhole netdev to 'dst->dev' <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due to 'dst->dev' being the blackhole netdev The following patch "fixes" the problem for me: diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 42221a12bdda..1c67bdb80fd5 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1482,7 +1482,6 @@ static bool rt_cache_route(struct fib_nh_common *nhc, struct rtable *rt) prev = cmpxchg(p, orig, rt); if (prev == orig) { if (orig) { - dst_dev_put(&orig->dst); dst_release(&orig->dst); } } else { But if this dst entry is cached in some inactive socket and the netdev on which it took a reference needs to be unregistered, then we can potentially wait forever. No? I'm thinking that it can be fixed by making 'nhc_rth_input' per-CPU, in a similar fashion to what Eric did in commit d26b3a7c4b3b ("ipv4: percpu nh_rth_output cache"). Two questions: 1. Do you agree with the above analysis? 2. Do you have a simpler/better solution in mind? Thanks [1] https://lore.kernel.org/netdev/CANSNSoVM1Uo106xfJtGpTyXNed8kOL4JiXqf3A1eZHBa7z3=y...@mail.gmail.com/T/#medece9445d617372b4842d44525ef0d3ba1ea083