On Wed, Oct 09, 2019 at 11:00:07AM -0500, Jesse Hathaway wrote:
> We have been experiencing a route lookup race condition on our internet facing
> Linux routers. I have been able to reproduce the issue, but would love more
> help in isolating the cause.
>
> Looking up a route found in the main table returns `*` rather than the
> directly connected interface about once for every 10-20 million requests.
> From my reading of the iproute2 source code an asterisk is indicative of the
> kernel returning an interface index of 0 rather than the correct directly
> connected interface.
>
> This is reproducible with the following bash snippet on 5.4-rc2:
>
>   $ cat route-race
>   #!/bin/bash
>
>   # Generate 50 million individual route gets to feed as batch input to `ip`
>   function ip-cmds() {
>           route_get='route get 192.168.11.142 from 192.168.180.10 iif vlan180'
>           for ((i = 0; i < 50000000; i++)); do
>                   printf '%s\n' "${route_get}"
>           done
>
>   }
>
>   ip-cmds | ip -d -o -batch - | grep -E 'dev \*' | uniq -c
>
> Example output:
>
>   $ ./route-race
>         6 unicast 192.168.11.142 from 192.168.180.10 dev * table main
>   \    cache iif vlan180
>
> These routers have multiple routing tables and are ingesting full BGP routing
> tables from multiple ISPs:
>
>   $ ip route show table all | wc -l
>   3105543
>
>   $ ip route show table main | wc -l
>   54
>
> Please let me know what other information I can provide, thanks in advance,

I think it's working as expected. Here is my theory:

If CPU0 is executing both the route get request and forwarding packets
through the directly connected interface, then the following can happen:

<CPU0, t0> - In process context, per-CPU dst entry cached in the nexthop
is found. Not yet dumped to user space.

<Any CPU, t1> - Routes are added / removed, therefore invalidating the
cache by bumping 'net->ipv4.rt_genid'.

<CPU0, t2> - In softirq, packet is forwarded through the nexthop. The
cached dst entry is found to be invalid. Therefore, it is replaced by a
newer dst entry. dst_dev_put() is called on old entry which assigns the
blackhole netdev to 'dst->dev'. This netdev has an ifindex of 0 because
it is not registered.

<CPU0, t3> - After softirq finished executing, your route get request
from t0 is resumed and the old dst entry is dumped to user space with
ifindex of 0.

I tested this on my system using your script to generate the route get
requests. I pinned it to the same CPU forwarding packets through the
nexthop. To constantly invalidate the cache I created another script
that simply adds and removes IP addresses from an interface.

If I stop the packet forwarding or the script that invalidates the
cache, then I don't see any '*' answers to my route get requests.

BTW, the blackhole netdev was added in 5.3. I assume (didn't test) that
with older kernel versions you'll see 'lo' instead of '*'.
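
For reference, a minimal sketch of a script that invalidates the cache by
repeatedly adding and removing an address, in the same batch style as
route-race. This is not the script used in the test described above; the
interface name (dummy0), the address (198.51.100.1/32) and the function
name churn_cmds are placeholders chosen for illustration:

```shell
#!/bin/bash
# Sketch only: IFACE and ADDR are placeholder values, adjust for your setup.
IFACE="${IFACE:-dummy0}"
ADDR="${ADDR:-198.51.100.1/32}"

# Emit COUNT add/del pairs as batch input to `ip`. Each address change is
# assumed to invalidate the cached dst entries, per the theory above.
churn_cmds() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'addr add %s dev %s\n' "$ADDR" "$IFACE"
        printf 'addr del %s dev %s\n' "$ADDR" "$IFACE"
        i=$((i + 1))
    done
}
```

Something like `churn_cmds 100000 | ip -force -batch -` (as root) would then
run alongside the reproducer, with the reproducer pinned to the forwarding
CPU, e.g. `taskset -c 0 ./route-race` (CPU 0 here is an assumption).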