I'm currently working on the routing table interface to make is safe to use by multiple CPUs at the same time. The diff below is a big step in this direction and I'd really appreciate if people could test it with their usual network setup and report back.
The goal of this diff is to "cache" the route corresponding to "your" next hop as early as possible. Let's assume you're using a common dhcp-based network: mpi@goiaba $ netstat -rnf inet|egrep "(default|Dest)" Destination Gateway Flags Refs Use Mtu Prio Iface default 192.168.0.1 UGS 5 508 - 8 em0 Here my default route points to a gateway (G) whose address is 192.168.0.1. In such setup your computer generally sends most of the packets to the internet through this gateway. But to do that it needs more informations: mpi@goiaba $ netstat -rnf inet|egrep "(192.168.0.1.*L|Dest)" Destination Gateway Flags Refs Use Mtu Prio Iface 192.168.0.1 bc:05:43:bd:3e:29 UHLc 1 149 - 8 em0 Yes this is another route. This one contains link-layer informations (L) and has been cloned (c). This route is what I described before as "your" next hop. In this case, "your" is a shortcut for "the next hop of your default route" but all of this is valid for any route pointing to a gateway (G). In order to send packets via my default route, the kernel needs to know the link-layer address corresponding to the IP address of the gateway. This is called "Address Resolution" in network jargon. In OpenBSD, resolved addresses appear in the routing table with a link-layer address in the "Gateway" field, as shown previously. This resolution is done in the kernel by calling rtalloc(9) with the RT_RESOLVE flag for the wanted destination, in my case 192.168.0.1. Once the resolution is complete, a corresponding entry appears in the routing table and there's no need to redo it for a certain period of time. That is what I meant with "cache". Currently this resolution is done "late" in the journey of a packet and that's fine since it is not done often. Late means that it is done when the packet reaches a L2 output function, nd6_output() or arpresolve(). The problem is that having a proper reference count on route entries in these functions is really complicated because you can end up using 3 different routes. So this diff starts the resolution early: as soon as a gateway route is returned by rtalloc(9). It also makes sense to do the resolution as soon as possible since we need the link-layer address to send the packet. One important point: gateway routes (rt_gwroute) are only returned to the stack in L2 functions and when that happens, their reference counter is not incremented. That's why the reference count for such routes is almost always 1. They are the simplest example of working route caching in our kernel*. That means that when you purge your cloned route, rt_gwroute will still be valid but marked as RTP_DOWN until a new resolution is started. This diff changes rt_checkgate() to only do sanity checks (finally!). Do not hesitate to ask questions if something is not clear, I believe it's important that more people understand this. Note that this diff includes other bits to be committed separately: - Deprecate the use of RTF_XRESOLVE in rtalloc(9) - Remove PF_KEY-specific code & comments now that SPD lookups no longer use rtalloc(9). - Make rtfree(9) accept NULL * That's why I'm slowly killing "struct route" & friends to use the simplest route caching mechanism everywhere. Index: net/route.c =================================================================== RCS file: /cvs/src/sys/net/route.c,v retrieving revision 1.217 diff -u -p -r1.217 route.c --- net/route.c 18 Jul 2015 15:51:16 -0000 1.217 +++ net/route.c 12 Aug 2015 13:54:56 -0000 @@ -153,6 +153,7 @@ int rtable_alloc(void ***, u_int); int rtflushclone1(struct rtentry *, void *, u_int); void rtflushclone(unsigned int, struct rtentry *); int rt_if_remove_rtdelete(struct rtentry *, void *, u_int); +struct rtentry *rt_match(struct sockaddr *, int, unsigned int); struct ifaddr *ifa_ifwithroute(int, struct sockaddr *, struct sockaddr *, u_int); @@ -297,17 +298,31 @@ rtable_exists(u_int id) /* verify table return (1); } +/* + * Do the actual lookup for rtalloc(9), do not use directly! + * + * Return the best matching entry for the destination ``dst''. + * + * "RT_RESOLVE" means that a corresponding L2 entry should + * be added to the routing table and resolved (via ARP or + * NDP), if it does not exist. + * + * "RT_REPORT" indicates that a message should be sent to + * userland if no matching route has been found or if an + * error occured while adding a L2 entry. + */ struct rtentry * -rtalloc(struct sockaddr *dst, int flags, unsigned int tableid) +rt_match(struct sockaddr *dst, int flags, unsigned int tableid) { struct rtentry *rt; struct rtentry *newrt = 0; struct rt_addrinfo info; - int s = splsoftnet(), err = 0, msgtype = RTM_MISS; + int s, err = 0; bzero(&info, sizeof(info)); info.rti_info[RTAX_DST] = dst; + s = splsoftnet(); rt = rtable_match(tableid, dst); if (rt != NULL) { newrt = rt; @@ -319,28 +334,15 @@ rtalloc(struct sockaddr *dst, int flags, rt->rt_refcnt++; goto miss; } - if ((rt = newrt) && (rt->rt_flags & RTF_XRESOLVE)) { - msgtype = RTM_RESOLVE; - goto miss; - } /* Inform listeners of the new route */ rt_sendmsg(rt, RTM_ADD, tableid); } else rt->rt_refcnt++; } else { - if (dst->sa_family != PF_KEY) - rtstat.rts_unreach++; - /* - * IP encapsulation does lots of lookups where we don't need nor want - * the RTM_MISSes that would be generated. It causes RTM_MISS storms - * sent upward breaking user-level routing queries. - */ + rtstat.rts_unreach++; miss: - if (ISSET(flags, RT_REPORT) && dst->sa_family != PF_KEY) { - bzero((caddr_t)&info, sizeof(info)); - info.rti_info[RTAX_DST] = dst; - rt_missmsg(msgtype, &info, 0, NULL, err, tableid); - } + if (ISSET(flags, RT_REPORT)) + rt_missmsg(RTM_MISS, &info, 0, NULL, err, tableid); } splx(s); return (newrt); @@ -374,12 +376,81 @@ rtalloc_mpath(struct sockaddr *dst, uint } #endif /* SMALL_KERNEL */ +/* + * Look in the routing table for the best matching entry for + * ``dst''. + * + * If a route with a gateway is found and its next hop is no + * longer valid, try to cache it. + */ +struct rtentry * +rtalloc(struct sockaddr *dst, int flags, unsigned int rtableid) +{ + struct rtentry *rt, *nhrt; + + rt = rt_match(dst, flags, rtableid); + + /* No match or route to host? We're done. */ + if (rt == NULL || (rt->rt_flags & RTF_GATEWAY) == 0) + return (rt); + + nhrt = rt->rt_gwroute; + + /* Nothing to do if the next hop is valid. */ + if (nhrt != NULL && (nhrt->rt_flags & RTF_UP)) + return (rt); + + rtfree(rt->rt_gwroute); + rt->rt_gwroute = NULL; + + /* + * If we cannot find a valid next hop, return the route + * with a gateway. + * Some dragons hiding in the tree certainly depends on + * this behavior. + */ + nhrt = rt_match(rt->rt_gateway, flags, rtableid); + if (nhrt == NULL) + return (rt); + + /* + * Next hop must be reachable, this also prevents rtentry + * loops for example when rt->rt_gwroute points to rt. + */ + if ((nhrt->rt_flags & (RTF_UP|RTF_GATEWAY)) != RTF_UP) { + rtfree(nhrt); + return (rt); + } + + /* + * Next hop entry MUST be on the same interface. + * + * XXX We could use a KASSERT() here if routes with dangling + * ``ifa'' pointers were dropped. + */ + if (nhrt->rt_ifp != rt->rt_ifp) { + rtfree(nhrt); + return (rt); + } + + /* + * If the MTU of next hop is 0, this will reset the MTU of the + * route to run PMTUD again from scratch. + */ + if (!ISSET(rt->rt_locks, RTV_MTU) && (rt->rt_mtu > nhrt->rt_mtu)) + rt->rt_mtu = nhrt->rt_mtu; + + rt->rt_gwroute = nhrt; + return (rt); +} + void rtfree(struct rtentry *rt) { struct ifaddr *ifa; - KASSERT(rt != NULL); + if (rt == NULL) + return; rt->rt_refcnt--; @@ -526,7 +597,7 @@ create: rt->rt_flags |= RTF_MODIFIED; flags |= RTF_MODIFIED; stat = &rtstat.rts_newgateway; - rt_setgate(rt, gateway, rdomain); + rt_setgate(rt, gateway); } } else error = EHOSTUNREACH; @@ -983,8 +1054,7 @@ rtrequest1(int req, struct rt_addrinfo * * the routing table because the radix MPATH code use * it to (re)order routes. */ - if ((error = rt_setgate(rt, info->rti_info[RTAX_GATEWAY], - tableid))) { + if ((error = rt_setgate(rt, info->rti_info[RTAX_GATEWAY]))) { free(ndst, M_RTABLE, dlen); pool_put(&rtentry_pool, rt); return (error); @@ -1035,7 +1105,7 @@ rtrequest1(int req, struct rt_addrinfo * } int -rt_setgate(struct rtentry *rt, struct sockaddr *gate, unsigned int tableid) +rt_setgate(struct rtentry *rt, struct sockaddr *gate) { int glen = ROUNDUP(gate->sa_len); struct sockaddr *sa; @@ -1053,22 +1123,7 @@ rt_setgate(struct rtentry *rt, struct so rtfree(rt->rt_gwroute); rt->rt_gwroute = NULL; } - if (rt->rt_flags & RTF_GATEWAY) { - /* XXX is this actually valid to cross tables here? */ - rt->rt_gwroute = rtalloc(gate, RT_REPORT|RT_RESOLVE, tableid); - /* - * If we switched gateways, grab the MTU from the new - * gateway route if the current MTU is 0 or greater - * than the MTU of gateway. - * Note that, if the MTU of gateway is 0, we will reset the - * MTU of the route to run PMTUD again from scratch. XXX - */ - if (rt->rt_gwroute && !(rt->rt_rmx.rmx_locks & RTV_MTU) && - rt->rt_rmx.rmx_mtu && - rt->rt_rmx.rmx_mtu > rt->rt_gwroute->rt_rmx.rmx_mtu) { - rt->rt_rmx.rmx_mtu = rt->rt_gwroute->rt_rmx.rmx_mtu; - } - } + return (0); } @@ -1080,28 +1135,21 @@ rt_checkgate(struct ifnet *ifp, struct r KASSERT(rt != NULL); - if ((rt->rt_flags & RTF_UP) == 0) { - rt = rtalloc(dst, RT_REPORT|RT_RESOLVE, rtableid); - if (rt == NULL) - return (EHOSTUNREACH); - rt->rt_refcnt--; - if (rt->rt_ifp != ifp) - return (EHOSTUNREACH); - } + if ((rt->rt_flags & RTF_UP) == 0) + return (EHOSTUNREACH); rt0 = rt; if (rt->rt_flags & RTF_GATEWAY) { - if (rt->rt_gwroute && !(rt->rt_gwroute->rt_flags & RTF_UP)) { + if (rt->rt_gwroute == NULL) + return (EHOSTUNREACH); + + if ((rt->rt_gwroute->rt_flags & RTF_UP) == 0) { rtfree(rt->rt_gwroute); rt->rt_gwroute = NULL; + return (EHOSTUNREACH); } - if (rt->rt_gwroute == NULL) { - rt->rt_gwroute = rtalloc(rt->rt_gateway, - RT_REPORT|RT_RESOLVE, rtableid); - if (rt->rt_gwroute == NULL) - return (EHOSTUNREACH); - } + /* * Next hop must be reachable, this also prevents rtentry * loops, for example when rt->rt_gwroute points to rt. Index: net/route.h =================================================================== RCS file: /cvs/src/sys/net/route.h,v retrieving revision 1.109 diff -u -p -r1.109 route.h --- net/route.h 18 Jul 2015 15:51:16 -0000 1.109 +++ net/route.h 12 Aug 2015 12:20:45 -0000 @@ -111,6 +111,8 @@ struct rtentry { }; #define rt_use rt_rmx.rmx_pksent #define rt_expire rt_rmx.rmx_expire +#define rt_locks rt_rmx.rmx_locks +#define rt_mtu rt_rmx.rmx_mtu #define RTF_UP 0x1 /* route usable */ #define RTF_GATEWAY 0x2 /* destination is a gateway */ @@ -354,7 +356,7 @@ void rt_sendmsg(struct rtentry *, int, void rt_sendaddrmsg(struct rtentry *, int); void rt_missmsg(int, struct rt_addrinfo *, int, struct ifnet *, int, u_int); -int rt_setgate(struct rtentry *, struct sockaddr *, unsigned int); +int rt_setgate(struct rtentry *, struct sockaddr *); int rt_checkgate(struct ifnet *, struct rtentry *, struct sockaddr *, unsigned int, struct rtentry **); void rt_setmetrics(u_long, struct rt_metrics *, struct rt_kmetrics *); Index: net/rtsock.c =================================================================== RCS file: /cvs/src/sys/net/rtsock.c,v retrieving revision 1.166 diff -u -p -r1.166 rtsock.c --- net/rtsock.c 18 Jul 2015 21:58:06 -0000 1.166 +++ net/rtsock.c 12 Aug 2015 12:20:45 -0000 @@ -748,9 +748,8 @@ report: info.rti_info[RTAX_GATEWAY]->sa_len)) { newgate = 1; } - if (info.rti_info[RTAX_GATEWAY] != NULL && - (error = rt_setgate(rt, info.rti_info[RTAX_GATEWAY], - tableid))) + if (info.rti_info[RTAX_GATEWAY] != NULL && (error = + rt_setgate(rt, info.rti_info[RTAX_GATEWAY]))) goto flush; /* * new gateway could require new ifaddr, ifp;