On 12/08/15(Wed) 17:03, Martin Pieuchot wrote:
> I'm currently working on the routing table interface to make it safe
> to use by multiple CPUs at the same time.  The diff below is a big
> step in this direction and I'd really appreciate if people could test
> it with their usual network setup and report back.

Updated version to match recent changes.  I'm still looking for test
reports and reviews.

> The goal of this diff is to "cache" the route corresponding to "your"
> next hop as early as possible.  Let's assume you're using a common
> dhcp-based network:
>
>     mpi@goiaba $ netstat -rnf inet|egrep "(default|Dest)"
>     Destination  Gateway            Flags  Refs  Use  Mtu  Prio  Iface
>     default      192.168.0.1        UGS       5  508    -     8  em0
>
> Here my default route points to a gateway (G) whose address is
> 192.168.0.1.  In such a setup your computer generally sends most of
> the packets to the internet through this gateway.  But to do that it
> needs more information:
>
>     mpi@goiaba $ netstat -rnf inet|egrep "(192.168.0.1.*L|Dest)"
>     Destination  Gateway            Flags  Refs  Use  Mtu  Prio  Iface
>     192.168.0.1  bc:05:43:bd:3e:29  UHLc      1  149    -     8  em0
>
> Yes, this is another route.  This one contains link-layer information
> (L) and has been cloned (c).  This route is what I described before as
> "your" next hop.  In this case, "your" is a shortcut for "the next hop
> of your default route" but all of this is valid for any route pointing
> to a gateway (G).
>
> In order to send packets via my default route, the kernel needs to know
> the link-layer address corresponding to the IP address of the gateway.
> This is called "Address Resolution" in network jargon.  In OpenBSD,
> resolved addresses appear in the routing table with a link-layer address
> in the "Gateway" field, as shown previously.
>
> This resolution is done in the kernel by calling rtalloc(9) with the
> RT_RESOLVE flag for the wanted destination, in my case 192.168.0.1.
> Once the resolution is complete, a corresponding entry appears in the
> routing table and there's no need to redo it for a certain period of
> time.  That is what I meant by "cache".
>
> Currently this resolution is done "late" in the journey of a packet and
> that's fine since it is not done often.  Late means that it is done when
> the packet reaches an L2 output function, nd6_output() or arpresolve().
>
> The problem is that having a proper reference count on route entries in
> these functions is really complicated because you can end up using 3
> different routes.  So this diff starts the resolution early: as soon as
> a gateway route is returned by rtalloc(9).
>
> It also makes sense to do the resolution as soon as possible since we
> need the link-layer address to send the packet.
>
> One important point: gateway routes (rt_gwroute) are only returned to
> the stack in L2 functions and when that happens, their reference
> counter is not incremented.  That's why the reference count for such
> routes is almost always 1.  They are the simplest example of working
> route caching in our kernel*.  That means that when you purge your
> cloned route, rt_gwroute will still be valid but marked as RTP_DOWN
> until a new resolution is started.
>
> This diff changes rt_checkgate() to only do sanity checks (finally!).
>
> Do not hesitate to ask questions if something is not clear, I believe
> it's important that more people understand this.
>
> Note that this diff includes other bits to be committed separately:
>
>  - Deprecate the use of RTF_XRESOLVE in rtalloc(9)
>  - Remove PF_KEY-specific code & comments now that SPD lookups no
>    longer use rtalloc(9).
>  - Make rtfree(9) accept NULL
>
>
> * That's why I'm slowly killing "struct route" & friends to use the
>   simplest route caching mechanism everywhere.

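For those who would rather see the idea before reading the diff, here is
a minimal userland sketch of "resolve early, only check late".  It is
not the kernel code: every toy_* name, the flat table and the integer
addresses are made up for illustration, and locking, route cloning and
the MTU handling are left out on purpose.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel definitions. */
#define RTF_UP		0x1	/* route usable */
#define RTF_GATEWAY	0x2	/* destination is a gateway */

struct rtentry {
	int		 rt_flags;
	int		 rt_refcnt;
	unsigned int	 rt_dst;	/* toy key, 0 means "default" */
	unsigned int	 rt_gateway;	/* only used with RTF_GATEWAY */
	struct rtentry	*rt_gwroute;	/* cached next hop */
};

#define TOY_TABLE_SIZE	8
static struct rtentry *toy_table[TOY_TABLE_SIZE];

/* Drop a reference; NULL is accepted, as proposed for rtfree(9). */
void
toy_rtfree(struct rtentry *rt)
{
	if (rt != NULL)
		rt->rt_refcnt--;
}

/* Plain lookup: exact match first, else the default route. */
struct rtentry *
toy_match(unsigned int dst)
{
	struct rtentry *rt = NULL, *def = NULL;
	int i;

	for (i = 0; i < TOY_TABLE_SIZE; i++) {
		if (toy_table[i] == NULL)
			continue;
		if (toy_table[i]->rt_dst == dst) {
			rt = toy_table[i];
			break;
		}
		if (toy_table[i]->rt_dst == 0)
			def = toy_table[i];
	}
	if (rt == NULL)
		rt = def;
	if (rt != NULL)
		rt->rt_refcnt++;
	return (rt);
}

/* Lookup that caches the next hop as soon as a gateway route is found. */
struct rtentry *
toy_rtalloc(unsigned int dst)
{
	struct rtentry *rt, *nhrt;

	rt = toy_match(dst);
	if (rt == NULL || (rt->rt_flags & RTF_GATEWAY) == 0)
		return (rt);

	/* Nothing to do if the cached next hop is still valid. */
	nhrt = rt->rt_gwroute;
	if (nhrt != NULL && (nhrt->rt_flags & RTF_UP))
		return (rt);

	toy_rtfree(rt->rt_gwroute);
	rt->rt_gwroute = NULL;

	nhrt = toy_match(rt->rt_gateway);
	if (nhrt == NULL)
		return (rt);

	/* The next hop must be up and must not be a gateway route itself. */
	if ((nhrt->rt_flags & (RTF_UP|RTF_GATEWAY)) != RTF_UP) {
		toy_rtfree(nhrt);
		return (rt);
	}

	rt->rt_gwroute = nhrt;	/* keeps the reference taken by toy_match() */
	return (rt);
}

/* Late check: sanity checks only, no lookup is started from here. */
int
toy_checkgate(struct rtentry *rt)
{
	assert(rt != NULL);

	if ((rt->rt_flags & RTF_UP) == 0)
		return (EHOSTUNREACH);

	if (rt->rt_flags & RTF_GATEWAY) {
		if (rt->rt_gwroute == NULL)
			return (EHOSTUNREACH);
		if ((rt->rt_gwroute->rt_flags & RTF_UP) == 0) {
			toy_rtfree(rt->rt_gwroute);
			rt->rt_gwroute = NULL;
			return (EHOSTUNREACH);
		}
	}
	return (0);
}
```

With a default route and a host entry for its gateway in toy_table, the
first toy_rtalloc() for any remote destination fills rt_gwroute, and
later lookups reuse it without touching the next hop's reference count.
If the next hop goes down, toy_checkgate() only reports EHOSTUNREACH
and drops the cached pointer; the next toy_rtalloc() resolves it again.
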
Index: net/route.c
===================================================================
RCS file: /cvs/src/sys/net/route.c,v
retrieving revision 1.225
diff -u -p -r1.225 route.c
--- net/route.c	24 Aug 2015 22:11:33 -0000	1.225
+++ net/route.c	25 Aug 2015 10:23:02 -0000
@@ -153,6 +153,7 @@ int	rtable_alloc(void ***, u_int);
 int	rtflushclone1(struct rtentry *, void *, u_int);
 void	rtflushclone(unsigned int, struct rtentry *);
 int	rt_if_remove_rtdelete(struct rtentry *, void *, u_int);
+struct rtentry *rt_match(struct sockaddr *, int, unsigned int);

 struct	ifaddr *ifa_ifwithroute(int, struct sockaddr *, struct sockaddr *,
 		    u_int);
@@ -297,19 +298,32 @@ rtable_exists(u_int id)	/* verify table
 	return (1);
 }

+/*
+ * Do the actual lookup for rtalloc(9), do not use directly!
+ *
+ * Return the best matching entry for the destination ``dst''.
+ *
+ * "RT_RESOLVE" means that a corresponding L2 entry should
+ * be added to the routing table and resolved (via ARP or
+ * NDP), if it does not exist.
+ *
+ * "RT_REPORT" indicates that a message should be sent to
+ * userland if no matching route has been found or if an
+ * error occured while adding a L2 entry.
+ */
 struct rtentry *
-rtalloc(struct sockaddr *dst, int flags, unsigned int tableid)
+rt_match(struct sockaddr *dst, int flags, unsigned int tableid)
 {
 	struct rtentry		*rt;
 	struct rtentry		*newrt = NULL;
 	struct rt_addrinfo	 info;
-	int			 s, error = 0, msgtype = RTM_MISS;
+	int			 s, error = 0;

-	s = splsoftnet();
 	bzero(&info, sizeof(info));
 	info.rti_info[RTAX_DST] = dst;

+	s = splsoftnet();
 	rt = rtable_match(tableid, dst);
 	if (rt != NULL) {
 		newrt = rt;
@@ -322,10 +336,6 @@ rtalloc(struct sockaddr *dst, int flags,
 				goto miss;
 			}
 			rt = newrt;
-			if (rt->rt_flags & RTF_XRESOLVE) {
-				msgtype = RTM_RESOLVE;
-				goto miss;
-			}
 			/* Inform listeners of the new route */
 			rt_sendmsg(rt, RTM_ADD, tableid);
 		} else
@@ -333,11 +343,8 @@ rtalloc(struct sockaddr *dst, int flags,
 	} else {
 		rtstat.rts_unreach++;
 miss:
-		if (ISSET(flags, RT_REPORT)) {
-			bzero((caddr_t)&info, sizeof(info));
-			info.rti_info[RTAX_DST] = dst;
-			rt_missmsg(msgtype, &info, 0, NULL, error, tableid);
-		}
+		if (ISSET(flags, RT_REPORT))
+			rt_missmsg(RTM_MISS, &info, 0, NULL, error, tableid);
 	}
 	splx(s);
 	return (newrt);
@@ -371,6 +378,75 @@ rtalloc_mpath(struct sockaddr *dst, uint
 }
 #endif /* SMALL_KERNEL */

+/*
+ * Look in the routing table for the best matching entry for
+ * ``dst''.
+ *
+ * If a route with a gateway is found and its next hop is no
+ * longer valid, try to cache it.
+ */
+struct rtentry *
+rtalloc(struct sockaddr *dst, int flags, unsigned int rtableid)
+{
+	struct rtentry *rt, *nhrt;
+
+	rt = rt_match(dst, flags, rtableid);
+
+	/* No match or route to host?  We're done. */
+	if (rt == NULL || (rt->rt_flags & RTF_GATEWAY) == 0)
+		return (rt);
+
+	nhrt = rt->rt_gwroute;
+
+	/* Nothing to do if the next hop is valid. */
+	if (nhrt != NULL && (nhrt->rt_flags & RTF_UP))
+		return (rt);
+
+	rtfree(rt->rt_gwroute);
+	rt->rt_gwroute = NULL;
+
+	/*
+	 * If we cannot find a valid next hop, return the route
+	 * with a gateway.
+	 *
+	 * Some dragons hiding in the tree certainly depends on
+	 * this behavior.
+	 */
+	nhrt = rt_match(rt->rt_gateway, flags | RT_RESOLVE, rtableid);
+	if (nhrt == NULL)
+		return (rt);
+
+	/*
+	 * Next hop must be reachable, this also prevents rtentry
+	 * loops for example when rt->rt_gwroute points to rt.
+	 */
+	if ((nhrt->rt_flags & (RTF_UP|RTF_CLONING|RTF_GATEWAY)) != RTF_UP) {
+		rtfree(nhrt);
+		return (rt);
+	}
+
+	/*
+	 * Next hop entry MUST be on the same interface.
+	 *
+	 * XXX We could use a KASSERT() here if routes with dangling
+	 * ``ifa'' pointers were dropped.
+	 */
+	if (nhrt->rt_ifp != rt->rt_ifp) {
+		rtfree(nhrt);
+		return (rt);
+	}
+
+	/*
+	 * If the MTU of next hop is 0, this will reset the MTU of the
+	 * route to run PMTUD again from scratch.
+	 */
+	if (!ISSET(rt->rt_locks, RTV_MTU) && (rt->rt_mtu > nhrt->rt_mtu))
+		rt->rt_mtu = nhrt->rt_mtu;
+
+	rt->rt_gwroute = nhrt;
+	return (rt);
+}
+
 void
 rtfree(struct rtentry *rt)
 {
@@ -524,7 +600,7 @@ create:
 			rt->rt_flags |= RTF_MODIFIED;
 			flags |= RTF_MODIFIED;
 			stat = &rtstat.rts_newgateway;
-			rt_setgate(rt, gateway, rdomain);
+			rt_setgate(rt, gateway);
 		}
 	} else
 		error = EHOSTUNREACH;
@@ -979,8 +1055,7 @@ rtrequest1(int req, struct rt_addrinfo *
 		 * the routing table because the radix MPATH code use
 		 * it to (re)order routes.
 		 */
-		if ((error = rt_setgate(rt, info->rti_info[RTAX_GATEWAY],
-		    tableid))) {
+		if ((error = rt_setgate(rt, info->rti_info[RTAX_GATEWAY]))) {
 			free(ndst, M_RTABLE, dlen);
 			pool_put(&rtentry_pool, rt);
 			return (error);
@@ -1031,7 +1106,7 @@ rtrequest1(int req, struct rt_addrinfo *
 }

 int
-rt_setgate(struct rtentry *rt, struct sockaddr *gate, unsigned int tableid)
+rt_setgate(struct rtentry *rt, struct sockaddr *gate)
 {
 	int glen = ROUNDUP(gate->sa_len);
 	struct sockaddr *sa;
@@ -1049,22 +1124,7 @@ rt_setgate(struct rtentry *rt, struct so
 		rtfree(rt->rt_gwroute);
 		rt->rt_gwroute = NULL;
 	}
-	if (rt->rt_flags & RTF_GATEWAY) {
-		/* XXX is this actually valid to cross tables here? */
-		rt->rt_gwroute = rtalloc(gate, RT_REPORT|RT_RESOLVE, tableid);
-		/*
-		 * If we switched gateways, grab the MTU from the new
-		 * gateway route if the current MTU is 0 or greater
-		 * than the MTU of gateway.
-		 * Note that, if the MTU of gateway is 0, we will reset the
-		 * MTU of the route to run PMTUD again from scratch. XXX
-		 */
-		if (rt->rt_gwroute && !(rt->rt_rmx.rmx_locks & RTV_MTU) &&
-		    rt->rt_rmx.rmx_mtu &&
-		    rt->rt_rmx.rmx_mtu > rt->rt_gwroute->rt_rmx.rmx_mtu) {
-			rt->rt_rmx.rmx_mtu = rt->rt_gwroute->rt_rmx.rmx_mtu;
-		}
-	}
+
 	return (0);
 }

@@ -1076,28 +1136,21 @@ rt_checkgate(struct ifnet *ifp, struct r

 	KASSERT(rt != NULL);

-	if ((rt->rt_flags & RTF_UP) == 0) {
-		rt = rtalloc(dst, RT_REPORT|RT_RESOLVE, rtableid);
-		if (rt == NULL)
-			return (EHOSTUNREACH);
-		rt->rt_refcnt--;
-		if (rt->rt_ifp != ifp)
-			return (EHOSTUNREACH);
-	}
+	if ((rt->rt_flags & RTF_UP) == 0)
+		return (EHOSTUNREACH);

 	rt0 = rt;

 	if (rt->rt_flags & RTF_GATEWAY) {
-		if (rt->rt_gwroute && !(rt->rt_gwroute->rt_flags & RTF_UP)) {
+		if (rt->rt_gwroute == NULL)
+			return (EHOSTUNREACH);
+
+		if ((rt->rt_gwroute->rt_flags & RTF_UP) == 0) {
 			rtfree(rt->rt_gwroute);
 			rt->rt_gwroute = NULL;
+			return (EHOSTUNREACH);
 		}
-		if (rt->rt_gwroute == NULL) {
-			rt->rt_gwroute = rtalloc(rt->rt_gateway,
-			    RT_REPORT|RT_RESOLVE, rtableid);
-			if (rt->rt_gwroute == NULL)
-				return (EHOSTUNREACH);
-		}
+
 		/*
 		 * Next hop must be reachable, this also prevents rtentry
 		 * loops, for example when rt->rt_gwroute points to rt.
Index: net/route.h
===================================================================
RCS file: /cvs/src/sys/net/route.h,v
retrieving revision 1.110
diff -u -p -r1.110 route.h
--- net/route.h	20 Aug 2015 12:39:43 -0000	1.110
+++ net/route.h	25 Aug 2015 10:23:02 -0000
@@ -118,6 +118,8 @@ struct rtentry {
 };
 #define	rt_use		rt_rmx.rmx_pksent
 #define	rt_expire	rt_rmx.rmx_expire
+#define	rt_locks	rt_rmx.rmx_locks
+#define	rt_mtu		rt_rmx.rmx_mtu

 #define	RTF_UP		0x1		/* route usable */
 #define	RTF_GATEWAY	0x2		/* destination is a gateway */
@@ -361,7 +363,7 @@ void	 rt_sendmsg(struct rtentry *, int,
 void	 rt_sendaddrmsg(struct rtentry *, int);
 void	 rt_missmsg(int, struct rt_addrinfo *, int, struct ifnet *, int,
 	    u_int);
-int	 rt_setgate(struct rtentry *, struct sockaddr *, unsigned int);
+int	 rt_setgate(struct rtentry *, struct sockaddr *);
 int	 rt_checkgate(struct ifnet *, struct rtentry *, struct sockaddr *,
 	    unsigned int, struct rtentry **);
 void	 rt_setmetrics(u_long, struct rt_metrics *, struct rt_kmetrics *);
Index: net/rtsock.c
===================================================================
RCS file: /cvs/src/sys/net/rtsock.c,v
retrieving revision 1.169
diff -u -p -r1.169 rtsock.c
--- net/rtsock.c	24 Aug 2015 22:11:33 -0000	1.169
+++ net/rtsock.c	25 Aug 2015 10:23:02 -0000
@@ -744,9 +744,8 @@ report:
 				    info.rti_info[RTAX_GATEWAY]->sa_len)) {
 					newgate = 1;
 				}
-				if (info.rti_info[RTAX_GATEWAY] != NULL &&
-				    (error = rt_setgate(rt, info.rti_info[RTAX_GATEWAY],
-				     tableid)))
+				if (info.rti_info[RTAX_GATEWAY] != NULL && (error =
+				    rt_setgate(rt, info.rti_info[RTAX_GATEWAY])))
 					goto flush;
 				/*
 				 * new gateway could require new ifaddr, ifp;