I'm currently working on the routing table interface to make it safe
to use by multiple CPUs at the same time.  The diff below is a big
step in this direction and I'd really appreciate it if people could test
it with their usual network setup and report back.


The goal of this diff is to "cache" the route corresponding to "your"
next hop as early as possible.  Let's assume you're using a common
dhcp-based network:

mpi@goiaba $ netstat -rnf inet|egrep "(default|Dest)"
Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
default            192.168.0.1        UGS        5      508     -     8 em0

Here my default route points to a gateway (G) whose address is
192.168.0.1.  In such a setup your computer generally sends most of its
packets to the internet through this gateway.  But to do that it needs
more information:

mpi@goiaba $ netstat -rnf inet|egrep "(192.168.0.1.*L|Dest)"
Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
192.168.0.1        bc:05:43:bd:3e:29  UHLc       1      149     -     8 em0

Yes, this is another route.  This one contains link-layer information
(L) and has been cloned (c).  This route is what I described before as
"your" next hop.  In this case, "your" is a shortcut for "the next hop
of your default route" but all of this is valid for any route pointing
to a gateway (G).

In order to send packets via my default route, the kernel needs to know
the link-layer address corresponding to the IP address of the gateway.
This is called "Address Resolution" in network jargon.  In OpenBSD,
resolved addresses appear in the routing table with a link-layer address
in the "Gateway" field, as shown previously. 
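
To make the relation between the two entries concrete, here is a minimal
user-space sketch, not kernel code.  The names toy_route, toy_find() and
toy_lookup_lladdr() and the flag values are made up for illustration;
only the addresses and the flag letters come from the netstat output
above.  The lookup is an exact string match, whereas the real table does
a best-match lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy flags named after the letters netstat prints (values are made up). */
#define TOY_UP		0x01	/* U */
#define TOY_GATEWAY	0x02	/* G */
#define TOY_HOST	0x04	/* H */
#define TOY_LLINFO	0x08	/* L */
#define TOY_CLONED	0x10	/* c */

struct toy_route {
	const char	*dst;		/* destination, as text */
	const char	*gateway;	/* gateway IP or link-layer address */
	int		 flags;
};

/* The two entries shown by netstat above (the S flag is left out). */
static const struct toy_route toy_table[] = {
	{ "default",     "192.168.0.1",       TOY_UP|TOY_GATEWAY },
	{ "192.168.0.1", "bc:05:43:bd:3e:29",
	    TOY_UP|TOY_HOST|TOY_LLINFO|TOY_CLONED },
};

/* Exact-match lookup; a stand-in for the real best-match lookup. */
const struct toy_route *
toy_find(const char *dst)
{
	size_t i;

	assert(dst != NULL);
	for (i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++)
		if (strcmp(toy_table[i].dst, dst) == 0)
			return (&toy_table[i]);
	return (NULL);
}

/*
 * Return the link-layer address used to reach ``dst'', or NULL if
 * there is no entry or the next hop has not been resolved.
 */
const char *
toy_lookup_lladdr(const char *dst)
{
	const struct toy_route *rt = toy_find(dst);

	if (rt == NULL)
		return (NULL);
	/* A gateway route needs a second lookup, for the gateway itself. */
	if (rt->flags & TOY_GATEWAY)
		rt = toy_find(rt->gateway);
	if (rt == NULL || (rt->flags & TOY_LLINFO) == 0)
		return (NULL);
	return (rt->gateway);
}
```

Calling toy_lookup_lladdr("default") walks from the default route to the
cloned entry for 192.168.0.1 and returns its link-layer address.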

This resolution is done in the kernel by calling rtalloc(9) with the
RT_RESOLVE flag for the wanted destination, in my case 192.168.0.1.
Once the resolution is complete, a corresponding entry appears in the
routing table and there's no need to redo it for a certain period of
time.  That is what I meant by "cache".

Currently this resolution is done "late" in the journey of a packet and
that's fine since it is not done often.  Late means that it is done when
the packet reaches an L2 output function, nd6_output() or arpresolve().

The problem is that having a proper reference count on route entries in
these functions is really complicated because you can end up using 3
different routes.  So this diff starts the resolution early: as soon as
a gateway route is returned by rtalloc(9).

It also makes sense to do the resolution as soon as possible since we
need the link-layer address to send the packet.
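
As a rough illustration of "early" resolution, here is a simplified
user-space model of the control flow of the new rtalloc() in the diff
below.  Everything prefixed with toy_ is hypothetical.  The second table
lookup is replaced by a ``candidate'' argument standing for whatever a
lookup of the gateway address would return, and the interface and MTU
checks, as well as the caller's own reference on ``rt'', are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_UP		0x1
#define TOY_GATEWAY	0x2

struct toy_rtentry {
	int			 flags;
	int			 refcnt;
	struct toy_rtentry	*gwroute;	/* cached next hop or NULL */
};

/* Drop one reference; accepts NULL like the rtfree(9) in the diff. */
void
toy_rtfree(struct toy_rtentry *rt)
{
	if (rt == NULL)
		return;
	assert(rt->refcnt > 0);
	rt->refcnt--;
}

/*
 * ``rt'' is the best match for the destination, ``candidate'' is what
 * a lookup of rt's gateway address would return (may be NULL).
 */
struct toy_rtentry *
toy_rtalloc(struct toy_rtentry *rt, struct toy_rtentry *candidate)
{
	struct toy_rtentry *nhrt;

	/* No match or no gateway involved: nothing to cache. */
	if (rt == NULL || (rt->flags & TOY_GATEWAY) == 0)
		return (rt);

	/* The cached next hop is still valid, reuse it. */
	nhrt = rt->gwroute;
	if (nhrt != NULL && (nhrt->flags & TOY_UP))
		return (rt);

	/* Release the stale cache entry, if any. */
	toy_rtfree(rt->gwroute);
	rt->gwroute = NULL;

	/* "Second lookup": it returns a referenced entry or NULL. */
	nhrt = candidate;
	if (nhrt == NULL)
		return (rt);
	nhrt->refcnt++;

	/* The next hop must be up and must not be a gateway route. */
	if ((nhrt->flags & (TOY_UP|TOY_GATEWAY)) != TOY_UP) {
		toy_rtfree(nhrt);
		return (rt);
	}

	rt->gwroute = nhrt;
	return (rt);
}
```

In this model a second call with a still-valid cached next hop returns
immediately and takes no extra reference.  A new resolution only happens
once the cached entry is no longer up.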

One important point: gateway routes (rt_gwroute) are only returned to
the stack in L2 functions and when that happens, their reference
counter is not incremented.  That's why the reference count for such
routes is almost always 1.  They are the simplest example of working
route caching in our kernel*.  That means that when you purge your
cloned route, rt_gwroute will still be valid but marked as RTP_DOWN
until a new resolution is started.

This diff changes rt_checkgate() to only do sanity checks (finally!).
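
To show what "only sanity checks" means, here is a small user-space
sketch of a checker that never performs a lookup.  The toy_ names are
hypothetical and the interface check is left out.  Unlike the version in
the diff, this sketch does not release a cached next hop that is down,
and since the diff is cut off after the "Next hop must be reachable"
comment, the last check is my assumption based on that comment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_UP		0x1
#define TOY_GATEWAY	0x2

struct toy_rtentry {
	int			 flags;
	struct toy_rtentry	*gwroute;	/* cached next hop or NULL */
};

/*
 * Checks only, no lookup.  On success store the entry to use for L2
 * output in ``*nhp'' and return 0, otherwise return EHOSTUNREACH.
 */
int
toy_checkgate(struct toy_rtentry *rt, struct toy_rtentry **nhp)
{
	assert(rt != NULL && nhp != NULL);

	if ((rt->flags & TOY_UP) == 0)
		return (EHOSTUNREACH);

	if (rt->flags & TOY_GATEWAY) {
		/* The next hop must already have been cached. */
		if (rt->gwroute == NULL)
			return (EHOSTUNREACH);

		/* It must be up and must not be a gateway route itself. */
		if ((rt->gwroute->flags & (TOY_UP|TOY_GATEWAY)) != TOY_UP)
			return (EHOSTUNREACH);

		rt = rt->gwroute;
	}

	*nhp = rt;
	return (0);
}
```

The point of the design is that a missing or stale next hop becomes an
error here instead of triggering a new resolution deep in the output
path.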

Do not hesitate to ask questions if something is not clear.  I believe
it's important that more people understand this.

Note that this diff includes other bits to be committed separately:

  - Deprecate the use of RTF_XRESOLVE in rtalloc(9)
  - Remove PF_KEY-specific code & comments now that SPD lookups no
    longer use rtalloc(9).
  - Make rtfree(9) accept NULL


* That's why I'm slowly killing "struct route" & friends to use the
  simplest route caching mechanism everywhere.

Index: net/route.c
===================================================================
RCS file: /cvs/src/sys/net/route.c,v
retrieving revision 1.217
diff -u -p -r1.217 route.c
--- net/route.c 18 Jul 2015 15:51:16 -0000      1.217
+++ net/route.c 12 Aug 2015 13:54:56 -0000
@@ -153,6 +153,7 @@ int rtable_alloc(void ***, u_int);
 int    rtflushclone1(struct rtentry *, void *, u_int);
 void   rtflushclone(unsigned int, struct rtentry *);
 int    rt_if_remove_rtdelete(struct rtentry *, void *, u_int);
+struct rtentry *rt_match(struct sockaddr *, int, unsigned int);
 
 struct ifaddr *ifa_ifwithroute(int, struct sockaddr *, struct sockaddr *,
                    u_int);
@@ -297,17 +298,31 @@ rtable_exists(u_int id)   /* verify table 
        return (1);
 }
 
+/*
+ * Do the actual lookup for rtalloc(9), do not use directly!
+ *
+ * Return the best matching entry for the destination ``dst''.
+ *
+ * "RT_RESOLVE" means that a corresponding L2 entry should
+ *   be added to the routing table and resolved (via ARP or
+ *   NDP), if it does not exist.
+ *
+ * "RT_REPORT" indicates that a message should be sent to
+ *   userland if no matching route has been found or if an
+ *   error occured while adding a L2 entry.
+ */
 struct rtentry *
-rtalloc(struct sockaddr *dst, int flags, unsigned int tableid)
+rt_match(struct sockaddr *dst, int flags, unsigned int tableid)
 {
        struct rtentry          *rt;
        struct rtentry          *newrt = 0;
        struct rt_addrinfo       info;
-       int                      s = splsoftnet(), err = 0, msgtype = RTM_MISS;
+       int                      s, err = 0;
 
        bzero(&info, sizeof(info));
        info.rti_info[RTAX_DST] = dst;
 
+       s = splsoftnet();
        rt = rtable_match(tableid, dst);
        if (rt != NULL) {
                newrt = rt;
@@ -319,28 +334,15 @@ rtalloc(struct sockaddr *dst, int flags,
                                rt->rt_refcnt++;
                                goto miss;
                        }
-                       if ((rt = newrt) && (rt->rt_flags & RTF_XRESOLVE)) {
-                               msgtype = RTM_RESOLVE;
-                               goto miss;
-                       }
                        /* Inform listeners of the new route */
                        rt_sendmsg(rt, RTM_ADD, tableid);
                } else
                        rt->rt_refcnt++;
        } else {
-               if (dst->sa_family != PF_KEY)
-                       rtstat.rts_unreach++;
-       /*
-        * IP encapsulation does lots of lookups where we don't need nor want
-        * the RTM_MISSes that would be generated.  It causes RTM_MISS storms
-        * sent upward breaking user-level routing queries.
-        */
+               rtstat.rts_unreach++;
 miss:
-               if (ISSET(flags, RT_REPORT) && dst->sa_family != PF_KEY) {
-                       bzero((caddr_t)&info, sizeof(info));
-                       info.rti_info[RTAX_DST] = dst;
-                       rt_missmsg(msgtype, &info, 0, NULL, err, tableid);
-               }
+               if (ISSET(flags, RT_REPORT))
+                       rt_missmsg(RTM_MISS, &info, 0, NULL, err, tableid);
        }
        splx(s);
        return (newrt);
@@ -374,12 +376,81 @@ rtalloc_mpath(struct sockaddr *dst, uint
 }
 #endif /* SMALL_KERNEL */
 
+/*
+ * Look in the routing table for the best matching entry for
+ * ``dst''.
+ *
+ * If a route with a gateway is found and its next hop is no
+ * longer valid, try to cache it.
+ */
+struct rtentry *
+rtalloc(struct sockaddr *dst, int flags, unsigned int rtableid)
+{
+       struct rtentry *rt, *nhrt;
+
+       rt = rt_match(dst, flags, rtableid);
+
+       /* No match or route to host?  We're done. */
+       if (rt == NULL || (rt->rt_flags & RTF_GATEWAY) == 0)
+               return (rt);
+
+       nhrt = rt->rt_gwroute;
+
+       /*  Nothing to do if the next hop is valid. */
+       if (nhrt != NULL && (nhrt->rt_flags & RTF_UP))
+               return (rt);
+
+       rtfree(rt->rt_gwroute);
+       rt->rt_gwroute = NULL;
+
+       /*
+        * If we cannot find a valid next hop, return the route
+        * with a gateway.
+        * Some dragons hiding in the tree certainly depends on
+        * this behavior.
+        */
+       nhrt = rt_match(rt->rt_gateway, flags, rtableid);
+       if (nhrt == NULL)
+               return (rt);
+
+       /*
+        * Next hop must be reachable, this also prevents rtentry
+        * loops for example when rt->rt_gwroute points to rt.
+        */
+       if ((nhrt->rt_flags & (RTF_UP|RTF_GATEWAY)) != RTF_UP) {
+               rtfree(nhrt);
+               return (rt);
+       }
+
+       /*
+        * Next hop entry MUST be on the same interface.
+        *
+        * XXX We could use a KASSERT() here if routes with dangling
+        * ``ifa'' pointers were dropped.
+        */
+       if (nhrt->rt_ifp != rt->rt_ifp) {
+               rtfree(nhrt);
+               return (rt);
+       }
+
+       /*
+        * If the MTU of next hop is 0, this will reset the MTU of the
+        * route to run PMTUD again from scratch.
+        */
+       if (!ISSET(rt->rt_locks, RTV_MTU) && (rt->rt_mtu > nhrt->rt_mtu))
+               rt->rt_mtu = nhrt->rt_mtu;
+
+       rt->rt_gwroute = nhrt;
+       return (rt);
+}
+
 void
 rtfree(struct rtentry *rt)
 {
        struct ifaddr   *ifa;
 
-       KASSERT(rt != NULL);
+       if (rt == NULL)
+               return;
 
        rt->rt_refcnt--;
 
@@ -526,7 +597,7 @@ create:
                        rt->rt_flags |= RTF_MODIFIED;
                        flags |= RTF_MODIFIED;
                        stat = &rtstat.rts_newgateway;
-                       rt_setgate(rt, gateway, rdomain);
+                       rt_setgate(rt, gateway);
                }
        } else
                error = EHOSTUNREACH;
@@ -983,8 +1054,7 @@ rtrequest1(int req, struct rt_addrinfo *
                 * the routing table because the radix MPATH code use
                 * it to (re)order routes.
                 */
-               if ((error = rt_setgate(rt, info->rti_info[RTAX_GATEWAY],
-                   tableid))) {
+               if ((error = rt_setgate(rt, info->rti_info[RTAX_GATEWAY]))) {
                        free(ndst, M_RTABLE, dlen);
                        pool_put(&rtentry_pool, rt);
                        return (error);
@@ -1035,7 +1105,7 @@ rtrequest1(int req, struct rt_addrinfo *
 }
 
 int
-rt_setgate(struct rtentry *rt, struct sockaddr *gate, unsigned int tableid)
+rt_setgate(struct rtentry *rt, struct sockaddr *gate)
 {
        int glen = ROUNDUP(gate->sa_len);
        struct sockaddr *sa;
@@ -1053,22 +1123,7 @@ rt_setgate(struct rtentry *rt, struct so
                rtfree(rt->rt_gwroute);
                rt->rt_gwroute = NULL;
        }
-       if (rt->rt_flags & RTF_GATEWAY) {
-               /* XXX is this actually valid to cross tables here? */
-               rt->rt_gwroute = rtalloc(gate, RT_REPORT|RT_RESOLVE, tableid);
-               /*
-                * If we switched gateways, grab the MTU from the new
-                * gateway route if the current MTU is 0 or greater
-                * than the MTU of gateway.
-                * Note that, if the MTU of gateway is 0, we will reset the
-                * MTU of the route to run PMTUD again from scratch. XXX
-                */
-               if (rt->rt_gwroute && !(rt->rt_rmx.rmx_locks & RTV_MTU) &&
-                   rt->rt_rmx.rmx_mtu &&
-                   rt->rt_rmx.rmx_mtu > rt->rt_gwroute->rt_rmx.rmx_mtu) {
-                       rt->rt_rmx.rmx_mtu = rt->rt_gwroute->rt_rmx.rmx_mtu;
-               }
-       }
+
        return (0);
 }
 
@@ -1080,28 +1135,21 @@ rt_checkgate(struct ifnet *ifp, struct r
 
        KASSERT(rt != NULL);
 
-       if ((rt->rt_flags & RTF_UP) == 0) {
-               rt = rtalloc(dst, RT_REPORT|RT_RESOLVE, rtableid);
-               if (rt == NULL)
-                       return (EHOSTUNREACH);
-               rt->rt_refcnt--;
-               if (rt->rt_ifp != ifp)
-                       return (EHOSTUNREACH);
-       }
+       if ((rt->rt_flags & RTF_UP) == 0)
+               return (EHOSTUNREACH);
 
        rt0 = rt;
 
        if (rt->rt_flags & RTF_GATEWAY) {
-               if (rt->rt_gwroute && !(rt->rt_gwroute->rt_flags & RTF_UP)) {
+               if (rt->rt_gwroute == NULL)
+                       return (EHOSTUNREACH);
+
+               if ((rt->rt_gwroute->rt_flags & RTF_UP) == 0) {
                        rtfree(rt->rt_gwroute);
                        rt->rt_gwroute = NULL;
+                       return (EHOSTUNREACH);
                }
-               if (rt->rt_gwroute == NULL) {
-                       rt->rt_gwroute = rtalloc(rt->rt_gateway,
-                           RT_REPORT|RT_RESOLVE, rtableid);
-                       if (rt->rt_gwroute == NULL)
-                               return (EHOSTUNREACH);
-               }
+
                /*
                 * Next hop must be reachable, this also prevents rtentry
                 * loops, for example when rt->rt_gwroute points to rt.
Index: net/route.h
===================================================================
RCS file: /cvs/src/sys/net/route.h,v
retrieving revision 1.109
diff -u -p -r1.109 route.h
--- net/route.h 18 Jul 2015 15:51:16 -0000      1.109
+++ net/route.h 12 Aug 2015 12:20:45 -0000
@@ -111,6 +111,8 @@ struct rtentry {
 };
 #define        rt_use          rt_rmx.rmx_pksent
 #define        rt_expire       rt_rmx.rmx_expire
+#define        rt_locks        rt_rmx.rmx_locks
+#define        rt_mtu          rt_rmx.rmx_mtu
 
 #define        RTF_UP          0x1             /* route usable */
 #define        RTF_GATEWAY     0x2             /* destination is a gateway */
@@ -354,7 +356,7 @@ void         rt_sendmsg(struct rtentry *, int, 
 void    rt_sendaddrmsg(struct rtentry *, int);
 void    rt_missmsg(int, struct rt_addrinfo *, int, struct ifnet *, int,
            u_int);
-int     rt_setgate(struct rtentry *, struct sockaddr *, unsigned int);
+int     rt_setgate(struct rtentry *, struct sockaddr *);
 int     rt_checkgate(struct ifnet *, struct rtentry *, struct sockaddr *,
            unsigned int, struct rtentry **);
 void    rt_setmetrics(u_long, struct rt_metrics *, struct rt_kmetrics *);
Index: net/rtsock.c
===================================================================
RCS file: /cvs/src/sys/net/rtsock.c,v
retrieving revision 1.166
diff -u -p -r1.166 rtsock.c
--- net/rtsock.c        18 Jul 2015 21:58:06 -0000      1.166
+++ net/rtsock.c        12 Aug 2015 12:20:45 -0000
@@ -748,9 +748,8 @@ report:
                                    info.rti_info[RTAX_GATEWAY]->sa_len)) {
                                        newgate = 1;
                                }
-                       if (info.rti_info[RTAX_GATEWAY] != NULL &&
-                           (error = rt_setgate(rt, info.rti_info[RTAX_GATEWAY],
-                            tableid)))
+                       if (info.rti_info[RTAX_GATEWAY] != NULL && (error =
+                           rt_setgate(rt, info.rti_info[RTAX_GATEWAY])))
                                goto flush;
                        /*
                         * new gateway could require new ifaddr, ifp;
