this tweaks the guts of if_start so it guarantees that there's only
ever one call to ifp->if_start running in the system at a time.
previously this was implicit because it could only be called with
the KERNEL_LOCK held.

as we move forward it would be nice to run the queue without having
to take the biglock. however, because we also want to dequeue the
packets in order, it only makes sense to run a single instance of
the function in the whole system.

also, if a driver is recovering from an oactive situation (ie, it's
been able to free space on the tx ring) it should be able to start
tx again from an mpsafe interrupt context.

because most of our drivers assume that theyre run under the
KERNEL_LOCK, this diff uses a flag for the internals of the if_start
call to differentiate between them. it defaults for kernel locked,
but drivers can opt in to an mpsafe version that can call ifp->if_start
without the mplock held.

the kernel locked code takes KERNEL_LOCK and splnet before calling
ifp->if_start.

the mpsafe code uses the serialisation mechanism that the scsi
midlayer and pool runqueue use, but implemented with atomics instead
of operations under a mutex.

the semantic is that work will be queued onto a list protected by
a mutex (ie, the guts of struct ifqueue), and then a cpu will try
to enter a critical section that runs a function to service the
queued work. the cpu that enters the critical section has to dequeue
work in a loop, which is what all our drivers do.

if another cpu tries to enter the same critical section after
queueing more work, it will return immediately rather than spin on
the lock. the first cpu that is currently dequeueing work in the
critical section will be told to spin again to guarantee that it
will service the work the other cpu added.

so the network stack may be transmitting packets on cpu1, while an
interrupts on cpu0 occurs which frees up tx descriprots. if cpu0
calls if_start, it will return immediately because cpu1 will end
up doing the work it wanted to do anyway.

if the start routine can run on multiple cpus, then it becomes
necessary to know it is NOT running anymore when tearing a nic down.
to that end i have added an if_start_barrier function. an mpsafe
driver can call that when it's being brought down to guarantee that
another cpu isnt fiddling with the tx ring before freeing it.

a driver opts in to the mpsafe if_start call by doing the following:

1. set ifp->if_xflags = IFXF_MPSAFE.
2. calling if_start() instead of its own start routine (eg, myx_start).
3. clearing IFF_RUNNING before calling if_start_barrier() on its way down.
4. only using IFQ_DEQUEUE (not ifq_deq_begin/commit/rollback)

anyway, this is the diff i have come up with after playing with
several ideas. it removes the IFXF_TXREADY semantics, ie, tx
mitigation and reuses the flag bit for IFXF_MPSAFE.

the reason for that is juggling or deferring the start routine made
if_start_barrier annoyingly complicated, and all my attmepts at it
introduced a significant performance hit or were insanely complicated.

tx mitigation only ever gave me back 5 to 10% before it was badly
tweaked, and we've made a lot of other performance improvements
since then. while im sad to see it go, id rather move forward than
dwell on it.

in the future i would like to try delegating the work to mpsafe
taskqs, but in my attempts i lost something like 30% of my tx rate
by doing that. id like to investigate that further in the future,
just not right now.

finally, the last thing to consider is lock ordering problems.
because contention on the ifq_serializer causes the second context
to return imediately (that's true even if you call if_start from
within a critical section), i think all the problems are avoided.
i am more concerned with the ifq mutex than i am with the serialiser.

anyway, here's the diff to look at. happy to discuss further.

tests would be welcome too.

Index: dev/pci/if_myx.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_myx.c,v
retrieving revision 1.88
diff -u -p -r1.88 if_myx.c
--- dev/pci/if_myx.c    25 Nov 2015 03:09:59 -0000      1.88
+++ dev/pci/if_myx.c    30 Nov 2015 06:30:52 -0000
@@ -512,6 +512,7 @@ myx_attachhook(void *arg)
 
        ifp->if_softc = sc;
        ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
+       ifp->if_xflags = IFXF_MPSAFE;
        ifp->if_ioctl = myx_ioctl;
        ifp->if_start = myx_start;
        ifp->if_watchdog = myx_watchdog;
@@ -1361,8 +1362,9 @@ myx_down(struct myx_softc *sc)
                printf("%s: failed to reset the device\n", DEVNAME(sc));
        }
 
-       CLR(ifp->if_flags, IFF_RUNNING);
        ifq_clr_oactive(&ifp->if_snd);
+       CLR(ifp->if_flags, IFF_RUNNING);
+       if_start_barrier(ifp);
 
        for (ring = 0; ring < 2; ring++) {
                struct myx_rx_ring *mrr = &sc->sc_rx_ring[ring];
@@ -1437,11 +1439,6 @@ myx_start(struct ifnet *ifp)
        u_int                           free, used;
        u_int8_t                        flags;
 
-       if (!ISSET(ifp->if_flags, IFF_RUNNING) ||
-           ifq_is_oactive(&ifp->if_snd) ||
-           IFQ_IS_EMPTY(&ifp->if_snd))
-               return;
-
        idx = sc->sc_tx_ring_prod;
 
        /* figure out space */
@@ -1589,11 +1586,10 @@ int
 myx_intr(void *arg)
 {
        struct myx_softc        *sc = (struct myx_softc *)arg;
-       struct ifnet            *ifp = &sc->sc_ac.ac_if;
        volatile struct myx_status *sts = sc->sc_sts;
        enum myx_state           state;
        bus_dmamap_t             map = sc->sc_sts_dma.mxm_map;
-       u_int32_t                data, start;
+       u_int32_t                data;
        u_int8_t                 valid = 0;
 
        state = sc->sc_state;
@@ -1639,15 +1635,12 @@ myx_intr(void *arg)
        bus_space_write_raw_region_4(sc->sc_memt, sc->sc_memh,
            sc->sc_irqclaimoff + sizeof(data), &data, sizeof(data));
 
-       start = ifq_is_oactive(&ifp->if_snd);
-
        if (sts->ms_statusupdated) {
                if (state == MYX_S_DOWN &&
                    sc->sc_linkdown != sts->ms_linkdown) {
                        sc->sc_state = MYX_S_OFF;
                        membar_producer();
                        wakeup(sts);
-                       start = 0;
                } else {
                        data = sts->ms_linkstate;
                        if (data != 0xffffffff) {
@@ -1661,13 +1654,6 @@ myx_intr(void *arg)
        bus_dmamap_sync(sc->sc_dmat, map, 0, map->dm_mapsize,
            BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
 
-       if (start) {
-               KERNEL_LOCK();
-               ifq_clr_oactive(&ifp->if_snd);
-               myx_start(ifp);
-               KERNEL_UNLOCK();
-       }
-
        return (1);
 }
 
@@ -1716,6 +1702,10 @@ myx_txeof(struct myx_softc *sc, u_int32_
 
        sc->sc_tx_ring_cons = idx;
        sc->sc_tx_cons = cons;
+
+       ifq_clr_oactive(&ifp->if_snd);
+       if (!ifq_empty(&ifp->if_snd))
+               if_start(ifp);
 }
 
 void
Index: net/if.c
===================================================================
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.413
diff -u -p -r1.413 if.c
--- net/if.c    25 Nov 2015 03:10:00 -0000      1.413
+++ net/if.c    30 Nov 2015 06:30:52 -0000
@@ -81,6 +81,7 @@
 #include <sys/sysctl.h>
 #include <sys/task.h>
 #include <sys/atomic.h>
+#include <sys/proc.h>
 
 #include <dev/rndvar.h>
 
@@ -152,6 +153,9 @@ void        if_input_process(void *);
 void   ifa_print_all(void);
 #endif
 
+void   if_start_mpsafe(struct ifnet *ifp);
+void   if_start_locked(struct ifnet *ifp);
+
 /*
  * interface index map
  *
@@ -535,30 +539,90 @@ if_attach_common(struct ifnet *ifp)
 void
 if_start(struct ifnet *ifp)
 {
+       if (ISSET(ifp->if_xflags, IFXF_MPSAFE))
+               if_start_mpsafe(ifp);
+       else
+               if_start_locked(ifp);
+}
+
+void
+if_start_locked(struct ifnet *ifp)
+{
+       int s;
+
+       KERNEL_LOCK();
+       s = splnet();
+       ifp->if_start(ifp);
+       splx(s);
+       KERNEL_UNLOCK();
+}
+
+static inline unsigned int
+ifq_enter(struct ifqueue *ifq)
+{
+       return (atomic_inc_int_nv(&ifq->ifq_serializer) == 1);
+}
+
+static inline unsigned int
+ifq_leave(struct ifqueue *ifq)
+{
+       if (atomic_cas_uint(&ifq->ifq_serializer, 1, 0) == 1)
+               return (1);
 
-       splassert(IPL_NET);
+       ifq->ifq_serializer = 1;
 
-       if (ifq_len(&ifp->if_snd) >= min(8, ifp->if_snd.ifq_maxlen) &&
-           !ifq_is_oactive(&ifp->if_snd)) {
-               if (ISSET(ifp->if_xflags, IFXF_TXREADY)) {
-                       TAILQ_REMOVE(&iftxlist, ifp, if_txlist);
-                       CLR(ifp->if_xflags, IFXF_TXREADY);
+       return (0);
+}
+
+void
+if_start_mpsafe(struct ifnet *ifp)
+{
+       struct ifqueue *ifq = &ifp->if_snd;
+
+       if (!ifq_enter(ifq))
+               return;
+
+       do {
+               if (__predict_false(!ISSET(ifp->if_flags, IFF_RUNNING))) {
+                       ifq->ifq_serializer = 0;
+                       wakeup_one(&ifq->ifq_serializer);
+                       return;
                }
+
+               if (ifq_empty(ifq) || ifq_is_oactive(ifq))
+                       continue;
+
                ifp->if_start(ifp);
+
+       } while (!ifq_leave(ifq));
+}
+
+void
+if_start_barrier(struct ifnet *ifp)
+{
+       struct sleep_state sls;
+       struct ifqueue *ifq = &ifp->if_snd;
+
+       /* this should only be called from converted drivers */
+       KASSERT(ISSET(ifp->if_xflags, IFXF_MPSAFE));
+
+       /* drivers should only call this on the way down */
+       KASSERT(!ISSET(ifp->if_flags, IFF_RUNNING));
+
+       if (ifq->ifq_serializer == 0)
                return;
-       }
 
-       if (!ISSET(ifp->if_xflags, IFXF_TXREADY)) {
-               SET(ifp->if_xflags, IFXF_TXREADY);
-               TAILQ_INSERT_TAIL(&iftxlist, ifp, if_txlist);
-               schednetisr(NETISR_TX);
-       }
+       if_start_mpsafe(ifp); /* spin the wheel to guarantee a wakeup */
+       do {
+               sleep_setup(&sls, &ifq->ifq_serializer, PWAIT, "ifbar");
+               sleep_finish(&sls, ifq->ifq_serializer != 0);
+       } while (ifq->ifq_serializer != 0);
 }
 
 int
 if_enqueue(struct ifnet *ifp, struct mbuf *m)
 {
-       int s, length, error = 0;
+       int length, error = 0;
        unsigned short mflags;
 
 #if NBRIDGE > 0
@@ -569,17 +633,13 @@ if_enqueue(struct ifnet *ifp, struct mbu
        length = m->m_pkthdr.len;
        mflags = m->m_flags;
 
-       s = splnet();
-
        /*
         * Queue message on interface, and start output if interface
         * not yet active.
         */
        IFQ_ENQUEUE(&ifp->if_snd, m, error);
-       if (error) {
-               splx(s);
+       if (error)
                return (error);
-       }
 
        ifp->if_obytes += length;
        if (mflags & M_MCAST)
@@ -587,8 +647,6 @@ if_enqueue(struct ifnet *ifp, struct mbu
 
        if_start(ifp);
 
-       splx(s);
-
        return (0);
 }
 
@@ -808,21 +866,6 @@ if_input_process(void *xmq)
 }
 
 void
-nettxintr(void)
-{
-       struct ifnet *ifp;
-       int s;
-
-       s = splnet();
-       while ((ifp = TAILQ_FIRST(&iftxlist)) != NULL) {
-               TAILQ_REMOVE(&iftxlist, ifp, if_txlist);
-               CLR(ifp->if_xflags, IFXF_TXREADY);
-               ifp->if_start(ifp);
-       }
-       splx(s);
-}
-
-void
 if_deactivate(struct ifnet *ifp)
 {
        int s;
@@ -902,8 +945,6 @@ if_detach(struct ifnet *ifp)
 
        /* Remove the interface from the list of all interfaces.  */
        TAILQ_REMOVE(&ifnet, ifp, if_list);
-       if (ISSET(ifp->if_xflags, IFXF_TXREADY))
-               TAILQ_REMOVE(&iftxlist, ifp, if_txlist);
 
        while ((ifg = TAILQ_FIRST(&ifp->if_groups)) != NULL)
                if_delgroup(ifp, ifg->ifgl_group->ifg_group);
Index: net/if.h
===================================================================
RCS file: /cvs/src/sys/net/if.h,v
retrieving revision 1.173
diff -u -p -r1.173 if.h
--- net/if.h    20 Nov 2015 12:27:42 -0000      1.173
+++ net/if.h    30 Nov 2015 06:30:52 -0000
@@ -206,14 +206,14 @@ struct if_status_description {
        (IFF_BROADCAST|IFF_POINTOPOINT|IFF_RUNNING|IFF_OACTIVE|\
            IFF_SIMPLEX|IFF_MULTICAST|IFF_ALLMULTI)
 
-#define IFXF_TXREADY           0x1             /* interface is ready to tx */
+#define IFXF_MPSAFE            0x1             /* if_start is mpsafe */
 #define        IFXF_INET6_NOPRIVACY    0x4             /* don't autoconf 
privacy */
 #define        IFXF_MPLS               0x8             /* supports MPLS */
 #define        IFXF_WOL                0x10            /* wake on lan enabled 
*/
 #define        IFXF_AUTOCONF6          0x20            /* v6 autoconf enabled 
*/
 
 #define        IFXF_CANTCHANGE \
-       (IFXF_TXREADY)
+       (IFXF_MPSAFE)
 
 /*
  * Some convenience macros used for setting ifi_baudrate.
Index: net/if_var.h
===================================================================
RCS file: /cvs/src/sys/net/if_var.h,v
retrieving revision 1.58
diff -u -p -r1.58 if_var.h
--- net/if_var.h        25 Nov 2015 03:10:00 -0000      1.58
+++ net/if_var.h        30 Nov 2015 06:30:52 -0000
@@ -134,7 +134,6 @@ struct ifnet {                              /* and the 
entries */
        void    *if_softc;              /* lower-level data for this if */
        struct  refcnt if_refcnt;
        TAILQ_ENTRY(ifnet) if_list;     /* all struct ifnets are chained */
-       TAILQ_ENTRY(ifnet) if_txlist;   /* list of ifnets ready to tx */
        TAILQ_HEAD(, ifaddr) if_addrlist; /* linked list of addresses per if */
        TAILQ_HEAD(, ifmaddr) if_maddrlist; /* list of multicast records */
        TAILQ_HEAD(, ifg_list) if_groups; /* linked list of groups per if */
@@ -369,6 +368,7 @@ extern struct ifnet_head ifnet;
 extern unsigned int lo0ifidx;
 
 void   if_start(struct ifnet *);
+void   if_start_barrier(struct ifnet *);
 int    if_enqueue_try(struct ifnet *, struct mbuf *);
 int    if_enqueue(struct ifnet *, struct mbuf *);
 void   if_input(struct ifnet *, struct mbuf_list *);
Index: net/netisr.c
===================================================================
RCS file: /cvs/src/sys/net/netisr.c,v
retrieving revision 1.7
diff -u -p -r1.7 netisr.c
--- net/netisr.c        20 Jul 2015 21:16:39 -0000      1.7
+++ net/netisr.c        30 Nov 2015 06:30:52 -0000
@@ -68,8 +68,6 @@ netintr(void *unused) /* ARGSUSED */
        if (t & (1 << NETISR_PFSYNC))
                pfsyncintr();
 #endif
-       if (t & (1 << NETISR_TX))
-               nettxintr();
 }
 
 void

Reply via email to