this tweaks the guts of if_start so it guarantees that there's only ever one call to ifp->if_start running in the system at a time. previously this was implicit because it could only be called with the KERNEL_LOCK held.
as we move forward it would be nice to run the send queue without having to take the biglock. however, because we also want to dequeue packets in order, it only makes sense to run a single instance of an interface's start routine at a time. also, if a driver is recovering from an oactive situation (ie, it's been able to free space on the tx ring) it should be able to start tx again from an mpsafe interrupt context.

because most of our drivers assume they're run under the KERNEL_LOCK, this diff uses a flag so the internals of the if_start call can differentiate between the two cases. it defaults to kernel locked, but drivers can opt in to an mpsafe version that calls ifp->if_start without the biglock held. the kernel locked code takes KERNEL_LOCK and splnet before calling ifp->if_start.

the mpsafe code uses the serialisation mechanism that the scsi midlayer and pool runqueue use, but implemented with atomics instead of operations under a mutex. the semantic is that work is queued onto a list protected by a mutex (ie, the guts of struct ifqueue), and then a cpu tries to enter a critical section that runs a function to service the queued work. the cpu that enters the critical section has to dequeue work in a loop, which is what all our drivers do anyway. if another cpu tries to enter the same critical section after queueing more work, it returns immediately rather than spinning on the lock. the cpu that is currently dequeueing work inside the critical section is instead told to spin again, which guarantees it will service the work the other cpu added.

so the network stack may be transmitting packets on cpu1 while an interrupt occurs on cpu0 that frees up tx descriptors. if cpu0 calls if_start, it will return immediately because cpu1 will end up doing the work it wanted to do anyway.

if the start routine can run on multiple cpus, then it becomes necessary to know it is NOT running anymore when tearing a nic down. to that end i have added an if_start_barrier function. an mpsafe driver can call that when it's being brought down to guarantee that another cpu isn't fiddling with the tx ring before freeing it.

a driver opts in to the mpsafe if_start call by doing the following (there's a condensed sketch of what this looks like just below, before the diff):

1. setting ifp->if_xflags = IFXF_MPSAFE
2. calling if_start() instead of its own start routine (eg, myx_start)
3. clearing IFF_RUNNING before calling if_start_barrier() on its way down
4. only using IFQ_DEQUEUE (not ifq_deq_begin/commit/rollback)

anyway, this is the diff i have come up with after playing with several ideas. it removes the IFXF_TXREADY semantics, ie, tx mitigation, and reuses the flag bit for IFXF_MPSAFE. the reason for that is that juggling or deferring the start routine made if_start_barrier annoyingly complicated, and all my attempts at it either introduced a significant performance hit or were insanely complicated. tx mitigation only ever gave me back 5 to 10% before it was badly tweaked, and we've made a lot of other performance improvements since then. while i'm sad to see it go, i'd rather move forward than dwell on it. in the future i would like to try delegating the work to mpsafe taskqs, but in my attempts i lost something like 30% of my tx rate by doing that. i'd like to investigate that further in the future, just not right now.

finally, the last thing to consider is lock ordering problems. because contention on the ifq_serializer causes the second context to return immediately (that's true even if you call if_start from within the critical section), i think all the problems are avoided. i am more concerned with the ifq mutex than i am with the serialiser.
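to make the opt-in steps above concrete, here's roughly what the attach, start, and down paths of a converted driver end up looking like. this is only a sketch: the example_* names, the made-up softc, and example_txring_space() are illustrative, and the real thing to read is the if_myx.c conversion in the diff below. the interrupt side is sketched after the diff.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/if_ether.h>

struct example_softc {
	struct arpcom		sc_ac;		/* interface glue, as in myx */
	/* ... tx ring state ... */
};

void	example_start(struct ifnet *);
u_int	example_txring_space(struct example_softc *);	/* made-up helper */

void
example_attach(struct example_softc *sc)
{
	struct ifnet *ifp = &sc->sc_ac.ac_if;

	ifp->if_softc = sc;
	/* 1. opt in to the mpsafe if_start path */
	ifp->if_xflags = IFXF_MPSAFE;
	ifp->if_start = example_start;

	/* ... the rest of the usual attach code ... */
}

void
example_start(struct ifnet *ifp)
{
	struct example_softc *sc = ifp->if_softc;
	struct mbuf *m;

	/*
	 * no IFF_RUNNING/oactive/empty checks needed here;
	 * if_start_mpsafe() has already done them before calling us.
	 */
	for (;;) {
		if (example_txring_space(sc) == 0) {
			ifq_set_oactive(&ifp->if_snd);
			break;
		}

		/* 4. IFQ_DEQUEUE only, no ifq_deq_begin/commit/rollback */
		IFQ_DEQUEUE(&ifp->if_snd, m);
		if (m == NULL)
			break;

		/* ... load m onto the tx ring and kick the hardware ... */
	}
}

void
example_down(struct example_softc *sc)
{
	struct ifnet *ifp = &sc->sc_ac.ac_if;

	/* ... stop the chip ... */

	ifq_clr_oactive(&ifp->if_snd);

	/*
	 * 3. clear IFF_RUNNING, then wait for any other cpu to get out
	 * of example_start() before the tx ring is torn down.
	 */
	CLR(ifp->if_flags, IFF_RUNNING);
	if_start_barrier(ifp);

	/* now the tx ring can be freed safely */
}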
anyway, here's the diff to look at. happy to discuss further. tests would be welcome too.

Index: dev/pci/if_myx.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_myx.c,v
retrieving revision 1.88
diff -u -p -r1.88 if_myx.c
--- dev/pci/if_myx.c	25 Nov 2015 03:09:59 -0000	1.88
+++ dev/pci/if_myx.c	30 Nov 2015 06:30:52 -0000
@@ -512,6 +512,7 @@ myx_attachhook(void *arg)
 
 	ifp->if_softc = sc;
 	ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
+	ifp->if_xflags = IFXF_MPSAFE;
 	ifp->if_ioctl = myx_ioctl;
 	ifp->if_start = myx_start;
 	ifp->if_watchdog = myx_watchdog;
@@ -1361,8 +1362,9 @@ myx_down(struct myx_softc *sc)
 		printf("%s: failed to reset the device\n", DEVNAME(sc));
 	}
 
-	CLR(ifp->if_flags, IFF_RUNNING);
 	ifq_clr_oactive(&ifp->if_snd);
+	CLR(ifp->if_flags, IFF_RUNNING);
+	if_start_barrier(ifp);
 
 	for (ring = 0; ring < 2; ring++) {
 		struct myx_rx_ring *mrr = &sc->sc_rx_ring[ring];
@@ -1437,11 +1439,6 @@ myx_start(struct ifnet *ifp)
 	u_int free, used;
 	u_int8_t flags;
 
-	if (!ISSET(ifp->if_flags, IFF_RUNNING) ||
-	    ifq_is_oactive(&ifp->if_snd) ||
-	    IFQ_IS_EMPTY(&ifp->if_snd))
-		return;
-
 	idx = sc->sc_tx_ring_prod;
 
 	/* figure out space */
@@ -1589,11 +1586,10 @@ int
 myx_intr(void *arg)
 {
 	struct myx_softc *sc = (struct myx_softc *)arg;
-	struct ifnet *ifp = &sc->sc_ac.ac_if;
 	volatile struct myx_status *sts = sc->sc_sts;
 	enum myx_state state;
 	bus_dmamap_t map = sc->sc_sts_dma.mxm_map;
-	u_int32_t data, start;
+	u_int32_t data;
 	u_int8_t valid = 0;
 
 	state = sc->sc_state;
@@ -1639,15 +1635,12 @@ myx_intr(void *arg)
 		bus_space_write_raw_region_4(sc->sc_memt, sc->sc_memh,
 		    sc->sc_irqclaimoff + sizeof(data), &data, sizeof(data));
 
-	start = ifq_is_oactive(&ifp->if_snd);
-
 	if (sts->ms_statusupdated) {
 		if (state == MYX_S_DOWN &&
 		    sc->sc_linkdown != sts->ms_linkdown) {
 			sc->sc_state = MYX_S_OFF;
 			membar_producer();
 			wakeup(sts);
-			start = 0;
 		} else {
 			data = sts->ms_linkstate;
 			if (data != 0xffffffff) {
@@ -1661,13 +1654,6 @@ myx_intr(void *arg)
 	bus_dmamap_sync(sc->sc_dmat, map, 0, map->dm_mapsize,
 	    BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
 
-	if (start) {
-		KERNEL_LOCK();
-		ifq_clr_oactive(&ifp->if_snd);
-		myx_start(ifp);
-		KERNEL_UNLOCK();
-	}
-
 	return (1);
 }
 
@@ -1716,6 +1702,10 @@ myx_txeof(struct myx_softc *sc, u_int32_
 
 	sc->sc_tx_ring_cons = idx;
 	sc->sc_tx_cons = cons;
+
+	ifq_clr_oactive(&ifp->if_snd);
+	if (!ifq_empty(&ifp->if_snd))
+		if_start(ifp);
 }
 
 void
Index: net/if.c
===================================================================
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.413
diff -u -p -r1.413 if.c
--- net/if.c	25 Nov 2015 03:10:00 -0000	1.413
+++ net/if.c	30 Nov 2015 06:30:52 -0000
@@ -81,6 +81,7 @@
 #include <sys/sysctl.h>
 #include <sys/task.h>
 #include <sys/atomic.h>
+#include <sys/proc.h>
 
 #include <dev/rndvar.h>
 
@@ -152,6 +153,9 @@ void	if_input_process(void *);
 void	ifa_print_all(void);
 #endif
 
+void	if_start_mpsafe(struct ifnet *ifp);
+void	if_start_locked(struct ifnet *ifp);
+
 /*
  * interface index map
  *
@@ -535,30 +539,90 @@ if_attach_common(struct ifnet *ifp)
 void
 if_start(struct ifnet *ifp)
 {
+	if (ISSET(ifp->if_xflags, IFXF_MPSAFE))
+		if_start_mpsafe(ifp);
+	else
+		if_start_locked(ifp);
+}
+
+void
+if_start_locked(struct ifnet *ifp)
+{
+	int s;
+
+	KERNEL_LOCK();
+	s = splnet();
+	ifp->if_start(ifp);
+	splx(s);
+	KERNEL_UNLOCK();
+}
+
+static inline unsigned int
+ifq_enter(struct ifqueue *ifq)
+{
+	return (atomic_inc_int_nv(&ifq->ifq_serializer) == 1);
+}
+
+static inline unsigned int
+ifq_leave(struct ifqueue *ifq)
+{
+	if (atomic_cas_uint(&ifq->ifq_serializer, 1, 0) == 1)
+		return (1);
 
-	splassert(IPL_NET);
+	ifq->ifq_serializer = 1;
 
-	if (ifq_len(&ifp->if_snd) >= min(8, ifp->if_snd.ifq_maxlen) &&
-	    !ifq_is_oactive(&ifp->if_snd)) {
-		if (ISSET(ifp->if_xflags, IFXF_TXREADY)) {
-			TAILQ_REMOVE(&iftxlist, ifp, if_txlist);
-			CLR(ifp->if_xflags, IFXF_TXREADY);
+	return (0);
+}
+
+void
+if_start_mpsafe(struct ifnet *ifp)
+{
+	struct ifqueue *ifq = &ifp->if_snd;
+
+	if (!ifq_enter(ifq))
+		return;
+
+	do {
+		if (__predict_false(!ISSET(ifp->if_flags, IFF_RUNNING))) {
+			ifq->ifq_serializer = 0;
+			wakeup_one(&ifq->ifq_serializer);
+			return;
 		}
+
+		if (ifq_empty(ifq) || ifq_is_oactive(ifq))
+			continue;
+
 		ifp->if_start(ifp);
+
+	} while (!ifq_leave(ifq));
+}
+
+void
+if_start_barrier(struct ifnet *ifp)
+{
+	struct sleep_state sls;
+	struct ifqueue *ifq = &ifp->if_snd;
+
+	/* this should only be called from converted drivers */
+	KASSERT(ISSET(ifp->if_xflags, IFXF_MPSAFE));
+
+	/* drivers should only call this on the way down */
+	KASSERT(!ISSET(ifp->if_flags, IFF_RUNNING));
+
+	if (ifq->ifq_serializer == 0)
 		return;
-	}
 
-	if (!ISSET(ifp->if_xflags, IFXF_TXREADY)) {
-		SET(ifp->if_xflags, IFXF_TXREADY);
-		TAILQ_INSERT_TAIL(&iftxlist, ifp, if_txlist);
-		schednetisr(NETISR_TX);
-	}
+	if_start_mpsafe(ifp); /* spin the wheel to guarantee a wakeup */
+	do {
+		sleep_setup(&sls, &ifq->ifq_serializer, PWAIT, "ifbar");
+		sleep_finish(&sls, ifq->ifq_serializer != 0);
+	} while (ifq->ifq_serializer != 0);
 }
 
 int
 if_enqueue(struct ifnet *ifp, struct mbuf *m)
 {
-	int s, length, error = 0;
+	int length, error = 0;
 	unsigned short mflags;
 
 #if NBRIDGE > 0
@@ -569,17 +633,13 @@ if_enqueue(struct ifnet *ifp, struct mbu
 	length = m->m_pkthdr.len;
 	mflags = m->m_flags;
 
-	s = splnet();
-
 	/*
 	 * Queue message on interface, and start output if interface
 	 * not yet active.
 	 */
 	IFQ_ENQUEUE(&ifp->if_snd, m, error);
-	if (error) {
-		splx(s);
+	if (error)
 		return (error);
-	}
 
 	ifp->if_obytes += length;
 	if (mflags & M_MCAST)
@@ -587,8 +647,6 @@ if_enqueue(struct ifnet *ifp, struct mbu
 
 	if_start(ifp);
 
-	splx(s);
-
 	return (0);
 }
 
@@ -808,21 +866,6 @@ if_input_process(void *xmq)
 }
 
 void
-nettxintr(void)
-{
-	struct ifnet *ifp;
-	int s;
-
-	s = splnet();
-	while ((ifp = TAILQ_FIRST(&iftxlist)) != NULL) {
-		TAILQ_REMOVE(&iftxlist, ifp, if_txlist);
-		CLR(ifp->if_xflags, IFXF_TXREADY);
-		ifp->if_start(ifp);
-	}
-	splx(s);
-}
-
-void
 if_deactivate(struct ifnet *ifp)
 {
 	int s;
@@ -902,8 +945,6 @@ if_detach(struct ifnet *ifp)
 
 	/* Remove the interface from the list of all interfaces.  */
 	TAILQ_REMOVE(&ifnet, ifp, if_list);
-	if (ISSET(ifp->if_xflags, IFXF_TXREADY))
-		TAILQ_REMOVE(&iftxlist, ifp, if_txlist);
 
 	while ((ifg = TAILQ_FIRST(&ifp->if_groups)) != NULL)
 		if_delgroup(ifp, ifg->ifgl_group->ifg_group);
Index: net/if.h
===================================================================
RCS file: /cvs/src/sys/net/if.h,v
retrieving revision 1.173
diff -u -p -r1.173 if.h
--- net/if.h	20 Nov 2015 12:27:42 -0000	1.173
+++ net/if.h	30 Nov 2015 06:30:52 -0000
@@ -206,14 +206,14 @@ struct if_status_description {
 	(IFF_BROADCAST|IFF_POINTOPOINT|IFF_RUNNING|IFF_OACTIVE|\
 	    IFF_SIMPLEX|IFF_MULTICAST|IFF_ALLMULTI)
 
-#define	IFXF_TXREADY		0x1		/* interface is ready to tx */
+#define	IFXF_MPSAFE		0x1		/* if_start is mpsafe */
 #define	IFXF_INET6_NOPRIVACY	0x4		/* don't autoconf privacy */
 #define	IFXF_MPLS		0x8		/* supports MPLS */
 #define	IFXF_WOL		0x10		/* wake on lan enabled */
 #define	IFXF_AUTOCONF6		0x20		/* v6 autoconf enabled */
 
 #define	IFXF_CANTCHANGE \
-	(IFXF_TXREADY)
+	(IFXF_MPSAFE)
 
 /*
  * Some convenience macros used for setting ifi_baudrate.
Index: net/if_var.h
===================================================================
RCS file: /cvs/src/sys/net/if_var.h,v
retrieving revision 1.58
diff -u -p -r1.58 if_var.h
--- net/if_var.h	25 Nov 2015 03:10:00 -0000	1.58
+++ net/if_var.h	30 Nov 2015 06:30:52 -0000
@@ -134,7 +134,6 @@ struct ifnet {				/* and the entries */
 	void	*if_softc;		/* lower-level data for this if */
 	struct	refcnt if_refcnt;
 	TAILQ_ENTRY(ifnet) if_list;	/* all struct ifnets are chained */
-	TAILQ_ENTRY(ifnet) if_txlist;	/* list of ifnets ready to tx */
 	TAILQ_HEAD(, ifaddr) if_addrlist; /* linked list of addresses per if */
 	TAILQ_HEAD(, ifmaddr) if_maddrlist; /* list of multicast records */
 	TAILQ_HEAD(, ifg_list) if_groups; /* linked list of groups per if */
@@ -369,6 +368,7 @@ extern struct ifnet_head ifnet;
 extern unsigned int lo0ifidx;
 
 void	if_start(struct ifnet *);
+void	if_start_barrier(struct ifnet *);
 int	if_enqueue_try(struct ifnet *, struct mbuf *);
 int	if_enqueue(struct ifnet *, struct mbuf *);
 void	if_input(struct ifnet *, struct mbuf_list *);
Index: net/netisr.c
===================================================================
RCS file: /cvs/src/sys/net/netisr.c,v
retrieving revision 1.7
diff -u -p -r1.7 netisr.c
--- net/netisr.c	20 Jul 2015 21:16:39 -0000	1.7
+++ net/netisr.c	30 Nov 2015 06:30:52 -0000
@@ -68,8 +68,6 @@ netintr(void *unused)			/* ARGSUSED */
 	if (t & (1 << NETISR_PFSYNC))
 		pfsyncintr();
 #endif
-	if (t & (1 << NETISR_TX))
-		nettxintr();
 }
 
 void
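for completeness, here's the interrupt side of the same made-up driver from the sketch above. it's just the myx_txeof() change restated: once tx descriptors have been reclaimed, clear oactive and kick the shared if_start() (step 2) instead of calling the driver's own start routine. if another cpu is already inside the start critical section, this call only bumps the serialiser and returns, and that cpu goes around its dequeue loop again and sees the freed space.

void
example_txeof(struct example_softc *sc)
{
	struct ifnet *ifp = &sc->sc_ac.ac_if;

	/* ... reclaim completed tx descriptors ... */

	/*
	 * 2. call the shared if_start(), not example_start() directly,
	 * so the serialiser decides which cpu actually runs the start
	 * routine. this is safe from an mpsafe interrupt handler; no
	 * KERNEL_LOCK or splnet dance is needed any more.
	 */
	ifq_clr_oactive(&ifp->if_snd);
	if (!ifq_empty(&ifp->if_snd))
		if_start(ifp);
}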