On 11/07/15(Sat) 15:16, Stefan Sperling wrote: > On Sat, Jul 11, 2015 at 09:54:34AM +0200, Stefan Sperling wrote: > > While run_task() iterates through the host command queue during a scan, > > the interface may get reset via run_stop() and run_init() when switching > > bands (2GHz -> 5GHz) in run_media_change(). > > > > The list state gets corrupted because run_init() resets the list counters > > while run_task() is still iterating the list. run_task() will now attempt > > to call garbage or NULL callback pointers. > > > > run_newstate_cb SCAN -> INIT > > run_newstate_cb SCAN -> SCAN
How is this possible? Why it isn't INIT -> SCAN? > > run_newstate_cb SCAN -> SCAN > > run_task ring->next=7 > > run_task ring->cur=0 > > run_task ring->queued=-7 <-- negative nonesense > > run_task cmd=0xffff800000651110 > > run_task cmd->cb=0x0 > > > > The kernel now explodes trying to call cmd->cb. > > > > This can be reproduced by running 'ifconfig run0 mediaopt monitor up'. > > My previous duff cured a symptom. > I believe this new diff closes the actual race. I'm really not found of either diffs > State transitions can already be waiting to be scheduled while run_stop() > is scheduling another transition trying to bring the device back to INIT. > Each state transition happens asynchronously in a process context. ^^^^^^^^^^^^^^ I guess it happens only with USB drivers and the net80211 stack is not suppose to work this way, right? > So if multiple transitions are scheduled they can happen out of order. I can think of two cases when such thing can happen: when a state transition schedule another one or when a timeout fires. Do you know what happens in the run(4) case? > When injecting a state transition "from the side" to force the device > into a particular state, like run_stop() does, we need to make sure the > net80211 stack is not concurrently trying to transition the driver for > other purposes, such as the scanning loop. What are the contexts of "from the side" and "net80211 stack"? > This issue probably affects a number of other wifi drivers as well. > If this fix is good to go I'll do a tree-wide sweep soon. I'd like to understand what you're fix is fixing 8) Are we chasing a problem of "ic_state" not being synchronized with the reality because some code in interrupt context scheduled a state transition or are we forcing multiple state transitions to not interleave an INIT change? In any case, if we want to go MP, we should have a better understanding of these state transitions :) > Please disregard all the other hackish patches I mailed out this morning. > > Index: if_run.c > =================================================================== > RCS file: /cvs/src/sys/dev/usb/if_run.c,v > retrieving revision 1.109 > diff -u -p -r1.109 if_run.c > --- if_run.c 12 Jun 2015 15:47:31 -0000 1.109 > +++ if_run.c 11 Jul 2015 13:00:21 -0000 > @@ -1751,6 +1751,11 @@ run_do_async(struct run_softc *sc, void > return; > > s = splusb(); > + if (ring->queued == RUN_HOST_CMD_RING_COUNT) { > + splx(s); > + printf("%s: host cmd queue overrun\n", sc->sc_dev.dv_xname); > + return; /* XXX */ > + } > cmd = &ring->cmd[ring->cur]; > cmd->cb = cb; > KASSERT(len <= sizeof (cmd->data)); > @@ -4504,7 +4509,9 @@ run_init(struct ifnet *ifp) > } > > /* init host command ring */ > - sc->cmdq.cur = sc->cmdq.next = sc->cmdq.queued = 0; > + if (sc->cmdq.queued != 0) > + panic("outstanding host commands queued"); > + sc->cmdq.cur = sc->cmdq.next = 0; > > /* init Tx rings (4 EDCAs) */ > for (qid = 0; qid < 4; qid++) { > @@ -4739,9 +4746,17 @@ run_stop(struct ifnet *ifp, int disable) > timeout_del(&sc->calib_to); > > s = splusb(); > + > + /* > + * Wait for all queued asynchronous commands to complete > + * before switching to INIT state. > + */ > + usb_wait_task(sc->sc_udev, &sc->sc_task); > + > ieee80211_new_state(ic, IEEE80211_S_INIT, -1); > - /* wait for all queued asynchronous commands to complete */ > + /* Wait for asynchronous state transition to complete. */ > usb_wait_task(sc->sc_udev, &sc->sc_task); > + > splx(s); > > /* Disable Tx/Rx DMA. */ >