On 1/30/21 10:29 PM, Alex Elder wrote: > On 1/30/21 9:25 AM, Willem de Bruijn wrote: >> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <el...@linaro.org> wrote: >>> >>> The channel stop and suspend paths both call __gsi_channel_stop(), >>> which quiesces channel activity, disables NAPI, and (on other than >>> SDM845) stops the channel. Similarly, the start and resume paths >>> share __gsi_channel_start(), which starts the channel and re-enables >>> NAPI again. >>> >>> Disabling NAPI should be done when stopping a channel, but this >>> should *not* be done when suspending. It's not necessary in the >>> suspend path anyway, because the stopped channel (or suspended >>> endpoint on SDM845) will not cause interrupts to schedule NAPI, >>> and gsi_channel_trans_quiesce() won't return until there are no >>> more transactions to process in the NAPI polling loop. >> >> But why is it incorrect to do so? > > Maybe it's not; I also thought it was fine before, but... > > Someone at Qualcomm asked me why I thought NAPI needed > to be disabled on suspend. My response was basically > that it was a lightweight operation, and it shouldn't > really be a problem to do so. > > Then, when I posted two patches last month, Jakub's > response told me he didn't understand why I was doing > what I was doing, and I stepped back to reconsider > the details of what was happening at suspend time. > > https://lore.kernel.org/netdev/20210107183803.47308...@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
I should have mentioned that *this* response from Jakub to a question I had also led to my conclusion that NAPI should not be disabled on suspend--at least for IPA. https://lore.kernel.org/netdev/20210105122328.3e556...@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ The channel is *not* reset on suspend for IPA. -Alex > Four things were happening to suspend a channel: > quiesce activity; disable interrupt; disable NAPI; > and stop the channel. It occurred to me that a > stopped channel would not generate interrupts, so if > the channel was stopped earlier there would be no need > to disable the interrupt. Similarly there would be > (essentially) no need to disable NAPI once a channel > was stopped. > > Underlying all of this is that I started chasing a > hang that was occurring on suspend over a month ago. > It was hard to reproduce (hundreds or thousands of > suspend/resume cycles without hitting it), and one > of the few times I actually hit the problem it was > stuck in napi_disable(), apparently waiting for > NAPI_STATE_SCHED to get cleared by napi_complete(). > > My best guess about how this could occur was if there > were a race of some kind between the interrupt handler > (scheduling NAPI) and the poll function (completing > it). I found a number of problems while looking > at this, and in the past few weeks I've posted some > fixes to improve things. Still, even with some of > these fixes in place we have seen a hang (but now > even more rarely). > > So this grand rework of suspending/stopping channels > is an attempt to resolve this hang on suspend. > > The channel is now stopped early, and once stopped, > everything that completed prior to the channel being > stopped is polled before considering the suspend > function done. A stopped channel won't interrupt, > so we don't bother disabling the completion interrupt, > with no interrupts, NAPI won't be scheduled, so there's > no need to disable NAPI either. > > The net result is simpler, and seems logical, and > should preclude any possible race between the interrupt > handler and poll function. I'm trying to solve the > hang problem analytically, because it takes *so* long > to reproduce. > > I'm open to other suggestions. > > -Alex > >> From a quick look, virtio-net disables on both remove and freeze, for >> instance. >> >>> Instead, enable NAPI in gsi_channel_start(), when the completion >>> interrupt is first enabled. Disable it again in gsi_channel_stop(), >>> when finally disabling the interrupt. >>> >>> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure >>> NAPI polling is done before moving on. >>> >>> Signed-off-by: Alex Elder <el...@linaro.org> >>> --- >> = >>> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id) >>> struct gsi_channel *channel = &gsi->channel[channel_id]; >>> int ret; >>> >>> - /* Enable the completion interrupt */ >>> + /* Enable NAPI and the completion interrupt */ >>> + napi_enable(&channel->napi); >>> gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id); >>> >>> ret = __gsi_channel_start(channel, true); >>> - if (ret) >>> - gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id); >>> + if (!ret) >>> + return 0; >>> + >>> + gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id); >>> + napi_disable(&channel->napi); >>> >>> return ret; >>> } >> >> subjective, but easier to parse when the normal control flow is linear >> and the error path takes a branch (or goto, if reused). >> >