On Sat, Jan 30, 2021 at 11:29 PM Alex Elder <el...@linaro.org> wrote: > > On 1/30/21 9:25 AM, Willem de Bruijn wrote: > > On Fri, Jan 29, 2021 at 3:29 PM Alex Elder <el...@linaro.org> wrote: > >> > >> The channel stop and suspend paths both call __gsi_channel_stop(), > >> which quiesces channel activity, disables NAPI, and (on other than > >> SDM845) stops the channel. Similarly, the start and resume paths > >> share __gsi_channel_start(), which starts the channel and re-enables > >> NAPI again. > >> > >> Disabling NAPI should be done when stopping a channel, but this > >> should *not* be done when suspending. It's not necessary in the > >> suspend path anyway, because the stopped channel (or suspended > >> endpoint on SDM845) will not cause interrupts to schedule NAPI, > >> and gsi_channel_trans_quiesce() won't return until there are no > >> more transactions to process in the NAPI polling loop. > > > > But why is it incorrect to do so? > > Maybe it's not; I also thought it was fine before, but... > > Someone at Qualcomm asked me why I thought NAPI needed > to be disabled on suspend. My response was basically > that it was a lightweight operation, and it shouldn't > really be a problem to do so. > > Then, when I posted two patches last month, Jakub's > response told me he didn't understand why I was doing > what I was doing, and I stepped back to reconsider > the details of what was happening at suspend time. > > https://lore.kernel.org/netdev/20210107183803.47308...@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ > > Four things were happening to suspend a channel: > quiesce activity; disable interrupt; disable NAPI; > and stop the channel. It occurred to me that a > stopped channel would not generate interrupts, so if > the channel was stopped earlier there would be no need > to disable the interrupt. Similarly there would be > (essentially) no need to disable NAPI once a channel > was stopped. > > Underlying all of this is that I started chasing a > hang that was occurring on suspend over a month ago. > It was hard to reproduce (hundreds or thousands of > suspend/resume cycles without hitting it), and one > of the few times I actually hit the problem it was > stuck in napi_disable(), apparently waiting for > NAPI_STATE_SCHED to get cleared by napi_complete().
This is important information. What exactly do you mean by hang? > > My best guess about how this could occur was if there > were a race of some kind between the interrupt handler > (scheduling NAPI) and the poll function (completing > it). I found a number of problems while looking > at this, and in the past few weeks I've posted some > fixes to improve things. Still, even with some of > these fixes in place we have seen a hang (but now > even more rarely). > > So this grand rework of suspending/stopping channels > is an attempt to resolve this hang on suspend. Do you have any data that this patchset resolves the issue, or is it too hard to reproduce to say anything? > The channel is now stopped early, and once stopped, > everything that completed prior to the channel being > stopped is polled before considering the suspend > function done. Does the call to gsi_channel_trans_quiesce before gsi_channel_stop_retry leave a race where new transactions may occur until state GSI_CHANNEL_STATE_STOPPED is reached? Asking without truly knowing the details of this device. > A stopped channel won't interrupt, > so we don't bother disabling the completion interrupt, > with no interrupts, NAPI won't be scheduled, so there's > no need to disable NAPI either. That sounds plausible. But it doesn't explain why napi_disable "should *not* be done when suspending" as the commit states. Arguably, leaving that won't have much effect either way, and is in line with other drivers. Your previous patchset mentions "When stopping a channel, the IPA driver currently disables NAPI before disabling the interrupt." That would no longer be the case. > The net result is simpler, and seems logical, and > should preclude any possible race between the interrupt > handler and poll function. I'm trying to solve the > hang problem analytically, because it takes *so* long > to reproduce. > > I'm open to other suggestions. > > -Alex > > > From a quick look, virtio-net disables on both remove and freeze, for > > instance. > > > >> Instead, enable NAPI in gsi_channel_start(), when the completion > >> interrupt is first enabled. Disable it again in gsi_channel_stop(), > >> when finally disabling the interrupt. > >> > >> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure > >> NAPI polling is done before moving on. > >> > >> Signed-off-by: Alex Elder <el...@linaro.org> > >> --- > > = > >> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 > >> channel_id) > >> struct gsi_channel *channel = &gsi->channel[channel_id]; > >> int ret; > >> > >> - /* Enable the completion interrupt */ > >> + /* Enable NAPI and the completion interrupt */ > >> + napi_enable(&channel->napi); > >> gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id); > >> > >> ret = __gsi_channel_start(channel, true); > >> - if (ret) > >> - gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id); > >> + if (!ret) > >> + return 0; > >> + > >> + gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id); > >> + napi_disable(&channel->napi); > >> > >> return ret; > >> } > > > > subjective, but easier to parse when the normal control flow is linear > > and the error path takes a branch (or goto, if reused). > > >