On Tue, May 06, 2025 at 10:08:05AM -0400, Connor Abbott wrote: > On Tue, May 6, 2025 at 8:24 AM Will Deacon <[email protected]> wrote: > > On Wed, Mar 19, 2025 at 10:44:02AM -0400, Connor Abbott wrote: > > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c > > > b/drivers/iommu/arm/arm-smmu/arm-smmu.c > > > index > > > c7b5d7c093e71050d29a834c8d33125e96b04d81..9927f3431a2eab913750e6079edc6393d1938c98 > > > 100644 > > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c > > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c > > > @@ -470,13 +470,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, > > > void *dev) > > > if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT)) > > > return IRQ_NONE; > > > > > > + /* > > > + * On some implementations FSR.SS asserts a context fault > > > + * interrupt. We do not want this behavior, because resolving the > > > + * original context fault typically requires operations that cannot > > > be > > > + * performed in IRQ context but leaving the stall unacknowledged > > > will > > > + * immediately lead to another spurious interrupt as FSR.SS is still > > > + * set. Work around this by disabling interrupts for this context > > > bank. > > > + * It's expected that interrupts are re-enabled after resuming the > > > + * translation. > > > > s/translation/transaction/ > > > > > + * > > > + * We have to do this before report_iommu_fault() so that we don't > > > + * leave interrupts disabled in case the downstream user decides the > > > + * fault can be resolved inside its fault handler. > > > + * > > > + * There is a possible race if there are multiple context banks > > > sharing > > > + * the same interrupt and both signal an interrupt in between > > > writing > > > + * RESUME and SCTLR. We could disable interrupts here before we > > > + * re-enable them in the resume handler, leaving interrupts enabled. > > > + * Lock the write to serialize it with the resume handler. > > > + */ > > > > I'm struggling to understand this last part. If the resume handler runs > > synchronously from report_iommu_fault(), then there's no need for > > locking because we're in interrupt context. If the resume handler can > > run asynchronously from report_iommu_fault(), then the locking doesn't > > help because the code below could clear CFIE right after the resume > > handler has set it. > > The problem is indeed when the resume handler runs asynchronously. > Clearing CFIE right after the resume handler has set it is normal and > expected. The issue is the opposite, i.e. something like: > > - Resume handler writes RESUME and stalls for some reason > - The interrupt handler runs through and clears CFIE while it's already > cleared > - Resume handler sets CFIE, assuming that the handler hasn't run yet > but it actually has > > This wouldn't happen with only one context bank, because we wouldn't > get an interrupt until the resume handler sets CFIE, but with multiple > context banks and a shared interrupt line we could get a "spurious" > interrupt due to a fault in an earlier context bank that becomes not > spurious if the resume handler writes RESUME before the context fault > handler for this bank reads FSR above.
Ah, gotcha. Thanks for the explanation. If we moved the RESUME+CFIE into the interrupt handler after the call to report_iommu_fault(), would it be possible to run the handler as a threaded irq (see 'context_fault_needs_threaded_irq') and handle the callback synchronously? In that case, I think we could avoid taking the lock if we wrote CFIE _before_ RESUME. Will
