Hi Thomas and all,
A gentle ping on this patch. It has been about three weeks since submission. This patch fixes a critical soft lockup issue we observed when the ASPEED hardware becomes unresponsive. Since it uses a safe timeout mechanism without altering the existing API signatures, it should be a straightforward and low-risk mitigation. Could anyone please help review this when you have a moment, or let me know if any modifications are needed? Best regards, Mingyu At 2026-05-13 19:39:49, [email protected] wrote: >From: Mingyu Wang <[email protected]> > >While validating the driver using DevGen (a framework that synthesizes >virtual device models directly from driver source code via LLM guidance), >a severe soft lockup was observed. > >The hardware polling loops in `__ast_mindwm`, `__ast_moutdwm`, and >`ast_2500_patch_ahb` lack a timeout mechanism. On bare-metal systems, >if the ASPEED chip becomes unresponsive or a PCIe bus fault occurs, >the CPU will spin indefinitely in these loops. This results in a system >hang, triggering the watchdog soft lockup and causing subsequent I/O >starvation (e.g., blocking jbd2). > >Fix this by introducing a bounded loop with a safe timeout of >approximately 100ms using `udelay(10)`. Using `udelay()` ensures >that the fix remains safe even if these accessors are called from >an atomic context or while holding spinlocks. If the hardware fails >to respond, the loop breaks and emits a `WARN_ONCE`, allowing the >kernel to degrade gracefully and preventing complete system paralysis. > >Signed-off-by: Mingyu Wang <[email protected]> >--- >Hi Thomas, > >Thanks for the prompt response and confirmation! > >Instead of just waiting for threshold suggestions, I have drafted this >patch to address the soft lockup. Since changing the return types of >`__ast_mindwm` and `__ast_moutdwm` to propagate error codes (e.g., >`-ETIMEDOUT`) would require an intrusive refactoring across the entire >AST driver, I took a more defensive, minimal-invasive approach. > >To avoid the risk of sleeping in an atomic context (in case these >low-level I/O accessors are ever called under a spinlock), I used >a bounded loop with `udelay(10)` and a maximum of 10000 iterations >(approx. 100ms total timeout). If the ASPEED hardware completely >fails to respond, it breaks the infinite loop, emits a `WARN_ONCE`, >and prevents the CPU from halting the entire system. > >Please let me know if you think the 100ms threshold and the `udelay` >approach are appropriate for this specific AHB/SCU hardware sequence. > > drivers/gpu/drm/ast/ast_2500.c | 8 +++++++- > drivers/gpu/drm/ast/ast_post.c | 16 ++++++++++++++-- > 2 files changed, 21 insertions(+), 3 deletions(-) > >diff --git a/drivers/gpu/drm/ast/ast_2500.c b/drivers/gpu/drm/ast/ast_2500.c >index 2a52af0ded56..08d18f90201a 100644 >--- a/drivers/gpu/drm/ast/ast_2500.c >+++ b/drivers/gpu/drm/ast/ast_2500.c >@@ -107,6 +107,7 @@ static const u32 >ast2500_ddr4_1600_timing_table[REGTBL_NUM] = { > void ast_2500_patch_ahb(void __iomem *regs) > { > u32 data; >+ int retries = 10000; /* ~100ms timeout */ > > /* Clear bus lock condition */ > __ast_moutdwm(regs, 0x1e600000, 0xAEED1A03); >@@ -136,7 +137,12 @@ void ast_2500_patch_ahb(void __iomem *regs) > do { > __ast_moutdwm(regs, 0x1e6e2000, 0x1688A8A8); > data = __ast_mindwm(regs, 0x1e6e2000); >- } while (data != 1); >+ if (data == 1) >+ break; >+ udelay(10); >+ } while (--retries); >+ >+ WARN_ONCE(!retries, "ast: timeout waiting for AHB patch\n"); > > __ast_moutdwm(regs, 0x1e6e207c, 0x08000000); /* clear fast reset */ > } >diff --git a/drivers/gpu/drm/ast/ast_post.c b/drivers/gpu/drm/ast/ast_post.c >index b72914dbed38..66eb80925e27 100644 >--- a/drivers/gpu/drm/ast/ast_post.c >+++ b/drivers/gpu/drm/ast/ast_post.c >@@ -37,13 +37,19 @@ > u32 __ast_mindwm(void __iomem *regs, u32 r) > { > u32 data; >+ int retries = 10000; /* ~100ms timeout */ > > __ast_write32(regs, 0xf004, r & 0xffff0000); > __ast_write32(regs, 0xf000, 0x1); > > do { > data = __ast_read32(regs, 0xf004) & 0xffff0000; >- } while (data != (r & 0xffff0000)); >+ if (data == (r & 0xffff0000)) >+ break; >+ udelay(10); >+ } while (--retries); >+ >+ WARN_ONCE(!retries, "ast: timeout reading from AHB/SCU\n"); > > return __ast_read32(regs, 0x10000 + (r & 0x0000ffff)); > } >@@ -51,13 +57,19 @@ u32 __ast_mindwm(void __iomem *regs, u32 r) > void __ast_moutdwm(void __iomem *regs, u32 r, u32 v) > { > u32 data; >+ int retries = 10000; /* ~100ms timeout */ > > __ast_write32(regs, 0xf004, r & 0xffff0000); > __ast_write32(regs, 0xf000, 0x1); > > do { > data = __ast_read32(regs, 0xf004) & 0xffff0000; >- } while (data != (r & 0xffff0000)); >+ if (data == (r & 0xffff0000)) >+ break; >+ udelay(10); >+ } while (--retries); >+ >+ WARN_ONCE(!retries, "ast: timeout writing to AHB/SCU\n"); > > __ast_write32(regs, 0x10000 + (r & 0x0000ffff), v); > } >-- >2.34.1
