On Fri, 14 Nov 2025, Alex Bennée wrote:

> Ilpo Järvinen <[email protected]> writes:
> 
> > Hi all,
> >
> > Thanks to issue reports from Simon Richter and Alex Bennée, I
> > discovered BAR resize rollback can corrupt the resource tree. As fixing
> > corruption requires avoiding overlapping resource assignments, the
> > correct fix can unfortunately result in a worse user experience: what
> > appeared to be "working" previously might no longer do so. Thus, I had
> > to do a larger rework of pci_resize_resource() in order to properly
> > restore resource states as they were prior to BAR resize.
> <snip>
> >
> > base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
> 
> Ahh I have applied to 6.18-rc5 with minor conflicts and can verify that
> on my AVA the AMD GPU shows up again and I can run inference jobs
> against it. So for that case:
> 
> Tested-by: Alex Bennée <[email protected]>

Thanks for testing! (And saving me the effort of backporting to 6.17 :-))

I'd be interested to see the dmesg with this series applied just to check 
there isn't anything else I should still look at (even if it now appears 
to work).

You seem to have had only a few io resource assignment failures during 
BAR resize, which might be the reason the kernel thought a rollback was 
necessary (so AFAICT, the rollback was likely entirely unnecessary as 
the mem resources did assign successfully).

I made the resize ignore unrelated (recurring) io resource failures in 
commit 31af09b3eaf3 ("PCI: Fix failure detection during resource 
resize"), but that might not have been backported to the 6.15 kernel you 
took the log from (in the initial report). So the kernel might not even 
do a rollback at all with 6.18-rc5.

-- 
 i.
