Re: [committed] amdgcn: fix GFX10/GFX11 VGPR counts

Andrew Stubbs Mon, 20 Oct 2025 02:56:24 -0700

On 18/10/2025 07:58, Richard Biener wrote:

Am 18.10.2025 um 08:28 schrieb Thomas Schwinge <[email protected]>:

Hi!

On 2025-10-17T15:55:44+0100, Andrew Stubbs <[email protected]> wrote:

On 17/10/2025 15:35, Thomas Schwinge wrote:
On 2025-09-09T16:52:57+0000, Andrew Stubbs <[email protected]> wrote:

The previous definition had all the GFX11 register counts doubled to fix a bug
that was encountered in early testing.  This seems to have been a
misunderstanding of the problem (which is no longer reproducible).


I can't comment on the historic aspects, but I can tell that since this
commit r16-3726-g7bc2e311688ac279f1abc2a47944e5b763f7ec89
"amdgcn: fix GFX10/GFX11 VGPR counts", '-march=gfx1100' testing is
completely broken; nothing but:

     Memory access fault by GPU node-2 (Agent handle: [...]) on address (nil). Reason: Page not present or supervisor privilege.

May I 'git push' my 'git revert', or should I keep that local, awaiting
your investigation?


It works for me!??????


Mystery resolved: I was using LLVM 15 tools (GNU Guix 15.0.7) vs. Andrew
using some "21.0.0git" version.  Step-wise upgrading (GNU Guix): 16.0.6,
17.0.6, 18.1.8 still fail in the same way, but then with 19.1.7 it's good
once again.

How to proceed?  LLVM 19 was released just one year ago, in summer
2024.  Is that too recent to require ("for users of affected
configurations", which I can't tell which exactly those are)?  We could
go back to the previous GCC/GCN code generation -- maybe conditionally on
the LLVM version available, or conditionally on a feature/bug fix
'configure'-time check yet to be determined?


I think requiring LLVM 19 or up is fine.  It _is_ annoying that we need to tap 
into those tools for assembler/linker.  Which part is the issue here?  
Assembler or linker?


I believe it is the assembler that encodes the meta-data.

When I first did the RDNA support patch I observed many test failures.
Debugging the issue in rocgdb showed that there were fewer available
registers than requested, by half, so I doubled all the counts and
everything worked.

This made sense because "wave64" mode means that, internally, it uses
two 32-lane registers to form one 64-lane register. It seemed plausible
that the metadata might expose this detail. However, if that is the case
it seems like LLVM has chosen to hide that internal detail, so now
there's no need to work around the issue in GCC any more.

I removed the patch because I was observing the opposite problem:
testcases that use a lot of registers were exceeding the maximum number,
causing "invalid ISA" errors at runtime. After pulling my hair out
trying to figure out how many registers real devices actually have,
using the somewhat vague resources available, I eventually came to the
conclusion that the double-counting was wrong. At the time I assumed
that I had originally been in error and confused two issues or
something, but now it looks like it was actually the assembler that
changed.

I didn't notice the problem until this summer because the testcase in
question didn't start to fully vectorize (and therefore use more
registers) until I added the new patches.
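
For illustration only, the behaviour described in this thread can be
written down as a small Python sketch. It is not GCC code, and the names
llvm_major and vgpr_count_for_metadata are hypothetical. The cut-over at
major version 19 is an assumption taken from the versions tested above
(15.0.7 through 18.1.8 failing with the undoubled counts, 19.1.7 and
"21.0.0git" working); the thread does not identify the exact LLVM change.

```python
import re

# Assumption from this thread: LLVM 15 through 18 needed the doubled
# counts on gfx1100, and 19.1.7 was the first tested release that did not.
FIRST_LLVM_MAJOR_WITHOUT_DOUBLING = 19


def llvm_major(version_text):
    """Extract the major version from text such as 'LLVM version 19.1.7'."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number in {version_text!r}")
    return int(match.group(1))


def vgpr_count_for_metadata(vgprs_used, version_text):
    """Return the VGPR count to report for a wave64 kernel (hypothetical)."""
    if llvm_major(version_text) < FIRST_LLVM_MAJOR_WITHOUT_DOUBLING:
        # Older assemblers appeared to count 32-lane halves, so the
        # count had to be doubled to get enough registers.
        return 2 * vgprs_used
    # Newer assemblers appear to hide that detail; report the count as-is.
    return vgprs_used
```

A 'configure'-time check along these lines would have to obtain the
version text from the installed assembler; how best to do that is left
open here.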


Andrew
