On 18/10/2025 07:58, Richard Biener wrote:


On 18.10.2025 at 08:28, Thomas Schwinge <[email protected]> wrote:

Hi!

On 2025-10-17T15:55:44+0100, Andrew Stubbs <[email protected]> wrote:
On 17/10/2025 15:35, Thomas Schwinge wrote:
On 2025-09-09T16:52:57+0000, Andrew Stubbs <[email protected]> wrote:
The previous definition had all the GFX11 register counts doubled to fix a bug
that was encountered in early testing.  This seems to have been a
misunderstanding of the problem (which is no longer reproducible).

I can't comment on the historic aspects, but I can tell that since this
commit r16-3726-g7bc2e311688ac279f1abc2a47944e5b763f7ec89
"amdgcn: fix GFX10/GFX11 VGPR counts", '-march=gfx1100' testing is
completely broken; nothing but:

     Memory access fault by GPU node-2 (Agent handle: [...]) on address (nil). Reason: Page not present or supervisor privilege.

May I 'git push' my 'git revert', or should I keep that local, awaiting
your investigation?

It works for me!??????

Mystery resolved: I was using LLVM 15 tools (GNU Guix 15.0.7) vs. Andrew
using some "21.0.0git" version.  Step-wise upgrading (GNU Guix): 16.0.6,
17.0.6, 18.1.8 still fail in the same way, but then with 19.1.7 it's good
once again.

How to proceed?  LLVM 19 was released just one year ago, in summer
2024.  Is that too recent to require ("for users of affected
configurations"; I can't tell exactly which those are)?  We could
go back to the previous GCC/GCN code generation -- maybe conditionally on
the LLVM version available, or conditionally on a feature/bug fix
'configure'-time check yet to be determined?
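
For illustration only, the version-conditional option could be sketched roughly as below. This is not an existing GCC 'configure' check: the function names are hypothetical, and it assumes the assembler's '--version' output contains a line of the form "LLVM version X.Y.Z". The threshold of 19 comes from the observations above (15.0.7 through 18.1.8 fail, 19.1.7 works).

```python
import re

# Threshold taken from the thread: 19.1.7 works, 15.0.7 through 18.1.8 fail.
MIN_LLVM_MAJOR = 19


def llvm_major_version(version_output):
    """Return the major version found in text like 'LLVM version 19.1.7'.

    Returns None if no such version string is present.  The output format
    is an assumption, not something checked against every LLVM release.
    """
    match = re.search(r"LLVM version (\d+)\.(\d+)\.(\d+)", version_output)
    return int(match.group(1)) if match else None


def llvm_new_enough(version_output):
    """True if the reported LLVM major version is at least MIN_LLVM_MAJOR."""
    major = llvm_major_version(version_output)
    return major is not None and major >= MIN_LLVM_MAJOR
```

A feature/bug-fix check would be more robust than a version comparison, since distributions may backport fixes, but that check is still "yet to be determined".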

I think requiring LLVM 19 or up is fine.  It _is_ annoying that we need to tap 
into those tools for assembler/linker.  Which part is the issue here?  
Assembler or linker?

I believe it is the assembler that encodes the meta-data.

When I first did the RDNA support patch I observed many test failures. Debugging the issue in rocgdb showed that there were fewer available registers than requested, by half, so I doubled all the counts and everything worked.

This made sense because "wave64" mode means that, internally, it uses two 32-lane registers to form one 64-lane register. It seemed plausible that the metadata might expose this detail. However, if that is the case, it seems like LLVM has chosen to hide that internal detail, so now there's no need to work around the issue in GCC any more.
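
To make the double-counting concrete, here is a purely illustrative sketch. The function and parameter names are hypothetical; this is neither the GCC back-end code nor LLVM's metadata encoding, just the arithmetic described above.

```python
def metadata_vgpr_count(wave64_vgprs, counts_32_lane_halves):
    """Return the VGPR count to record for a kernel.

    wave64_vgprs: number of 64-lane vector registers the kernel uses.
    counts_32_lane_halves: True if the consumer of the metadata counts the
        two 32-lane registers behind each 64-lane register separately
        (the old workaround), False if it hides that detail.
    """
    if counts_32_lane_halves:
        # Each wave64 register is backed by two 32-lane registers.
        return wave64_vgprs * 2
    return wave64_vgprs
```

With the workaround, a kernel using 24 wave64 registers is recorded as using 48; without it, as 24. If the tools already account for the pairing, the doubled figure overstates the real usage.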

I removed the patch because I was observing the opposite problem: testcases that use a lot of registers were exceeding the maximum number, causing "invalid ISA" errors at runtime. After pulling my hair out trying to figure out how many registers real devices actually have, using the somewhat vague resources available, I eventually came to the conclusion that the double-counting was wrong. At the time I assumed that I had originally been in error and confused two issues or something, but now it looks like it was actually the assembler that changed.

I didn't notice the problem until this summer because the testcase in question didn't start to fully vectorize (and therefore use more registers) until I added the new patches.

Andrew
