Hi Andrew,

Andrew Stubbs wrote:
On 05/02/2025 11:14, Tobias Burnus wrote:
Therefore, the following GPUs are now supported in addition: gfx902, gfx904, gfx909, gfx1031, gfx1032, gfx1033, gfx1034, gfx1035, gfx1101, gfx1102, gfx1150, gfx1151, gfx1152, and gfx1153. However, the multilib config has not been touched, hence, those 14 device types and gfx{9,10-3,11}-generic are not supported by default. Currently, the following 9 GPUs are enabled by default:gfx900, gfx906, gfx908, gfx90a, gfx90c, gfx1030, gfx1036, gfx1100, andgfx1103.

I'm not too happy about adding a whole list of specific devices that we have not tested. So far, whenever I have added a new device there have been meta-data oddities and such-like that needed to be tweaked.

Well, the idea is: If AMD has collected them under the same generic name, the ISA must be compatible. The LLVM page lists some restrictions (such as not having sramecc support when using generic) but none of the listed items match what we have.

I fail to see how an ISA that works with, e.g., gfx9-generic will suddenly fail when compiling for it with gfx902, which except for the ELF flag contains identical code.

I also don't like adding knowledge of unsupported devices purely for improving diagnostics.

I think we have the option to delegate the checking purely to ROCm. Then gfx9-generic will run on gfx909 – or we do our own checking. But then we need to somehow know whether gfx9-generic code will run on gfx909 or not – or we bluntly reject it.

It's fine for the known-unsupported devices, but wait a month or so and there will be new unknown-unsupported devices, and the message degrades again. Worse, the new diagnostic can recommend trying -march=<name> for devices which the compiler will recognize but have never been tested, and probably don't have multilibs configured.

The having-no-multilib-configured issue is difficult to come by, unless we want to filter them out when building libgomp. We could do so, however, by doing some preprocessing.

The problem is that we then need to have two checks:

(a) Whether it runs (if we don't relegate it to ROCm) – in that case, gfx902 hardware with gfx9-generic should just work, even if there is neither a gfx902 nor gfx9-generic multilib. After all, the user managed to link the executable.

(b) When recompiling on the same system as running the build, suggesting a -march=gfx... that has a multilib would be better, i.e. here the filtered-out value could be helpful.

(c) For suggesting generic, we also would need to check the ROCm version to only propose it when ROCm is > 6.3, assuming that's the thing.

BTW: The issue of having no multilib configured is not really new. We had it before with fiji or when the user configured GCC in some non-default way. (As we currently enable all GPUs by default. But I think we didn't do so for a while for the gfx1... ones. But I don't recall whether we did do so for a release or not.)

And I think the error is also not to illegible:

ld: error: unable to find library -lgfortran

gcn mkoffload: fatal error: .../x86_64-pc-linux-gnu-accel-amdgcn-amdhsa-gcc returned 1 exit status


A better approach might be to pattern-match "gfx{9,10,11}" in the name HSA gives you for the physical device and recommend generic -march=gfx{9,10,11}-generic in those cases?

I think that will be way worse. — gfx908 and gfx90a are *not* compatible with gfx9-generic. Similarly, gfx94{0,1,2}/gfx950 are gfx9 devices but only in gfx9-4-generic and not supported by us. And for gfx10, we only support gfx10-3-generic, i.e. gfx103x (technically x = 0...6, currently only 0 and 3), but not gfx10-1-generic (gfx101{0,1,2,3}).

Thus, I think it is way better to assume that GPUs listed for each gfx*-generic as having identical ISA than any other proposed way. We could hard code this ourselves (as done in the patch) or to do it by letting ROCm do the job.

(There are some restrictions listed, like "not all VGPR can be used on gfx1100" but as we added gfx1103, we can just use the gfx1103 settings as gfx1100 does not have those features, either.)

Thus, I still regard my proposed approach as superior.


I'm happy to add the new gfx9-generic, and improving the diagnostics is always good, but I'm not convinced about making it look like we support devices we've never tested.

As mentioned, AMD regards them as compatible. I am happy to add some wording like "(unsupported)" to the -march= documentation, in case it helps.

Tobias

Reply via email to