On Tue, 20 Feb 2024, Thomas Schwinge wrote:

> Hi Richard!
>
> On 2024-02-20T08:44:35+0100, Richard Biener <rguent...@suse.de> wrote:
> > On Mon, 19 Feb 2024, Thomas Schwinge wrote:
> >> On 2024-02-19T17:31:20+0100, I wrote:
> >> > On 2024-02-19T11:52:55+0100, Richard Biener <rguent...@suse.de> wrote:
> >> >> On Mon, 19 Feb 2024, Thomas Schwinge wrote:
> >> >>> On 2024-02-16T14:53:04+0100, I wrote:
> >> >>> > On 2024-02-16T12:41:06+0000, Andrew Stubbs <a...@baylibre.com> wrote:
> >> >>> >> On 16/02/2024 12:26, Richard Biener wrote:
> >> >>> >>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
> >> >>> >>>> On 16/02/2024 10:17, Richard Biener wrote:
> >> >>> >>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
> >> >>> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <a...@codesourcery.com> wrote:
> >> >>> >>>>>>> I've committed this patch
> >> >>> >>>>>>
> >> >>> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
> >> >>> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
> >> >>> >>>>>> support builds on top of, and that's what I'm currently working on
> >> >>> >>>>>> getting proper GCC/GCN target (not offloading) results for.
> >> >>> >>>>>>
> >> >>> >>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
> >> >>> >>>>>> and hopefully representative for other SLP execution test FAILs
> >> >>> >>>>>> (regressions compared to my earlier non-gfx1100 testing).
> >> >>> >>>>>>
> >> >>> >>>>>>     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
> >> >>> >>>>>>     source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> >> >>> >>>>>>     --sysroot=install/amdgcn-amdhsa -ftree-vectorize
> >> >>> >>>>>>     -fno-tree-loop-distribute-patterns -fno-vect-cost-model
> >> >>> >>>>>>     -fno-common -O2 -fdump-tree-slp-details -fdump-tree-vect-details
> >> >>> >>>>>>     -isystem build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include
> >> >>> >>>>>>     -isystem source-gcc/newlib/libc/include
> >> >>> >>>>>>     -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >> >>> >>>>>>     -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
> >> >>> >>>>>>     setarch,--addr-no-randomize -fdump-tree-all-all
> >> >>> >>>>>>     -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
> >> >>> >>>>>>
> >> >>> >>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
> >> >>> >>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
> >> >>> >>>>>> suppose will also exhibit the same failure mode, once again?
> >> >>> >>>>>>
> >> >>> >>>>>> Compared to '-march=gfx90a', the differences begin in
> >> >>> >>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to
> >> >>> >>>>>> 'a-bb-slp-cond-1.s'.
> >> >>> >>>>>>
> >> >>> >>>>>> Changed like:
> >> >>> >>>>>>
> >> >>> >>>>>>     @@ -38,10 +38,10 @@ int main ()
> >> >>> >>>>>>      #pragma GCC novector
> >> >>> >>>>>>        for (i = 1; i < N; i++)
> >> >>> >>>>>>          if (a[i] != i%4 + 1)
> >> >>> >>>>>>     -      abort ();
> >> >>> >>>>>>     +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
> >> >>> >>>>>>
> >> >>> >>>>>>        if (a[0] != 5)
> >> >>> >>>>>>     -    abort ();
> >> >>> >>>>>>     +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
> >> >>> >>>>>>
> >> >>> >>>>>> ..., we see:
> >> >>> >>>>>>
> >> >>> >>>>>>     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >> >>> >>>>>>     40 5 != 1
> >> >>> >>>>>>     41 6 != 2
> >> >>> >>>>>>     42 7 != 3
> >> >>> >>>>>>     43 8 != 4
> >> >>> >>>>>>     44 5 != 1
> >> >>> >>>>>>     45 6 != 2
> >> >>> >>>>>>     46 7 != 3
> >> >>> >>>>>>     47 8 != 4
> >> >>> >>>>>>
> >> >>> >>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
> >> >>> >>>>>> 'a[i * stride + 0..3] != 0'. So, either some earlier iteration has
> >> >>> >>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
> >> >>> >>>>>> or some other code generation issue?
> >> >>> >
> >> >>> >>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
> >> >>> >>>> passes for me, on gfx1100.
> >> >>> >
> >> >>> > That's strange. I've looked at your log file (looks good), and used your
> >> >>> > toolchain to compile, and your 'gcn-run' to invoke, and still do get:
> >> >>> >
> >> >>> >     $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
> >> >>> >     GCN Kernel Aborted
> >> >>> >     Kernel aborted
> >> >>> >
> >> >>> > Andrew, later on, please try what happens when you put an unconditional
> >> >>> > 'abort' call into a test case?
> >> >>>
> >> >>> Andrew, any luck with that yet?
> >> >>>
> >> >>> Richard, are you able to reproduce the 'gcc.dg/vect/bb-slp-cond-1.c'
> >> >>> execution test failure mentioned above (manual compilation and
> >> >>> 'gcn-run')?
> >> >>
> >> >> No, when manually compiling/running the testcase it works fine for me.
> >> >
> >> > I've updated my GCC master branch sources, but it still fails for me:
> >> >
> >> >     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
> >> >     source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
> >> >     --sysroot=install/amdgcn-amdhsa -isystem
> >> >     build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
> >> >     source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
> >> >     -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -march=gfx1100 -ftree-vectorize
> >> >     -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2
> >> >     -save-temps
> >> >     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
> >> >     GCN Kernel Aborted
> >> >     Kernel aborted
> >> >
> >> > Strange.
> >> >
> >> > In 'bb-slp-cond-1.tar.xz' I'm attaching the files I've built. Could you
> >> > please compare those to yours and try 'gcn-run gfx1030/a.out'?
> >>
> >> Actually: 'gcn-run gfx1030/a.out' a few times -- our dear friend
> >> Nondeterminism seems to be at play here... :-|
> >
> > What's your set of compile options? I don't manage to get close
> > to your gfx1030 assembly when using your preprocessed source ...
> >
> > I've tried -march=gfx1030 -O[23] [-fno-vect-cost-model]
>
> See the 'xgcc' command line just a few lines above? ;-)
>
>     -ftree-vectorize -fno-tree-loop-distribute-patterns -fno-vect-cost-model
>     -fno-common -O2
>
> That's what I originally found in 'gcc.log'.

OK, with that I can reproduce the issue. -O2 -ftree-vectorize seems to be
enough to trigger it. It's indeed somewhat random whether it fails or
not ...

Richard.

>
> Grüße
>  Thomas
>
>
> > Looks like you use -fno-omit-frame-pointer but then I still see
> > -mine +yours
> >
> > - v_readlane_b32 s18, v4, 0
> > - v_readlane_b32 s19, v5, 0
> > - s_add_u32 s18, s18, s26
> > - s_addc_u32 s19, s19, s27
> > - v_writelane_b32 v4, s18, 0
> > - v_writelane_b32 v5, s19, 0
> > - s_mov_b32 s18, s14
> > - s_mov_b32 s19, s15
> > - s_mov_b32 s22, scc
> > - s_add_u32 s18, s18, 4096
> > - s_addc_u32 s19, s19, 0
> > - s_cmpk_lg_u32 s22, 0
> > - v_writelane_b32 v6, s18, 0
> > - v_writelane_b32 v7, s19, 0
> > - flat_store_dwordx2 v[6:7], v[4:5]
> > + v_writelane_b32 v6, s26, 0
> > + v_writelane_b32 v7, s27, 0
> > + v_add_co_u32 v4, vcc, v6, v4
> > + v_add_co_ci_u32 v5, vcc, v7, v5, vcc
> >
> > and more changes.
> >
> > Richard.
> >
> >>
> >> Grüße
> >>  Thomas
> >>
> >>
> >> >> Didn't yet get to try the .exp files
> >> >>
> >> >> Richard.
> >> >>
> >> >>>
> >> >>> Grüße
> >> >>>  Thomas
> >> >>>
> >> >>>
> >> >>> >>> I didn't try to run it - when doing make check-gcc fails to using
> >> >>> >>> gcn-run for test invocation
> >> >>> >
> >> >>> > Note, that for such individual test cases, invoking the compiler and then
> >> >>> > 'gcn-run' manually would seem easiest?
> >> >>> >
> >> >>> >>> what's the trick to make it do that?
> >> >>> >
> >> >>> > I tell you've probably not done much "embedded" or simulator testing of
> >> >>> > GCC targets? ;-P
> >> >>> >
> >> >>> >> There's a config file for nvptx here:
> >> >>> >> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp
> >> >>> >
> >> >>> > Yes, and I have pending some updates to that one, to be finished once
> >> >>> > I've generally got my testing set up again, to a sufficient degree...
> >> >>> >
> >> >>> >> You can probably make the obvious adjustments.
> >> >>> >> I think Thomas has a GCN version with a few more features.
> >> >>> >
> >> >>> > Right. I'm attaching my current 'amdgcn-amdhsa-run.exp'.
> >> >>> >
> >> >>> > I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
> >> >>> > (as Andrew also noted privately) -- likewise, at least in part, for
> >> >>> > GCC/nvptx, which is where I copied all that from. (Will revise later;
> >> >>> > not relevant for this discussion, here.)
> >> >>> >
> >> >>> > Similar to what I've recently added to libgomp, there is 'flock'ing here,
> >> >>> > so that you may use 'make -j[...] check' for (partial) parallelism, but
> >> >>> > still all execution testing runs serialized. I found this to greatly
> >> >>> > help denoise the test results. (Not ideal, of course, but improving that
> >> >>> > is for later, too.)
> >> >>> >
> >> >>> > You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
> >> >>> > that doesn't work like that in your case. (I've no idea what
> >> >>> > 'amdgpu_gpu_recover' would do if the GPU is also used for display.) But
> >> >>> > this, again, greatly helps denoise test results, at least for the one
> >> >>> > system I'm currently testing on.
> >> >>> >
> >> >>> > I intend to publish proper documentation of all this, later on -- happy
> >> >>> > to answer any questions in the mean time.
> >> >>> >
> >> >>> > If you don't already have a common directory for DejaGnu board files, put
> >> >>> > 'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
> >> >>> > a 'dejagnu.exp' file next to it:
> >> >>> >
> >> >>> >     lappend boards_dir ~/tmp/amdgcn-amdhsa
> >> >>> >
> >> >>> > Prepare:
> >> >>> >
> >> >>> >     $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
> >> >>> >     $ export DEJAGNU
> >> >>> >     $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
> >> >>> >     $ export AMDGCN_AMDHSA_RUN
> >> >>> >     $ # If necessary:
> >> >>> >     $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
> >> >>> >     $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
> >> >>> >     $ export LD_LIBRARY_PATH
> >> >>> >
> >> >>> > ..., and then run:
> >> >>> >
> >> >>> >     $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'
> >> >>> >
> >> >>> > Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
> >> >>> > recently put into a new "Using the GPU as stand-alone system" section
> >> >>> > some similar information. (..., but this should, in my opinion, be on a
> >> >>> > different page, as it's explicitly *not* about what we understand as
> >> >>> > offloading.)
> >> >>> >
> >> >>> >> I usually use the CodeSourcery magic stack of scripts for testing
> >> >>> >> installed toolchains on remote devices, so I'm not too familiar with
> >> >>> >> using Dejagnu directly.
> >> >>> >
> >> >>> > Tsk... ;'-|
> >> >>> >
> >> >>> >
> >> >>> > Grüße
> >> >>> >  Thomas
> >> >>>
> >> >>
> >> >> --
> >> >> Richard Biener <rguent...@suse.de>
> >> >> SUSE Software Solutions Germany GmbH,
> >> >> Frankenstrasse 146, 90461 Nuernberg, Germany;
> >> >> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> >> >
>
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
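
Since the failure discussed in this thread is intermittent, a single run says little. The following is a minimal sketch of how one might count failing runs over several invocations. 'count_failures' is a hypothetical helper, not part of GCC or 'gcn-run', and the sketch assumes that 'gcn-run' exits with a non-zero status when the kernel aborts.

```shell
#!/bin/sh
# Hypothetical helper (not part of GCC): run a command N times and
# print how many of those runs exited with a non-zero status.
# Usage: count_failures N command [args...]
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the program's output; only the exit status is inspected.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example, using the paths from the thread and serializing GPU access
# with 'flock' (assumes a non-zero exit status on "Kernel aborted"):
#   count_failures 20 flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
```

A result strictly between 0 and the number of runs would be consistent with the nondeterministic behaviour reported above.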