Re: GCN RDNA2+ vs. GCC SLP vectorizer

Thomas Schwinge Mon, 19 Feb 2024 08:35:29 -0800

Hi!

On 2024-02-19T17:31:20+0100, I wrote:
> On 2024-02-19T11:52:55+0100, Richard Biener <rguent...@suse.de> wrote:
>> On Mon, 19 Feb 2024, Thomas Schwinge wrote:
>>> On 2024-02-16T14:53:04+0100, I wrote:
>>> > On 2024-02-16T12:41:06+0000, Andrew Stubbs <a...@baylibre.com> wrote:
>>> >> On 16/02/2024 12:26, Richard Biener wrote:
>>> >>> On Fri, 16 Feb 2024, Andrew Stubbs wrote:
>>> >>>> On 16/02/2024 10:17, Richard Biener wrote:
>>> >>>>> On Fri, 16 Feb 2024, Thomas Schwinge wrote:
>>> >>>>>> On 2023-10-20T12:51:03+0100, Andrew Stubbs <a...@codesourcery.com> wrote:
>>> >>>>>>> I've committed this patch
>>> >>>>>>
>>> >>>>>> ... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
>>> >>>>>> "amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
>>> >>>>>> support builds on top of, and that's what I'm currently working on
>>> >>>>>> getting proper GCC/GCN target (not offloading) results for.
>>> >>>>>>
>>> >>>>>> Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
>>> >>>>>> and hopefully representative for other SLP execution test FAILs
>>> >>>>>> (regressions compared to my earlier non-gfx1100 testing).
>>> >>>>>>
>>> >>>>>>       $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
>>> >>>>>>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
>>> >>>>>>       --sysroot=install/amdgcn-amdhsa -ftree-vectorize
>>> >>>>>>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
>>> >>>>>>       -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
>>> >>>>>>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
>>> >>>>>>       source-gcc/newlib/libc/include
>>> >>>>>>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>>> >>>>>>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
>>> >>>>>>       setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
>>> >>>>>>       -fdump-rtl-all-all -save-temps -march=gfx1100
>>> >>>>>>
>>> >>>>>> The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
>>> >>>>>> 'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
>>> >>>>>> suppose will also exhibit the same failure mode, once again?
>>> >>>>>>
>>> >>>>>> Compared to '-march=gfx90a', the differences begin in
>>> >>>>>> 'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.
>>> >>>>>>
>>> >>>>>> Changed like:
>>> >>>>>>
>>> >>>>>>       @@ -38,10 +38,10 @@ int main ()
>>> >>>>>>        #pragma GCC novector
>>> >>>>>>          for (i = 1; i < N; i++)
>>> >>>>>>            if (a[i] != i%4 + 1)
>>> >>>>>>       -      abort ();
>>> >>>>>>       +      __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
>>> >>>>>>        
>>> >>>>>>          if (a[0] != 5)
>>> >>>>>>       -    abort ();
>>> >>>>>>       +    __builtin_printf("%d %d != %d\n", 0, a[0], 5);
>>> >>>>>>
>>> >>>>>> ..., we see:
>>> >>>>>>
>>> >>>>>>       $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>>> >>>>>>       40 5 != 1
>>> >>>>>>       41 6 != 2
>>> >>>>>>       42 7 != 3
>>> >>>>>>       43 8 != 4
>>> >>>>>>       44 5 != 1
>>> >>>>>>       45 6 != 2
>>> >>>>>>       46 7 != 3
>>> >>>>>>       47 8 != 4
>>> >>>>>>
>>> >>>>>> '40..47' are the 'i = 10..11' in 'foo', and the expectation is
>>> >>>>>> 'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
>>> >>>>>> scribbled zero values over these (vector lane masking issue, perhaps?),
>>> >>>>>> or some other code generation issue?
>>> >
>>> >>>> [...], I must be doing something different because vect/bb-slp-cond-1.c
>>> >>>> passes for me, on gfx1100.
>>> >
>>> > That's strange.  I've looked at your log file (looks good), and used your
>>> > toolchain to compile, and your 'gcn-run' to invoke, and still do get:
>>> >
>>> >     $ flock /tmp/gcn.lock ~/gcn-run ~/bb-slp-cond-1.exe
>>> >     GCN Kernel Aborted
>>> >     Kernel aborted
>>> >
>>> > Andrew, later on, please try what happens when you put an unconditional
>>> > 'abort' call into a test case?
>>> 
>>> Andrew, any luck with that yet?
>>> 
>>> Richard, are you able to reproduce the 'gcc.dg/vect/bb-slp-cond-1.c'
>>> execution test failure mentioned above (manual compilation and
>>> 'gcn-run')?
>>
>> No, when manually compiling/running the testcase it works fine for me.
>
> I've updated my GCC master branch sources, but it still fails for me:
>
>     $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
>       source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
>       --sysroot=install/amdgcn-amdhsa -isystem
>       build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
>       source-gcc/newlib/libc/include
>       -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
>       -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -march=gfx1100 -ftree-vectorize
>       -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2
>       -save-temps
>     $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
>     GCN Kernel Aborted
>     Kernel aborted
>
> Strange.
>
> In 'bb-slp-cond-1.tar.xz' I'm attaching the files I've built.  Could you
> please compare those to yours and try 'gcn-run gfx1030/a.out'?


Actually: 'gcn-run gfx1030/a.out' a few times -- our dear friend
Nondeterminism seems to be at play here...  :-|
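
To put a number on how often it fails, something like the following
might help.  This is only a minimal sketch: the 'count_failures' helper
name and the run count of 20 are illustrative assumptions, not part of
'gcn-run' or any GCC tooling.

```shell
# Run a command N times and print how many of the runs exited non-zero.
# Usage: count_failures N command [args...]
count_failures () {
  n=$1; shift
  failures=0
  i=0
  while [ "$i" -lt "$n" ]; do
    # Discard the program's output; only the exit status is of interest.
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures"
}

# Example invocation (lock file and paths as in the commands quoted above):
# count_failures 20 flock /tmp/gcn.lock build-gcc/gcc/gcn-run gfx1030/a.out
```

A result strictly between 0 and the run count would point to
nondeterminism rather than a consistent miscompilation.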


Grüße
 Thomas


>> Didn't yet get to try the .exp files
>>
>> Richard.
>>
>>> 
>>> Grüße
>>>  Thomas
>>> 
>>> 
>>> >>> I didn't try to run it - when doing make check-gcc it fails to use
>>> >>> gcn-run for test invocation
>>> >
>>> > Note, that for such individual test cases, invoking the compiler and then
>>> > 'gcn-run' manually would seem easiest?
>>> >
>>> >>> what's the trick to make it do that?
>>> >
>>> > I can tell you've probably not done much "embedded" or simulator testing of
>>> > GCC targets?  ;-P
>>> >
>>> >> There's a config file for nvptx here: 
>>> >> https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp
>>> >
>>> > Yes, and I have pending some updates to that one, to be finished once
>>> > I've generally got my testing set up again, to a sufficient degree...
>>> >
>>> >> You can probably make the obvious adjustments. I think Thomas has a GCN 
>>> >> version with a few more features.
>>> >
>>> > Right.  I'm attaching my current 'amdgcn-amdhsa-run.exp'.
>>> >
>>> > I'm aware that the 'set_board_info gcc,[...] [...]' may be obsolete/wrong
>>> > (as Andrew also noted privately) -- likewise, at least in part, for
>>> > GCC/nvptx, which is where I copied all that from.  (Will revise later;
>>> > not relevant for this discussion, here.)
>>> >
>>> > Similar to what I've recently added to libgomp, there is 'flock'ing here,
>>> > so that you may use 'make -j[...] check' for (partial) parallelism, but
>>> > still all execution testing runs serialized.  I found this to greatly
>>> > help denoise the test results.  (Not ideal, of course, but improving that
>>> > is for later, too.)
>>> >
>>> > You may want to disable the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' thing if
>>> > that doesn't work like that in your case.  (I've no idea what
>>> > 'amdgpu_gpu_recover' would do if the GPU is also used for display.)  But
>>> > this, again, greatly helps denoise test results, at least for the one
>>> > system I'm currently testing on.
>>> >
>>> > I intend to publish proper documentation of all this, later on -- happy
>>> > to answer any questions in the meantime.
>>> >
>>> > If you don't already have a common directory for DejaGnu board files, put
>>> > 'amdgcn-amdhsa-run.exp' into '~/tmp/amdgcn-amdhsa/', for example, and add
>>> > a 'dejagnu.exp' file next to it:
>>> >
>>> >     lappend boards_dir ~/tmp/amdgcn-amdhsa
>>> >
>>> > Prepare:
>>> >
>>> >     $ DEJAGNU=$HOME/tmp/amdgcn-amdhsa/dejagnu.exp
>>> >     $ export DEJAGNU
>>> >     $ AMDGCN_AMDHSA_RUN=[...]/build-gcc/gcc/gcn-run
>>> >     $ export AMDGCN_AMDHSA_RUN
>>> >     $ # If necessary:
>>> >     $ AMDGCN_AMDHSA_LD_LIBRARY_PATH=/opt/rocm/lib
>>> >     $ LD_LIBRARY_PATH=$AMDGCN_AMDHSA_LD_LIBRARY_PATH${LD_LIBRARY_PATH+:$LD_LIBRARY_PATH}
>>> >     $ export LD_LIBRARY_PATH
>>> >
>>> > ..., and then run:
>>> >
>>> >     $ make -j8 check-gcc-c RUNTESTFLAGS='--target_board=amdgcn-amdhsa-run/-march=gfx1030 vect.exp'
>>> >
>>> > Oh, and I saw that on <https://gcc.gnu.org/wiki/Offloading>, Tobias has
>>> > recently put into a new "Using the GPU as stand-alone system" section
>>> > some similar information.  (..., but this should, in my opinion, be on a
>>> > different page, as it's explicitly *not* about what we understand as
>>> > offloading.)
>>> >
>>> >> I usually use the CodeSourcery magic stack of scripts for testing 
>>> >> installed toolchains on remote devices, so I'm not too familiar with 
>>> >> using Dejagnu directly.
>>> >
>>> > Tsk...  ;'-|
>>> >
>>> >
>>> > Grüße
>>> >  Thomas
>>> 
>>
>> -- 
>> Richard Biener <rguent...@suse.de>
>> SUSE Software Solutions Germany GmbH,
>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
