On Thu, 1 Feb 2024, Andre Vieira (lists) wrote:
>
>
> On 01/02/2024 07:19, Richard Biener wrote:
> > On Wed, 31 Jan 2024, Andre Vieira (lists) wrote:
> >
> >
> > The patch didn't come with a testcase so it's really hard to tell
> > what goes wrong now and how it is fixed ...
>
> My bad! I had a testcase locally but never added it...
>
> However... now that I've looked at it and run it past Richard S, the codegen
> isn't 'wrong', but it has the potential to be pretty slow, especially for
> inbranch simdclones, where it transforms the SVE predicate into an Advanced
> SIMD vector by inserting the elements one at a time...
>
> An example of which can be seen if you do:
>
> gcc -O3 -march=armv8-a+sve -msve-vector-bits=128 -fopenmp-simd t.c -S
>
> with the following t.c:
> #pragma omp declare simd simdlen(4) inbranch
> int __attribute__ ((const)) fn5(int);
>
> void fn4 (int *a, int *b, int n)
> {
>   for (int i = 0; i < n; ++i)
>     b[i] = fn5(a[i]);
> }
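>
> For reference, the inbranch clone the loop ends up calling looks roughly
> like the prototype below (a sketch for illustration only, not GCC's actual
> declaration; per the vector ABI the mask is an extra Advanced SIMD vector
> of ints, one lane per data lane):
>
> #include <arm_neon.h>
>
> /* Hypothetical prototype of the simdlen(4) inbranch clone of fn5.  */
> int32x4_t _ZGVnM4v_fn5 (int32x4_t x, int32x4_t mask);
>
> The slow part is building that mask argument from the SVE loop predicate.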
>
> Now I do have to say, for our main use case of libmvec we won't have any
> 'inbranch' Advanced SIMD clones, so we avoid that issue... But of course that
> doesn't mean user code will.
It seems to use SVE masks with vector(4) <signed-boolean:4> while the
ABI says the mask is vector(4) int. You say that's because we choose
an Advanced SIMD clone for the SVE VLS vector code (it calls _ZGVnM4v_fn5).
The vectorizer creates
_44 = VEC_COND_EXPR <loop_mask_41, { 1, 1, 1, 1 }, { 0, 0, 0, 0 }>;
and then vector lowering decomposes this. That means the vectorizer
lacks a check that the target handles this VEC_COND_EXPR.
Of course I would expect that SVE with VLS vectors is able to
code-generate this operation, so in the end it's just a matter of
missing patterns.
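
For illustration, something along these lines (a rough ACLE sketch, not
what the vectorizer actually emits) is the operation we'd want a pattern
for; with -msve-vector-bits=128 it should amount to a single SEL on the
predicate (plus materialising the two constants) rather than a chain of
element inserts:

#include <arm_sve.h>

/* Hypothetical helper: materialise the VEC_COND_EXPR above, i.e. turn an
   SVE predicate into the 1/0 integer vector that the Advanced SIMD
   clone's mask argument expects.  */
svint32_t
mask_to_int_vector (svbool_t pg)
{
  return svsel_s32 (pg, svdup_n_s32 (1), svdup_n_s32 (0));
}
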
Richard.
> I'm going to remove this patch and run another regression test to see if it
> catches anything weird, but if not then I guess we do have the option to not
> use this patch and aim to solve the costing or codegen issue in GCC 15. We
> don't currently do any simdclone costing, and I don't have a clear suggestion
> for how to, given OpenMP has no mechanism that I know of to expose the speedup
> of a simdclone over its scalar variant. So how would we 'compare' a simdclone
> call, with its extra overhead of argument preparation, against scalar code?
> Though at least we could prefer a call to a different simdclone with less
> argument preparation. Anyway, I digress.
>
> Other tests: these require --param aarch64-autovec-preference=2, so they
> worry me less...
>
> gcc -O3 -march=armv8-a+sve -msve-vector-bits=128 --param
> aarch64-autovec-preference=2 -fopenmp-simd t.c -S
>
> t.c:
> #pragma omp declare simd simdlen(2) notinbranch
> float __attribute__ ((const)) fn1(double);
>
> void fn0 (float *a, float *b, int n)
> {
>   for (int i = 0; i < n; ++i)
>     b[i] = fn1((double) a[i]);
> }
>
> #pragma omp declare simd simdlen(2) notinbranch
> float __attribute__ ((const)) fn3(float);
>
> void fn2 (float *a, double *b, int n)
> {
>   for (int i = 0; i < n; ++i)
>     b[i] = (double) fn3(a[i]);
> }
>
> > Richard.
> >
> >>>
> >>> That said, I wonder how we end up mixing things up in the first place.
> >>>
> >>> Richard.
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)