[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems

2022-05-30 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929

Chris Elrod  changed:

   What|Removed |Added

 CC||elrodc at gmail dot com

--- Comment #29 from Chris Elrod  ---
"RESOLVED FIXED". I haven't tried this with `target`, but avx512bw does not
work with target_clones with gcc 11.2, but it does with clang 14.
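
For illustration, a minimal sketch of the kind of `target_clones` usage I mean
(the function name and body here are placeholders, not the real code):

__attribute__((target_clones("avx512bw", "avx2", "default")))
int sum_bytes(const unsigned char *p, int n) {
  // a loop the compiler can vectorize differently in each clone
  int s = 0;
  for (int i = 0; i < n; ++i) s += p[i];
  return s;
}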

[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems

2022-05-30 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929

--- Comment #30 from Chris Elrod  ---
> #if defined(__clang__)
> #define MULTIVERSION                                                         \
>   __attribute__((target_clones("avx512dq", "avx2", "default")))
> #else
> #define MULTIVERSION                                                         \
>   __attribute__((target_clones(                                              \
>       "arch=skylake-avx512,arch=cascadelake,arch=icelake-client,arch="       \
>       "tigerlake,"                                                           \
>       "arch=icelake-server,arch=sapphirerapids,arch=cooperlake",             \
>       "avx2", "default")))
> #endif

For example, I can do something like this, but gcc produces a ton of
unnecessary duplicates for each of the avx512dq architectures. There must be a
better way.

[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems

2022-05-30 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929

--- Comment #32 from Chris Elrod  ---
Ha, I accidentally misreported my gcc version. I was already using 12.1.1.

Using x86-64-v4 worked, excellent! Thanks.
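
For reference, a sketch of the simplified macro this allows, assuming
`arch=x86-64-v4` is accepted by `target_clones` the same way the other
`arch=` strings are:

#if defined(__clang__)
#define MULTIVERSION                                                           \
  __attribute__((target_clones("avx512dq", "avx2", "default")))
#else
#define MULTIVERSION                                                           \
  __attribute__((target_clones("arch=x86-64-v4", "avx2", "default")))
#endif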

[Bug target/114276] New: Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong`

2024-03-07 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276

Bug ID: 114276
   Summary: Trapping on aligned operations when using vector
builtins + `-std=gnu++23 -fsanitize=address
-fstack-protector-strong`
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: elrodc at gmail dot com
  Target Milestone: ---

Created attachment 57651
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57651&action=edit
test file

I'm not sure how to categorize the issue, so I picked "target" as it occurs for
x86_64 when using aligned moves on 64-byte avx512 vectors.

`-std=c++23` also reproduces the problem.
I am using:

> g++ --version
> g++ (GCC) 13.2.1 20231205 (Red Hat 13.2.1-6)
> Copyright (C) 2023 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The attached file is:

> #include <cstdint>
> #include <cstddef>
> 
> template <ptrdiff_t W, typename T>
> using Vec [[gnu::vector_size(W * sizeof(T))]] = T;
> 
> auto foo() {
>   Vec<8, int64_t> ret{};
>   return ret;
> }
> 
> int main() {
>   foo();
>   return 0;
> }

I have attached this file.

On a skylake-avx512 CPU, I get

> g++ -std=gnu++23 -march=skylake-avx512 -fstack-protector-strong -O0 -g 
> -mprefer-vector-width=512 -fsanitize=address,undefined -fsanitize-trap=all 
> simdvecalign.cpp && ./a.out
AddressSanitizer:DEADLYSIGNAL
=================================================================
==36238==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0040125c bp
0x7ffdf88a1cb0 sp 0x7ffdf88a1bc0 T0)
==36238==The signal is caused by a READ memory access.
==36238==Hint: this fault was caused by a dereference of a high value address
(see register values below).  Disassemble the provided pc to learn which
register was used.
#0 0x40125c in foo()
/home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:8
#1 0x4012d1 in main
/home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:13
#2 0x7f296b846149 in __libc_start_call_main (/lib64/libc.so.6+0x28149)
(BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
#3 0x7f296b84620a in __libc_start_main_impl (/lib64/libc.so.6+0x2820a)
(BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
#4 0x4010a4 in _start
(/home/chriselrod/Documents/progwork/cxx/experiments/a.out+0x4010a4) (BuildId:
765272b0173968b14f4306c8d4a37fcb18733889)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV
/home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:8 in foo()
==36238==ABORTING
fish: Job 1, './a.out' terminated by signal SIGABRT (Abort)

However, if I remove any of `-std=gnu++23`, `-fsanitize=address`, or
`-fstack-protector-strong`, the code runs without a problem.

Using 32-byte vectors instead of 64-byte vectors also allows it to work.
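
For completeness, a sketch of what I mean by the 32-byte variant (same `Vec`
alias, vector narrowed to 4 x int64_t):

#include <cstddef>
#include <cstdint>

template <ptrdiff_t W, typename T>
using Vec [[gnu::vector_size(W * sizeof(T))]] = T;

auto foo32() {
  Vec<4, int64_t> ret{};  // 32 bytes; this width does not trap for me
  return ret;
}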

I also used `-S` to look at the assembly.

When I edit the two lines:
>   vmovdqa64   %zmm0, -128(%rdx)
>   .loc 1 9 10
>   vmovdqa64   -128(%rdx), %zmm0

swapping `vmovdqa64` for `vmovdqu64`, the code runs as intended.

> g++ -fsanitize=address simdvecalign.s # using vmovdqu64
> ./a.out
> g++ -fsanitize=address simdvecalign.s # reverted back to vmovdqa64
> ./a.out
AddressSanitizer:DEADLYSIGNAL
=================================================================
==40364==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0040125c bp
0x7ffd2e2dc240 sp 0x7ffd2e2dc140 T0)

so I am inclined to think that something isn't guaranteeing that `%rdx` is
actually 64-byte aligned (but it may be 32-byte aligned, given that I can't
reproduce with 32 byte vectors).
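
A minimal sketch of the alignment check behind that suspicion (the helper and
its name are mine, not from the testcase):

#include <cstdint>

// true if `p` satisfies the 64-byte alignment that vmovdqa64 requires
inline bool is_aligned_64(const void *p) {
  return (reinterpret_cast<std::uintptr_t>(p) % 64) == 0;
}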

[Bug target/114276] Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong`

2024-03-07 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276

--- Comment #1 from Chris Elrod  ---
Created attachment 57652
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57652&action=edit
assembly from adding `-S`

[Bug target/110027] Misaligned vector store on detect_stack_use_after_return

2024-03-08 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #9 from Chris Elrod  ---
> Interestingly this seems to be only reproducible on Arch Linux. Other gcc 
> 13.1.1 builds, Fedora for instance, seem to behave correctly. 

I haven't tried that reproducer on Fedora with gcc 13.2.1, which could have
regressed since 13.1.1.
However, the dup example in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276
does reproduce on Fedora with gcc-13.2.1 once you add extra compile flags
`-std=c++23 -fstack-protector-strong`.
I'll try the original reproducer later; it may be that adding/removing
these flags fuzzes the alignment.

[Bug c++/111493] New: [concepts] multidimensional subscript operator inside requires is broken

2023-09-20 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111493

Bug ID: 111493
   Summary: [concepts] multidimensional subscript operator inside
requires is broken
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: elrodc at gmail dot com
  Target Milestone: ---

Two example programs:

>  #include <concepts>
>  constexpr auto foo(const auto &A, int i, int j)
>    requires(requires(decltype(A) a, int ii) { a[ii, ii]; }) {
>    return A[i, j];
>  }
>  constexpr auto foo(const auto &A, int i, int j) {
>    return A + i + j;
>  }
>  static_assert(foo(2,3,4) == 9);


>  #include <concepts>
>  template <typename T>
>  concept CartesianIndexable = requires(T t, int i) {
>    { t[i, i] } -> std::convertible_to<T>;
>  };
>  static_assert(!CartesianIndexable<int>);

These result in errors of the form

  error: invalid types 'const int[int]' for array subscript

Here is godbolt for reference: https://godbolt.org/z/WE66nY8zG

The invalid subscript should result in the `requires` failing, not an error.
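
For contrast, the analogous single-index case behaves as expected; a minimal
sketch:

#include <concepts>

template <typename T>
concept Indexable = requires(T t, int i) {
  { t[i] } -> std::convertible_to<T>;
};
// int has no operator[], so the requirement silently fails and the concept is false
static_assert(!Indexable<int>);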

[Bug c++/111493] [concepts] multidimensional subscript operator inside requires is broken

2023-09-20 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111493

--- Comment #2 from Chris Elrod  ---
Note that it also shows up in gcc-13. I put gcc-14 as the version to indicate
that I confirmed it is still a problem on latest trunk. Not sure what the
policy is on which version we should report.

[Bug c++/93008] Need a way to make inlining heuristics ignore whether a function is inline

2024-05-05 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008

--- Comment #14 from Chris Elrod  ---
To me, an "inline" function is one that the compiler inlines.
It just happens that the `inline` keyword also means both comdat semantics, and
possibly hiding the symbol to make it internal (-fvisibility-inlines-hidden).
It also just happens to be the case that the vast majority of the time I mark a
function `inline`, it is because of this, not because of the compiler hint.
`static` of course also specifies internal linkage, but I generally prefer the
comdat semantics: I'd rather merge than duplicate the definitions.
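
A minimal sketch of that trade-off (illustrative only):

// comdat: definitions from multiple TUs are merged by the linker; with
// -fvisibility-inlines-hidden the symbol can additionally be made non-exported
inline int merged_across_tus(int x) { return x + 1; }

// internal linkage: every TU that sees this gets its own private copy,
// so the definition is duplicated rather than merged
static int duplicated_per_tu(int x) { return x + 1; }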

If there is a new keyword or pragma meaning comdat semantics (and preferably
also specifying internal linkage), I would rather have the name reference that.

I'd rather have a positive name for what it does than a negative one:
"quasi_inline: like inline, except it does everything inline does other than
the inlining part".
Why define it as a set difference -- naming it after the thing it does not do!
-- when you could define it in the affirmative, based on what it does in the
first place?

[Bug tree-optimization/112824] New: Stack spills and vector splitting with vector builtins

2023-12-02 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

Bug ID: 112824
   Summary: Stack spills and vector splitting with vector builtins
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: elrodc at gmail dot com
  Target Milestone: ---

I am not sure which component to place this under, but selected
tree-optimization as I suspect this is some sort of alias analysis failure
preventing the removal of stack allocations.

Godbolt link, reproduces on GCC trunk and 13.2:
https://godbolt.org/z/4TPx17Mbn
Clang has similar problems in my actual test case, but they don't show up in
this minimal example I made, although Clang isn't perfect here either: it fails
to fuse fmadd + masked vmovapd, while GCC does succeed in fusing them.

For reference, code behind the godbolt link is:

#include <bit>
#include <concepts>
#include <cstddef>
#include <cstdint>

template <ptrdiff_t W, typename T>
using Vec [[gnu::vector_size(W * sizeof(T))]] = T;


// Omitted: 16 without AVX, 32 without AVX512F,
// or for forward compatibility some AVX10 may also mean 32-only
static constexpr ptrdiff_t VectorBytes = 64;
template <typename T>
static constexpr ptrdiff_t VecWidth = 64 <= sizeof(T) ? 1 : 64 / sizeof(T);

template <typename T, ptrdiff_t N> struct Vector {
  static constexpr ptrdiff_t L = N;
  T data[L];
  static constexpr auto size() -> ptrdiff_t { return N; }
};
template <std::floating_point T, ptrdiff_t N> struct Vector<T, N> {
  static constexpr ptrdiff_t W =
    N >= VecWidth<T> ? VecWidth<T> : ptrdiff_t(std::bit_ceil(size_t(N)));
  static constexpr ptrdiff_t L = (N / W) + ((N % W) != 0);
  using V = Vec<W, T>;
  V data[L];
  static constexpr auto size() -> ptrdiff_t { return N; }
};
/// should be trivially copyable
/// codegen is worse when passing by value, even though it seems like it
/// should make aliasing simpler to analyze?
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(Vector<T, N> x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x.data[n] + y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator*(Vector<T, N> x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x.data[n] * y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(T x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x + y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator*(T x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x * y.data[n];
  return z;
}



template <typename T, ptrdiff_t N> struct Dual {
  T value;
  Vector<T, N> partials;
};
// Here we have a specialization for non-power-of-2 `N`
template <typename T, ptrdiff_t N>
requires(std::floating_point<T> && (std::popcount(size_t(N)) > 1))
struct Dual<T, N> {
  Vector<T, N + 1> data;
};


template <ptrdiff_t W, typename T>
consteval auto firstoff() {
  static_assert(std::same_as<T, double>, "type not implemented");
  if constexpr (W == 2) return Vec<2, int64_t>{0, 1} != 0;
  else if constexpr (W == 4) return Vec<4, int64_t>{0, 1, 2, 3} != 0;
  else if constexpr (W == 8) return Vec<8, int64_t>{0, 1, 2, 3, 4, 5, 6, 7} != 0;
  else static_assert(false, "vector width not implemented");
}

template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(Dual<T, N> a, Dual<T, N> b)
  -> Dual<T, N> {
  if constexpr (std::floating_point<T> && (std::popcount(size_t(N)) > 1)) {
    Dual<T, N> c;
    for (ptrdiff_t l = 0; l < Vector<T, N + 1>::L; ++l)
      c.data.data[l] = a.data.data[l] + b.data.data[l];
    return c;
  } else return {a.value + b.value, a.partials + b.partials};
}

template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator*(Dual<T, N> a, Dual<T, N> b)
  -> Dual<T, N> {
  if constexpr (std::floating_point<T> && (std::popcount(size_t(N)) > 1)) {
    using V = typename Vector<T, N + 1>::V;
    V va = V{} + a.data.data[0][0], vb = V{} + b.data.data[0][0];
    V x = va * b.data.data[0];
    Dual<T, N> c;
    c.data.data[0] = firstoff<Vector<T, N + 1>::W, T>() ? x + vb * a.data.data[0] : x;
    for (ptrdiff_t l = 1; l < Vector<T, N + 1>::L; ++l)
      c.data.data[l] = va * b.data.data[l] + vb * a.data.data[l];
    return c;
  } else return {a.value * b.value, a.value * b.partials + b.value * a.partials};
}

void prod(Dual<Dual<double, 7>, 2> &c, const Dual<Dual<double, 7>, 2> &a,
          const Dual<Dual<double, 7>, 2> &b) {
  c = a * b;
}
void prod(Dual<Dual<double, 8>, 2> &c, const Dual<Dual<double, 8>, 2> &a,
          const Dual<Dual<double, 8>, 2> &b) {
  c = a * b;
}


GCC 13.2 asm, when compiling with
-std=gnu++23 -march=skylake-avx512 -mprefer-vector-width=512 -O3


prod(Dual<Dual<double, 7l>, 2l>&, Dual<Dual<double, 7l>, 2l> const&, Dual<Dual<double, 7l>, 2l> const&):
push    rbp
mov eax, -2
kmovb   k1, eax
mov rbp, rsp
and rsp, -64
sub rsp, 264
vmovdqa ymm4, YMMWORD PTR [rsi+128]
vmovapd zmm8, ZMMWORD PTR [rsi]
vmovapd zmm9, ZMMWORD PTR [rdx]
vmovdqa ymm6, YMMWORD PTR [rsi+64]
vmovdqa YMMWORD PTR [rsp+8], ymm4
vmovdqa ymm4, YMMWORD PTR [rdx+96]
vbroadcastsd    zmm0, xmm8
  

[Bug tree-optimization/112824] Stack spills and vector splitting with vector builtins

2023-12-02 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #1 from Chris Elrod  ---
Here I have added a godbolt example where I manually unroll the array, where
GCC generates excellent code https://godbolt.org/z/sd4bhGW7e
I'm not sure it is 100% optimal, but with an inner Dual size of `7`, on
Skylake-X it is 38 uops for unrolled GCC with separate struct fields, vs 49
uops for Clang, vs 67 for GCC with arrays.
uica expects <14 clock cycles for the manually unrolled vs >23 for the array
version.

My experience so far with expression templates has borne this out: compilers
seem to struggle with peeling away abstractions.

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-03 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #2 from Chris Elrod  ---
https://godbolt.org/z/3648aMTz8

Perhaps a simpler diff is that you can reproduce with the pragma commented out,
while codegen becomes good with it.

template <typename T>
constexpr auto operator*(OuterDualUA2<T> a, OuterDualUA2<T> b) -> OuterDualUA2<T> {
  //return {a.value*b.value,a.value*b.p[0]+b.value*a.p[0],a.value*b.p[1]+b.value*a.p[1]};
  OuterDualUA2<T> c;
  c.value = a.value*b.value;
#pragma GCC unroll 16
  for (ptrdiff_t i = 0; i < 2; ++i)
    c.p[i] = a.value*b.p[i] + b.value*a.p[i];
  //c.p[0] = a.value*b.p[0] + b.value*a.p[0];
  //c.p[1] = a.value*b.p[1] + b.value*a.p[1];
  return c;
}


It's not great to have to add pragmas everywhere to my actual codebase. I
thought I hit the important cases, but my non-minimal example still gets
unnecessary register splits and stack spills, so maybe I missed places, or
perhaps there's another issue.

Given that GCC unrolls the above code even without the pragma, it seems like a
definite bug that the pragma is needed for the resulting code generation to
actually be good.
Not knowing the compiler pipeline, my naive guess is that the pragma causes
earlier unrolling than whatever optimization pass does it sans pragma, and that
some important analysis/optimization gets run between those two times.

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-03 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #3 from Chris Elrod  ---
> I thought I hit the important cases, but my non-minimal example still gets 
> unnecessary register splits and stack spills, so maybe I missed places, or 
> perhaps there's another issue.

Adding the unroll pragma to the `Vector`'s operator + and *:

template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(Vector<T, N> x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x.data[n] + y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator*(Vector<T, N> x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x.data[n] * y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(T x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x + y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator*(T x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x * y.data[n];
  return z;
}


does not improve code generation (I still get the same problem), so that is a
reproducer for such an issue.

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-04 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #6 from Chris Elrod  ---
Hongtao Liu, I do think that one should ideally be able to get optimal codegen
when using 512-bit builtin vectors or vector intrinsics, without needing to set
`-mprefer-vector-width=512` (and, currently, also setting
`-mtune-ctrl=avx512_move_by_pieces`).

For example, if I remove `-mprefer-vector-width=512`, I get

prod(Dual<Dual<double, 7l>, 2l>&, Dual<Dual<double, 7l>, 2l> const&, Dual<Dual<double, 7l>, 2l> const&):
push    rbp
mov eax, -2
kmovb   k1, eax
mov rbp, rsp
and rsp, -64
sub rsp, 264
vmovdqa ymm4, YMMWORD PTR [rsi+128]
vmovapd zmm8, ZMMWORD PTR [rsi]
vmovapd zmm9, ZMMWORD PTR [rdx]
vmovdqa ymm6, YMMWORD PTR [rsi+64]
vmovdqa YMMWORD PTR [rsp+8], ymm4
vmovdqa ymm4, YMMWORD PTR [rdx+96]
vbroadcastsd    zmm0, xmm8
vmovdqa ymm7, YMMWORD PTR [rsi+96]
vbroadcastsd    zmm1, xmm9
vmovdqa YMMWORD PTR [rsp-56], ymm6
vmovdqa ymm5, YMMWORD PTR [rdx+128]
vmovdqa ymm6, YMMWORD PTR [rsi+160]
vmovdqa YMMWORD PTR [rsp+168], ymm4
vxorpd  xmm4, xmm4, xmm4
vaddpd  zmm0, zmm0, zmm4
vaddpd  zmm1, zmm1, zmm4
vmovdqa YMMWORD PTR [rsp-24], ymm7
vmovdqa ymm7, YMMWORD PTR [rdx+64]
vmovapd zmm3, ZMMWORD PTR [rsp-56]
vmovdqa YMMWORD PTR [rsp+40], ymm6
vmovdqa ymm6, YMMWORD PTR [rdx+160]
vmovdqa YMMWORD PTR [rsp+200], ymm5
vmulpd  zmm2, zmm0, zmm9
vmovdqa YMMWORD PTR [rsp+136], ymm7
vmulpd  zmm5, zmm1, zmm3
vbroadcastsd    zmm3, xmm3
vmovdqa YMMWORD PTR [rsp+232], ymm6
vaddpd  zmm3, zmm3, zmm4
vmovapd zmm7, zmm2
vmovapd zmm2, ZMMWORD PTR [rsp+8]
vfmadd231pd zmm7{k1}, zmm8, zmm1
vmovapd zmm6, zmm5
vmovapd zmm5, ZMMWORD PTR [rsp+136]
vmulpd  zmm1, zmm1, zmm2
vfmadd231pd zmm6{k1}, zmm9, zmm3
vbroadcastsd    zmm2, xmm2
vmovapd zmm3, ZMMWORD PTR [rsp+200]
vaddpd  zmm2, zmm2, zmm4
vmovapd ZMMWORD PTR [rdi], zmm7
vfmadd231pd zmm1{k1}, zmm9, zmm2
vmulpd  zmm2, zmm0, zmm5
vbroadcastsd    zmm5, xmm5
vmulpd  zmm0, zmm0, zmm3
vbroadcastsd    zmm3, xmm3
vaddpd  zmm5, zmm5, zmm4
vaddpd  zmm3, zmm3, zmm4
vfmadd231pd zmm2{k1}, zmm8, zmm5
vfmadd231pd zmm0{k1}, zmm8, zmm3
vaddpd  zmm2, zmm2, zmm6
vaddpd  zmm0, zmm0, zmm1
vmovapd ZMMWORD PTR [rdi+64], zmm2
vmovapd ZMMWORD PTR [rdi+128], zmm0
vzeroupper
leave
ret
prod(Dual<Dual<double, 8l>, 2l>&, Dual<Dual<double, 8l>, 2l> const&, Dual<Dual<double, 8l>, 2l> const&):
push    rbp
mov rbp, rsp
and rsp, -64
sub rsp, 648
vmovdqa ymm5, YMMWORD PTR [rsi+224]
vmovdqa ymm3, YMMWORD PTR [rsi+352]
vmovapd zmm0, ZMMWORD PTR [rdx+64]
vmovdqa ymm2, YMMWORD PTR [rsi+320]
vmovdqa YMMWORD PTR [rsp+104], ymm5
vmovdqa ymm5, YMMWORD PTR [rdx+224]
vmovdqa ymm7, YMMWORD PTR [rsi+128]
vmovdqa YMMWORD PTR [rsp+232], ymm3
vmovsd  xmm3, QWORD PTR [rsi]
vmovdqa ymm6, YMMWORD PTR [rsi+192]
vmovdqa YMMWORD PTR [rsp+488], ymm5
vmovdqa ymm4, YMMWORD PTR [rdx+192]
vmovapd zmm1, ZMMWORD PTR [rsi+64]
vbroadcastsd    zmm5, xmm3
vmovdqa YMMWORD PTR [rsp+200], ymm2
vmovdqa ymm2, YMMWORD PTR [rdx+320]
vmulpd  zmm8, zmm5, zmm0
vmovdqa YMMWORD PTR [rsp+8], ymm7
vmovdqa ymm7, YMMWORD PTR [rsi+256]
vmovdqa YMMWORD PTR [rsp+72], ymm6
vmovdqa ymm6, YMMWORD PTR [rdx+128]
vmovdqa YMMWORD PTR [rsp+584], ymm2
vmovsd  xmm2, QWORD PTR [rdx]
vmovdqa YMMWORD PTR [rsp+136], ymm7
vmovdqa ymm7, YMMWORD PTR [rdx+256]
vmovdqa YMMWORD PTR [rsp+392], ymm6
vmovdqa ymm6, YMMWORD PTR [rdx+352]
vmulsd  xmm10, xmm3, xmm2
vmovdqa YMMWORD PTR [rsp+456], ymm4
vbroadcastsd    zmm4, xmm2
vfmadd231pd zmm8, zmm4, zmm1
vmovdqa YMMWORD PTR [rsp+520], ymm7
vmovdqa YMMWORD PTR [rsp+616], ymm6
vmulpd  zmm9, zmm4, ZMMWORD PTR [rsp+72]
vmovsd  xmm6, QWORD PTR [rsp+520]
vmulpd  zmm4, zmm4, ZMMWORD PTR [rsp+200]
vmulpd  zmm11, zmm5, ZMMWORD PTR [rsp+456]
vmovsd  QWORD PTR [rdi], xmm10
vmulpd  zmm5, zmm5, ZMMWORD PTR [rsp+584]
vmovapd ZMMWORD PTR [rdi+64], zmm8
vfmadd231pd zmm9, zmm0, QWORD PTR [rsp+8]{1to8}
vfmadd231pd zmm4, zmm0, QWORD PTR [rsp+136]{1to8}
vmovsd  xmm0, QWORD PTR [rsp+392]
vmulsd  xmm7, xmm3, xmm0
vbroadcastsd    zmm0, xmm0
vmulsd  xmm3, xmm3, xmm6
vfmadd132pd zmm0, zmm11, zmm1
vbroadcastsd    zmm6, xmm6
vfmadd132pd zmm1, zmm5, zmm6
vfmadd231sd xmm7, xmm2, QWORD PTR [rsp+8]
vfmadd132sd  

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-04 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #8 from Chris Elrod  ---
> If it's designed the way you want it to be, another issue would be like, 
> should we lower 512-bit vector builtins/intrinsic to ymm/xmm when 
> -mprefer-vector-width=256, the answer is we'd rather not. 

To be clear, what I meant by

>  it would be great to respect
> `-mprefer-vector-width=512`, it should ideally also be able to respect
> vector builtins/intrinsics

is that when someone uses 512-bit vector builtins, the codegen should generate
512-bit code regardless of `-mprefer-vector-width` settings.
That is, as a developer, I would want 512-bit builtins to mean we get 512-bit
vector code generation.
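
Concretely, a minimal sketch of the kind of code I mean (an illustrative
function, not taken from the testcase):

// An explicitly 64-byte (512-bit) builtin vector; the expectation described
// above is that operations on it use zmm registers regardless of
// -mprefer-vector-width.
typedef double v8df __attribute__((vector_size(64)));

v8df scale_add(v8df a, v8df b, double s) {
  return a * s + b;
}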

>  If user explicitly use 512-bit vector type, builtins or intrinsics, gcc will 
> generate zmm no matter -mprefer-vector-width=.

This is what I would want, and I'd also want it to apply to movement of
`struct`s holding vector builtin objects, instead of the `ymm` usage as we see
here.

> And yes, there could be some mismatches between 512-bit intrinsic and 
> architecture tuning when you're using 512-bit intrinsic, and also rely on 
> compiler autogen to handle struct
> For such case, an explicit -mprefer-vector-width=512 is needed.

Note the template partial specialization

template <std::floating_point T, ptrdiff_t N> struct Vector<T, N> {
  static constexpr ptrdiff_t W =
    N >= VecWidth<T> ? VecWidth<T> : ptrdiff_t(std::bit_ceil(size_t(N)));
  static constexpr ptrdiff_t L = (N / W) + ((N % W) != 0);
  using V = Vec<W, T>;
  V data[L];
  static constexpr auto size() -> ptrdiff_t { return N; }
};

Thus, `Vector`s in this example may explicitly be structs containing arrays of
vector builtins. I would expect these structs not to need an
`-mprefer-vector-width=512` setting in order to produce 512-bit code handling
this struct.
Given small `L`, I would also expect passing this struct as an argument by
value to a non-inlined function to be done in `zmm` registers when possible,
for example.
for example.