[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929 Chris Elrod changed: Added CC: elrodc at gmail dot com
--- Comment #29 from Chris Elrod ---
"RESOLVED FIXED". I haven't tried this with `target`, but avx512bw does not work with `target_clones` on gcc 11.2, though it does with clang 14.
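For reference, a reduced sketch of the kind of usage that failed for me (the function itself is just an illustration, not the code I was actually building):

// Hypothetical reduced example: the avx512bw clone did not work for me on
// gcc 11.2, while clang 14 accepted the same attribute.
__attribute__((target_clones("avx512bw", "default")))
int count_eq(const unsigned char *p, int n, unsigned char x) {
  int c = 0;
  for (int i = 0; i < n; ++i) c += (p[i] == x); // byte compares, avx512bw territory
  return c;
}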
[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929 --- Comment #30 from Chris Elrod ---
> #if defined(__clang__)
> #define MULTIVERSION                                                         \
>   __attribute__((target_clones("avx512dq", "avx2", "default")))
> #else
> #define MULTIVERSION                                                         \
>   __attribute__((target_clones(                                              \
>       "arch=skylake-avx512,arch=cascadelake,arch=icelake-client,arch="       \
>       "tigerlake,"                                                           \
>       "arch=icelake-server,arch=sapphirerapids,arch=cooperlake",             \
>       "avx2", "default")))
> #endif
For example, I can do something like this, but gcc produces a ton of unnecessary duplicates for each of the avx512dq architectures. There must be a better way.
[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929 --- Comment #32 from Chris Elrod --- Ha, I accidentally misreported my gcc version. I was already using 12.1.1. Using x86-64-v4 worked, excellent! Thanks.
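To make the fix concrete, the form below is roughly what I switched to; the macro is the one from comment #30, and `arch=x86-64-v4` stands in for the per-microarchitecture list (a sketch, assuming `target_clones` accepts the `arch=x86-64-v4` spelling the same way it accepts the other `arch=` strings above):

#define MULTIVERSION \
  __attribute__((target_clones("arch=x86-64-v4", "avx2", "default")))

// Hypothetical use; any function body works the same way.
MULTIVERSION
void axpy(double *y, const double *x, double a, int n) {
  for (int i = 0; i < n; ++i) y[i] += a * x[i];
}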
[Bug target/114276] New: Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276 Bug ID: 114276 Summary: Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong` Product: gcc Version: 13.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- Created attachment 57651 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57651&action=edit test file

I'm not sure how to categorize the issue, so I picked "target" as it occurs for x86_64 when using aligned moves on 64-byte avx512 vectors. `-std=c++23` also reproduces the problem. I am using:

> g++ --version
> g++ (GCC) 13.2.1 20231205 (Red Hat 13.2.1-6)
> Copyright (C) 2023 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The attached file is:

> #include <cstddef>
> #include <cstdint>
>
> template <ptrdiff_t W, typename T>
> using Vec [[gnu::vector_size(W * sizeof(T))]] = T;
>
> auto foo() {
>   Vec<8, int64_t> ret{};
>   return ret;
> }
>
> int main() {
>   foo();
>   return 0;
> }

I have attached this file. On a skylake-avx512 CPU, I get

> g++ -std=gnu++23 -march=skylake-avx512 -fstack-protector-strong -O0 -g
> -mprefer-vector-width=512 -fsanitize=address,undefined -fsanitize-trap=all
> simdvecalign.cpp && ./a.out
AddressSanitizer:DEADLYSIGNAL
=================================================================
==36238==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0040125c bp 0x7ffdf88a1cb0 sp 0x7ffdf88a1bc0 T0)
==36238==The signal is caused by a READ memory access.
==36238==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used.
    #0 0x40125c in foo() /home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:8
    #1 0x4012d1 in main /home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:13
    #2 0x7f296b846149 in __libc_start_call_main (/lib64/libc.so.6+0x28149) (BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
    #3 0x7f296b84620a in __libc_start_main_impl (/lib64/libc.so.6+0x2820a) (BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
    #4 0x4010a4 in _start (/home/chriselrod/Documents/progwork/cxx/experiments/a.out+0x4010a4) (BuildId: 765272b0173968b14f4306c8d4a37fcb18733889)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:8 in foo()
==36238==ABORTING
fish: Job 1, './a.out' terminated by signal SIGABRT (Abort)

However, if I remove any of `-std=gnu++23`, `-fsanitize=address`, or `-fstack-protector-strong`, the code runs without a problem. Using 32 byte vectors instead of 64 byte also allows it to work.

I also used `-S` to look at the assembly. When I edit the two lines:

> vmovdqa64 %zmm0, -128(%rdx)
> .loc 1 9 10
> vmovdqa64 -128(%rdx), %zmm0

swapping `vmovdqa64` for `vmovdqu64`, the code runs as intended.

> g++ -fsanitize=address simdvecalign.s # using vmovdqu64
> ./a.out
> g++ -fsanitize=address simdvecalign.s # reverted back to vmovdqa64
> ./a.out
AddressSanitizer:DEADLYSIGNAL
=================================================================
==40364==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0040125c bp 0x7ffd2e2dc240 sp 0x7ffd2e2dc140 T0)

so I am inclined to think that something isn't guaranteeing that `%rdx` is actually 64-byte aligned (but it may be 32-byte aligned, given that I can't reproduce with 32 byte vectors).
[Bug target/114276] Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276 --- Comment #1 from Chris Elrod --- Created attachment 57652 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57652&action=edit assembly from adding `-S`
[Bug target/110027] Misaligned vector store on detect_stack_use_after_return
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027 --- Comment #9 from Chris Elrod ---
> Interestingly this seems to be only reproducible on Arch Linux. Other gcc
> 13.1.1 builds, Fedora for instance, seem to behave correctly.
I haven't tried that reproducer on Fedora with gcc 13.2.1, which could have regressed since 13.1.1. However, the dup example in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276 does reproduce on Fedora with gcc 13.2.1 once you add the extra compile flags `-std=c++23 -fstack-protector-strong`. I'll try the original reproducer later; it may be that adding/removing these flags fuzzes the alignment.
[Bug c++/111493] New: [concepts] multidimensional subscript operator inside requires is broken
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111493 Bug ID: 111493 Summary: [concepts] multidimensional subscript operator inside requires is broken Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- Two example programs:

> #include 
> constexpr auto foo(const auto &A, int i, int j)
>   requires(requires(decltype(A) a, int ii) { a[ii, ii]; }) {
>   return A[i, j];
> }
> constexpr auto foo(const auto &A, int i, int j) {
>   return A + i + j;
> }
> static_assert(foo(2, 3, 4) == 9);

> #include <concepts>
> template <typename T>
> concept CartesianIndexable = requires(T t, int i) {
>   { t[i, i] } -> std::convertible_to;
> };
> static_assert(!CartesianIndexable);

These result in errors of the form

error: invalid types 'const int[int]' for array subscript

Here is godbolt for reference: https://godbolt.org/z/WE66nY8zG

The invalid subscript should result in the `requires` failing, not an error.
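For contrast, a sketch of the behavior I would expect (my illustration, not from the report): the analogous single-subscript requirement is simply unsatisfied for a type without `operator[]`, making the concept false instead of producing a hard error.

#include <concepts>

// Hypothetical names; the point is that the invalid expression `t[i]` for
// T = int just makes the requires-expression unsatisfied.
template <typename T, typename Elt>
concept LinearIndexable = requires(T t, int i) {
  { t[i] } -> std::convertible_to<Elt>;
};

static_assert(!LinearIndexable<int, int>);  // fine: the constraint is false, no error
static_assert(LinearIndexable<int *, int>); // int*[int] yields int&, convertible to int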
[Bug c++/111493] [concepts] multidimensional subscript operator inside requires is broken
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111493 --- Comment #2 from Chris Elrod --- Note that it also shows up in gcc-13. I put gcc-14 as the version to indicate that I confirmed it is still a problem on latest trunk. Not sure what the policy is on which version we should report.
[Bug c++/93008] Need a way to make inlining heuristics ignore whether a function is inline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008 --- Comment #14 from Chris Elrod ---
To me, an "inline" function is one that the compiler inlines. It just happens that the `inline` keyword also provides comdat semantics and, with -fvisibility-inlines-hidden, can hide the symbol to make it internal. It also happens that the vast majority of the time I mark a function `inline`, it is because of this, not because of the inlining hint. `static` of course also gives internal linkage, but I generally prefer the comdat semantics: I'd rather merge the definitions than duplicate them. If a new keyword or pragma is added meaning comdat semantics (and preferably also internal linkage), I would rather have the name reference that. I'd rather have a positive name describing what it does than a negative one: "quasi_inline: like inline, except it does everything inline does except the inline part". Why define it as a set difference -- naming it after the thing it does not do -- when you could define it affirmatively, based on what it does in the first place?
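A minimal sketch of what I mean by using `inline` purely for the comdat semantics (my own illustration; the header and function are hypothetical):

// some_header.hpp (hypothetical): defined in a header included by many TUs.
// `inline` is required here so that each TU's copy is a comdat the linker
// merges into one definition; whether calls are actually inlined is a
// separate question, which is exactly what a dedicated keyword or pragma
// for comdat semantics would let us express independently.
inline double weighted_sum(const double *x, const double *w, int n) {
  double s = 0.0;
  for (int i = 0; i < n; ++i) s += x[i] * w[i];
  return s;
}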
[Bug tree-optimization/112824] New: Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 Bug ID: 112824 Summary: Stack spills and vector splitting with vector builtins Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: ---

I am not sure which component to place this under, but selected tree-optimization as I suspect this is some sort of alias analysis failure preventing the removal of stack allocations.

Godbolt link, reproduces on GCC trunk and 13.2: https://godbolt.org/z/4TPx17Mbn

Clang has similar problems in my actual test case, but they don't show up in this minimal example I made. Although Clang isn't perfect here either: it fails to fuse fmadd + masked vmovapd, while GCC does succeed in fusing them.

For reference, code behind the godbolt link is:

#include <bit>
#include <concepts>
#include <cstddef>
#include <cstdint>

template using Vec [[gnu::vector_size(W * sizeof(T))]] = T;

// Omitted: 16 without AVX, 32 without AVX512F,
// or for forward compatibility some AVX10 may also mean 32-only
static constexpr ptrdiff_t VectorBytes = 64;
template static constexpr ptrdiff_t VecWidth = 64 <= sizeof(T) ? 1 : 64/sizeof(T);

template struct Vector{
  static constexpr ptrdiff_t L = N;
  T data[L];
  static constexpr auto size()->ptrdiff_t{return N;}
};
template struct Vector{
  static constexpr ptrdiff_t W = N >= VecWidth ? VecWidth : ptrdiff_t(std::bit_ceil(size_t(N)));
  static constexpr ptrdiff_t L = (N/W) + ((N%W)!=0);
  using V = Vec;
  V data[L];
  static constexpr auto size()->ptrdiff_t{return N;}
};
/// should be trivially copyable
/// codegen is worse when passing by value, even though it seems like it should make
/// aliasing simpler to analyze?
template [[gnu::always_inline]] constexpr auto operator+(Vector x, Vector y) -> Vector {
  Vector z;
  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] + y.data[n];
  return z;
}
template [[gnu::always_inline]] constexpr auto operator*(Vector x, Vector y) -> Vector {
  Vector z;
  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] * y.data[n];
  return z;
}
template [[gnu::always_inline]] constexpr auto operator+(T x, Vector y) -> Vector {
  Vector z;
  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x + y.data[n];
  return z;
}
template [[gnu::always_inline]] constexpr auto operator*(T x, Vector y) -> Vector {
  Vector z;
  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x * y.data[n];
  return z;
}

template struct Dual {
  T value;
  Vector partials;
};
// Here we have a specialization for non-power-of-2 `N`
template requires(std::floating_point && (std::popcount(size_t(N))>1))
struct Dual {
  Vector data;
};

template consteval auto firstoff(){
  static_assert(std::same_as, "type not implemented");
  if constexpr (W==2) return Vec<2,int64_t>{0,1} != 0;
  else if constexpr (W == 4) return Vec<4,int64_t>{0,1,2,3} != 0;
  else if constexpr (W == 8) return Vec<8,int64_t>{0,1,2,3,4,5,6,7} != 0;
  else static_assert(false, "vector width not implemented");
}

template [[gnu::always_inline]] constexpr auto operator+(Dual a, Dual b) -> Dual {
  if constexpr (std::floating_point && (std::popcount(size_t(N))>1)){
    Dual c;
    for (ptrdiff_t l = 0; l < Vector::L; ++l)
      c.data.data[l] = a.data.data[l] + b.data.data[l];
    return c;
  } else return {a.value + b.value, a.partials + b.partials};
}
template [[gnu::always_inline]] constexpr auto operator*(Dual a, Dual b) -> Dual {
  if constexpr (std::floating_point && (std::popcount(size_t(N))>1)){
    using V = typename Vector::V;
    V va = V{}+a.data.data[0][0], vb = V{}+b.data.data[0][0];
    V x = va * b.data.data[0];
    Dual c;
    c.data.data[0] = firstoff::W,T>() ? x + vb*a.data.data[0] : x;
    for (ptrdiff_t l = 1; l < Vector::L; ++l)
      c.data.data[l] = va*b.data.data[l] + vb*a.data.data[l];
    return c;
  } else return {a.value * b.value, a.value * b.partials + b.value * a.partials};
}

void prod(Dual,2> &c, const Dual,2> &a, const Dual,2>&b){ c = a*b; }
void prod(Dual,2> &c, const Dual,2> &a, const Dual,2>&b){ c = a*b; }

GCC 13.2 asm, when compiling with -std=gnu++23 -march=skylake-avx512 -mprefer-vector-width=512 -O3

prod(Dual, 2l>&, Dual, 2l> const&, Dual, 2l> const&):
        push    rbp
        mov     eax, -2
        kmovb   k1, eax
        mov     rbp, rsp
        and     rsp, -64
        sub     rsp, 264
        vmovdqa ymm4, YMMWORD PTR [rsi+128]
        vmovapd zmm8, ZMMWORD PTR [rsi]
        vmovapd zmm9, ZMMWORD PTR [rdx]
        vmovdqa ymm6, YMMWORD PTR [rsi+64]
        vmovdqa YMMWORD PTR [rsp+8], ymm4
        vmovdqa ymm4, YMMWORD PTR [rdx+96]
        vbroadcastsd    zmm0, xmm8
[Bug tree-optimization/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #1 from Chris Elrod --- Here I have added a godbolt example where I manually unroll the array, and GCC generates excellent code: https://godbolt.org/z/sd4bhGW7e I'm not sure it is 100% optimal, but with an inner Dual size of `7`, on Skylake-X it is 38 uops for unrolled GCC with separate struct fields, vs 49 uops for Clang, vs 67 for GCC with arrays. uiCA expects <14 clock cycles for the manually unrolled version vs >23 for the array version. My experience so far with expression templates has borne this out: compilers seem to struggle with peeling away the abstractions.
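To illustrate what I mean by manually unrolling into separate struct fields (a sketch with made-up names, not the exact code behind the godbolt link):

// Sketch: 8-wide double vectors; DualArr/DualFields are illustrative names.
using V8d [[gnu::vector_size(8 * sizeof(double))]] = double;

struct DualArr    { V8d value; V8d p[2]; };   // partials in an array, looped over
struct DualFields { V8d value; V8d p0, p1; }; // partials as separate named fields

// With separate fields the products are simply written out; this is the style
// that compiled to the tighter code in the godbolt example above.
[[gnu::always_inline]] inline auto operator*(DualFields a, DualFields b) -> DualFields {
  return {a.value * b.value,
          a.value * b.p0 + b.value * a.p0,
          a.value * b.p1 + b.value * a.p1};
}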
[Bug middle-end/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #2 from Chris Elrod ---
https://godbolt.org/z/3648aMTz8

Perhaps a simpler diff is that you can reproduce by removing the pragma below, while codegen becomes good with it.

template constexpr auto operator*(OuterDualUA2 a, OuterDualUA2 b)->OuterDualUA2{
  //return {a.value*b.value,a.value*b.p[0]+b.value*a.p[0],a.value*b.p[1]+b.value*a.p[1]};
  OuterDualUA2 c;
  c.value = a.value*b.value;
#pragma GCC unroll 16
  for (ptrdiff_t i = 0; i < 2; ++i)
    c.p[i] = a.value*b.p[i] + b.value*a.p[i];
  //c.p[0] = a.value*b.p[0] + b.value*a.p[0];
  //c.p[1] = a.value*b.p[1] + b.value*a.p[1];
  return c;
}

It's not great to have to add pragmas everywhere in my actual codebase. I thought I had hit the important cases, but my non-minimal example still gets unnecessary register splits and stack spills, so maybe I missed places, or perhaps there's another issue.

Given that GCC unrolls the above code even without the pragma, it seems like a definite bug that the pragma is needed for the resulting code generation to actually be good. Not knowing the compiler pipeline, my naive guess is that the pragma causes earlier unrolling than whatever optimization pass does it sans pragma, and that some important analysis/optimization gets run between those two points.
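One way to check that guess (my suggestion, not something from the thread; treat the dump-flag list and file name as approximate) would be to compare GCC's loop-pass dumps with and without the pragma and see where the loop body actually gets unrolled:

> g++ -std=gnu++23 -march=skylake-avx512 -mprefer-vector-width=512 -O3 -S \
>   -fopt-info-loop -fdump-tree-cunrolli-details -fdump-tree-cunroll-details example.cpp

If the guess is right, the loop would already be gone in an early dump when the pragma is present, but only disappear in a later pass without it.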
[Bug middle-end/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #3 from Chris Elrod ---
> I thought I hit the important cases, but my non-minimal example still gets
> unnecessary register splits and stack spills, so maybe I missed places, or
> perhaps there's another issue.

Adding the unroll pragma to the `Vector`'s operator + and *:

template [[gnu::always_inline]] constexpr auto operator+(Vector x, Vector y) -> Vector {
  Vector z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] + y.data[n];
  return z;
}
template [[gnu::always_inline]] constexpr auto operator*(Vector x, Vector y) -> Vector {
  Vector z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] * y.data[n];
  return z;
}
template [[gnu::always_inline]] constexpr auto operator+(T x, Vector y) -> Vector {
  Vector z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x + y.data[n];
  return z;
}
template [[gnu::always_inline]] constexpr auto operator*(T x, Vector y) -> Vector {
  Vector z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x * y.data[n];
  return z;
}

does not improve code generation (still get the same problem), so that's a reproducer for such an issue.
[Bug middle-end/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #6 from Chris Elrod ---
Hongtao Liu, I do think that one should ideally be able to get optimal codegen when using 512-bit builtin vectors or vector intrinsics, without needing to set `-mprefer-vector-width=512` (and, currently, also setting `-mtune-ctrl=avx512_move_by_pieces`).

For example, if I remove `-mprefer-vector-width=512`, I get

prod(Dual, 2l>&, Dual, 2l> const&, Dual, 2l> const&):
        push    rbp
        mov     eax, -2
        kmovb   k1, eax
        mov     rbp, rsp
        and     rsp, -64
        sub     rsp, 264
        vmovdqa ymm4, YMMWORD PTR [rsi+128]
        vmovapd zmm8, ZMMWORD PTR [rsi]
        vmovapd zmm9, ZMMWORD PTR [rdx]
        vmovdqa ymm6, YMMWORD PTR [rsi+64]
        vmovdqa YMMWORD PTR [rsp+8], ymm4
        vmovdqa ymm4, YMMWORD PTR [rdx+96]
        vbroadcastsd    zmm0, xmm8
        vmovdqa ymm7, YMMWORD PTR [rsi+96]
        vbroadcastsd    zmm1, xmm9
        vmovdqa YMMWORD PTR [rsp-56], ymm6
        vmovdqa ymm5, YMMWORD PTR [rdx+128]
        vmovdqa ymm6, YMMWORD PTR [rsi+160]
        vmovdqa YMMWORD PTR [rsp+168], ymm4
        vxorpd  xmm4, xmm4, xmm4
        vaddpd  zmm0, zmm0, zmm4
        vaddpd  zmm1, zmm1, zmm4
        vmovdqa YMMWORD PTR [rsp-24], ymm7
        vmovdqa ymm7, YMMWORD PTR [rdx+64]
        vmovapd zmm3, ZMMWORD PTR [rsp-56]
        vmovdqa YMMWORD PTR [rsp+40], ymm6
        vmovdqa ymm6, YMMWORD PTR [rdx+160]
        vmovdqa YMMWORD PTR [rsp+200], ymm5
        vmulpd  zmm2, zmm0, zmm9
        vmovdqa YMMWORD PTR [rsp+136], ymm7
        vmulpd  zmm5, zmm1, zmm3
        vbroadcastsd    zmm3, xmm3
        vmovdqa YMMWORD PTR [rsp+232], ymm6
        vaddpd  zmm3, zmm3, zmm4
        vmovapd zmm7, zmm2
        vmovapd zmm2, ZMMWORD PTR [rsp+8]
        vfmadd231pd     zmm7{k1}, zmm8, zmm1
        vmovapd zmm6, zmm5
        vmovapd zmm5, ZMMWORD PTR [rsp+136]
        vmulpd  zmm1, zmm1, zmm2
        vfmadd231pd     zmm6{k1}, zmm9, zmm3
        vbroadcastsd    zmm2, xmm2
        vmovapd zmm3, ZMMWORD PTR [rsp+200]
        vaddpd  zmm2, zmm2, zmm4
        vmovapd ZMMWORD PTR [rdi], zmm7
        vfmadd231pd     zmm1{k1}, zmm9, zmm2
        vmulpd  zmm2, zmm0, zmm5
        vbroadcastsd    zmm5, xmm5
        vmulpd  zmm0, zmm0, zmm3
        vbroadcastsd    zmm3, xmm3
        vaddpd  zmm5, zmm5, zmm4
        vaddpd  zmm3, zmm3, zmm4
        vfmadd231pd     zmm2{k1}, zmm8, zmm5
        vfmadd231pd     zmm0{k1}, zmm8, zmm3
        vaddpd  zmm2, zmm2, zmm6
        vaddpd  zmm0, zmm0, zmm1
        vmovapd ZMMWORD PTR [rdi+64], zmm2
        vmovapd ZMMWORD PTR [rdi+128], zmm0
        vzeroupper
        leave
        ret
prod(Dual, 2l>&, Dual, 2l> const&, Dual, 2l> const&):
        push    rbp
        mov     rbp, rsp
        and     rsp, -64
        sub     rsp, 648
        vmovdqa ymm5, YMMWORD PTR [rsi+224]
        vmovdqa ymm3, YMMWORD PTR [rsi+352]
        vmovapd zmm0, ZMMWORD PTR [rdx+64]
        vmovdqa ymm2, YMMWORD PTR [rsi+320]
        vmovdqa YMMWORD PTR [rsp+104], ymm5
        vmovdqa ymm5, YMMWORD PTR [rdx+224]
        vmovdqa ymm7, YMMWORD PTR [rsi+128]
        vmovdqa YMMWORD PTR [rsp+232], ymm3
        vmovsd  xmm3, QWORD PTR [rsi]
        vmovdqa ymm6, YMMWORD PTR [rsi+192]
        vmovdqa YMMWORD PTR [rsp+488], ymm5
        vmovdqa ymm4, YMMWORD PTR [rdx+192]
        vmovapd zmm1, ZMMWORD PTR [rsi+64]
        vbroadcastsd    zmm5, xmm3
        vmovdqa YMMWORD PTR [rsp+200], ymm2
        vmovdqa ymm2, YMMWORD PTR [rdx+320]
        vmulpd  zmm8, zmm5, zmm0
        vmovdqa YMMWORD PTR [rsp+8], ymm7
        vmovdqa ymm7, YMMWORD PTR [rsi+256]
        vmovdqa YMMWORD PTR [rsp+72], ymm6
        vmovdqa ymm6, YMMWORD PTR [rdx+128]
        vmovdqa YMMWORD PTR [rsp+584], ymm2
        vmovsd  xmm2, QWORD PTR [rdx]
        vmovdqa YMMWORD PTR [rsp+136], ymm7
        vmovdqa ymm7, YMMWORD PTR [rdx+256]
        vmovdqa YMMWORD PTR [rsp+392], ymm6
        vmovdqa ymm6, YMMWORD PTR [rdx+352]
        vmulsd  xmm10, xmm3, xmm2
        vmovdqa YMMWORD PTR [rsp+456], ymm4
        vbroadcastsd    zmm4, xmm2
        vfmadd231pd     zmm8, zmm4, zmm1
        vmovdqa YMMWORD PTR [rsp+520], ymm7
        vmovdqa YMMWORD PTR [rsp+616], ymm6
        vmulpd  zmm9, zmm4, ZMMWORD PTR [rsp+72]
        vmovsd  xmm6, QWORD PTR [rsp+520]
        vmulpd  zmm4, zmm4, ZMMWORD PTR [rsp+200]
        vmulpd  zmm11, zmm5, ZMMWORD PTR [rsp+456]
        vmovsd  QWORD PTR [rdi], xmm10
        vmulpd  zmm5, zmm5, ZMMWORD PTR [rsp+584]
        vmovapd ZMMWORD PTR [rdi+64], zmm8
        vfmadd231pd     zmm9, zmm0, QWORD PTR [rsp+8]{1to8}
        vfmadd231pd     zmm4, zmm0, QWORD PTR [rsp+136]{1to8}
        vmovsd  xmm0, QWORD PTR [rsp+392]
        vmulsd  xmm7, xmm3, xmm0
        vbroadcastsd    zmm0, xmm0
        vmulsd  xmm3, xmm3, xmm6
        vfmadd132pd     zmm0, zmm11, zmm1
        vbroadcastsd    zmm6, xmm6
        vfmadd132pd     zmm1, zmm5, zmm6
        vfmadd231sd     xmm7, xmm2, QWORD PTR [rsp+8]
        vfmadd132sd
[Bug middle-end/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #8 from Chris Elrod ---
> If it's designed the way you want it to be, another issue would be like,
> should we lower 512-bit vector builtins/intrinsic to ymm/xmm when
> -mprefer-vector-width=256, the answer is we'd rather not.

To be clear, what I meant by

> it would be great to respect
> `-mprefer-vector-width=512`, it should ideally also be able to respect
> vector builtins/intrinsics

is that when someone uses 512 bit vector builtins, that codegen should generate 512 bit code regardless of `-mprefer-vector-width` settings. That is, as a developer, I would want 512 bit builtins to mean we get 512-bit vector code generation.

> If user explicitly use 512-bit vector type, builtins or intrinsics, gcc will
> generate zmm no matter -mprefer-vector-width=.

This is what I would want, and I'd also want it to apply to movement of `struct`s holding vector builtin objects, instead of the `ymm` usage as we see here.

> And yes, there could be some mismatches between 512-bit intrinsic and
> architecture tuning when you're using 512-bit intrinsic, and also rely on
> compiler autogen to handle struct
> For such case, an explicit -mprefer-vector-width=512 is needed.

Note the template partial specialization

template struct Vector{
  static constexpr ptrdiff_t W = N >= VecWidth ? VecWidth : ptrdiff_t(std::bit_ceil(size_t(N)));
  static constexpr ptrdiff_t L = (N/W) + ((N%W)!=0);
  using V = Vec;
  V data[L];
  static constexpr auto size()->ptrdiff_t{return N;}
};

Thus, `Vector`s in this example may explicitly be structs containing arrays of vector builtins. I would expect these structs to not need an `-mprefer-vector-width=512` setting for producing 512 bit code handling this struct. Given small `L`, I would also expect passing this struct as an argument by value to a non-inlined function to be done in `zmm` registers when possible, for example.
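As a concrete illustration of that last expectation (a hypothetical sketch, not code from the report): with a struct like the one below, I would want copies of it, and by-value argument passing to a non-inlined callee, to use zmm moves rather than being split into ymm halves, even without `-mprefer-vector-width=512`.

#include <cstddef>

template <ptrdiff_t W, typename T>
using Vec [[gnu::vector_size(W * sizeof(T))]] = T;

// L == 1 here; the struct name and the callee are hypothetical.
struct V8x1 { Vec<8, double> data[1]; };

double take_by_value(V8x1 v); // defined elsewhere, not inlined

double call(const V8x1 &v) { return take_by_value(v); } // the copy is what I'd want done via zmm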