[Bug middle-end/95899] New: -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899

Bug ID: 95899
Summary: -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains
Product: gcc
Version: 10.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---

Created attachment 48784 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48784&action=edit
cc -march=skylake-avx512 -mprefer-vector-width=512 -Ofast -funroll-loops -S dot.c -o dot.s

Sample code:
```
double dot(double* a, double* b, long N){
    double s = 0.0;
    for (long n = 0; n < N; n++){
        s += a[n] * b[n];
    }
    return s;
}
```

Relevant part of the asm:
```
.L4:
        vmovupd (%rdi,%r11), %zmm8
        vmovupd 64(%rdi,%r11), %zmm9
        vfmadd231pd     (%rsi,%r11), %zmm8, %zmm0
        vmovupd 128(%rdi,%r11), %zmm10
        vmovupd 192(%rdi,%r11), %zmm11
        vmovupd 256(%rdi,%r11), %zmm12
        vmovupd 320(%rdi,%r11), %zmm13
        vfmadd231pd     64(%rsi,%r11), %zmm9, %zmm0
        vmovupd 384(%rdi,%r11), %zmm14
        vmovupd 448(%rdi,%r11), %zmm15
        vfmadd231pd     128(%rsi,%r11), %zmm10, %zmm0
        vfmadd231pd     192(%rsi,%r11), %zmm11, %zmm0
        vfmadd231pd     256(%rsi,%r11), %zmm12, %zmm0
        vfmadd231pd     320(%rsi,%r11), %zmm13, %zmm0
        vfmadd231pd     384(%rsi,%r11), %zmm14, %zmm0
        vfmadd231pd     448(%rsi,%r11), %zmm15, %zmm0
        addq    $512, %r11
        cmpq    %r8, %r11
        jne     .L4
```

Skylake-AVX512's vfmadd should have a throughput of 2/cycle, but a latency of 4 cycles. Because each unrolled instance accumulates into `%zmm0`, we are limited by the dependency chain to 1 fma every 4 cycles. It should use separate accumulators.

Additionally, if the loads are aligned, it would have a throughput of 2 loads/cycle. Because we need 2 loads per fma, that limits us to only 1 fma per cycle. If the dependency chain were the primary motivation for unrolling, we'd only want to unroll by 4, not 8: 4 cycles of latency at 1 fma per cycle -> 4 simultaneous / OoO fmas. Something like a sum (1 load per add) would perform better with the 8x unrolling seen here (at least, from 100 or so elements until it becomes memory bound).
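For illustration, here is the kind of transform being requested, written out by hand in C (a sketch of accumulator splitting, not code from the report; the 4-way split matches the 4-cycle-latency / 1-fma-per-cycle reasoning above):

```c
/* Hand-split accumulators: four independent dependency chains, so
   consecutive FMAs no longer have to wait on each other's latency. */
double dot_split(const double *a, const double *b, long N) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long n = 0;
    for (; n + 4 <= N; n += 4) {
        s0 += a[n + 0] * b[n + 0];
        s1 += a[n + 1] * b[n + 1];
        s2 += a[n + 2] * b[n + 2];
        s3 += a[n + 3] * b[n + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; n < N; n++)          /* remainder iterations */
        s += a[n] * b[n];
    return s;
}
```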
[Bug middle-end/95899] -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899

--- Comment #2 from Chris Elrod ---
Interesting. Compiling with:

gcc -march=native -fvariable-expansion-in-unroller -Ofast -funroll-loops -S dot.c -o dot.s

yields:
```
.L4:
        vmovupd (%rdi,%r11), %zmm9
        vmovupd 64(%rdi,%r11), %zmm10
        vfmadd231pd     (%rsi,%r11), %zmm9, %zmm0
        vfmadd231pd     64(%rsi,%r11), %zmm10, %zmm1
        vmovupd 128(%rdi,%r11), %zmm11
        vmovupd 192(%rdi,%r11), %zmm12
        vmovupd 256(%rdi,%r11), %zmm13
        vfmadd231pd     128(%rsi,%r11), %zmm11, %zmm0
        vfmadd231pd     192(%rsi,%r11), %zmm12, %zmm1
        vmovupd 320(%rdi,%r11), %zmm14
        vmovupd 384(%rdi,%r11), %zmm15
        vmovupd 448(%rdi,%r11), %zmm4
        vfmadd231pd     256(%rsi,%r11), %zmm13, %zmm0
        vfmadd231pd     320(%rsi,%r11), %zmm14, %zmm1
        vfmadd231pd     384(%rsi,%r11), %zmm15, %zmm0
        vfmadd231pd     448(%rsi,%r11), %zmm4, %zmm1
        addq    $512, %r11
        cmpq    %r8, %r11
        jne     .L4
```

So the dependency chain has now been split in 2; 4 would be ideal. I'll try running benchmarks later to see how it does. FWIW, the original ran at between 20 and 25 GFLOPS from roughly N = 80 through N = 1024. The fastest versions I benchmarked climbed from around 20 to 50 GFLOPS over this range. So perhaps just splitting the dependency chain once can get it much of the way there.

Out of curiosity, what's the reason for this being off by default for everything but ppc? It seems like it should be turned on with `-funroll-loops`, given that breaking dependency chains is one of the primary ways unrolling can actually help performance.
[Bug fortran/88713] New: _gfortran_internal_pack@PLT prevents vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Bug ID: 88713
Summary: _gfortran_internal_pack@PLT prevents vectorization
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---

Created attachment 45350 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45350&action=edit
Fortran version of the vectorization test.

I am attaching Fortran and C++ translations of a simple working example. The C++ version is vectorized, while the Fortran version is not.

The code consists of two functions. One simply runs a for loop, calling the other function. The called function is vectorizable across loop iterations. g++ does this successfully. However, gfortran does not, because it repacks data with a call to _gfortran_internal_pack@PLT, so that it can no longer be vectorized across iterations.

I compiled with:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -fno-semantic-interposition -shared -fPIC -S vectorization_test.f90 -o gfortvectorization_test.s
g++ -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC -S vectorization_test.cpp -o gppvectorization_test.s

LLVM (via flang and clang) successfully vectorizes both versions.
[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #1 from Chris Elrod --- Created attachment 45351 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45351&action=edit C++ version of the vectorization test case.
[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #2 from Chris Elrod --- Created attachment 45352 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45352&action=edit gfortran assembly output
[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #3 from Chris Elrod --- Created attachment 45353 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45353&action=edit g++ assembly output
[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #6 from Chris Elrod ---
Created attachment 45356 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45356&action=edit
Code to demonstrate that transposing makes things slower.

Thomas Koenig, I am well aware that Fortran is column major. That is precisely why I chose the memory layout I did.

Benchmark of the "optimal" corrected code:

@benchmark gforttest($X32t, $BPP32t, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time:     20.647 μs (0.00% GC)
  median time:      20.860 μs (0.00% GC)
  mean time:        21.751 μs (0.00% GC)
  maximum time:     47.760 μs (0.00% GC)
  --
  samples:          1
  evals/sample:     1

Here is a benchmark (compiling with Flang) of my code, exactly as written (suboptimal) in the attachments:

@benchmark flangtest($X32, $BPP32, $N)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time:     658.329 ns (0.00% GC)
  median time:      668.012 ns (0.00% GC)
  mean time:        692.384 ns (0.00% GC)
  maximum time:     1.192 μs (0.00% GC)
  --
  samples:          1
  evals/sample:     161

That is 20 microseconds vs 670 nanoseconds. N was 1024, and the exact same data was used in both cases (but pretransposed, so I do not benchmark transposing). Benchmarking was done by compiling shared libraries and using `ccall` and BenchmarkTools from Julia. As indicated by the reports, the benchmark was run 10,000 times for gfortran, and 1.61 million times for Flang, to get accurate timings.

I compiled with (march=native is equivalent to march=skylake-avx512):

gfortran -Ofast -march=native -mprefer-vector-width=512 -fno-semantic-interposition -shared -fPIC vectorization_test_transposed.f90 -o libgfortvectorization_test.so
flang -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC vectorization_test.f90 -o libflangvectorization_test.so

Flang was built with LLVM 7.0.1.

The "suboptimal" code was close to 32 times faster than the "optimal" code. I was expecting it to be closer to 16 times faster, given the vector width.

To go into more detail:

"Fortran lays out the memory for that array as BPP(1,1), BPP(2,1), BPP(3,1), BPP(4,1), ..., BPP(1,2) so you are accessing your memory with a stride of n in the expressions BPP(i,1:3) and BPP(i,5:10). This is very inefficient anyway, vectorization would not really help in this case."

Yes, each call to fpdbacksolve accesses memory across strides. But fpdbacksolve itself cannot be vectorized well at all. What does work, however, is vectorizing across loop iterations. For example, imagine calling fpdbacksolve on this:

BPP(1:16,1), BPP(1:16,2), BPP(1:16,3), BPP(1:16,5), ..., BPP(1:16,10)

and then performing every single scalar operation defined in fpdbacksolve on an entire SIMD vector of floats (that is, on 16 floats) at a time. That would of course require inlining fpdbacksolve (which was achieved with -fno-semantic-interposition, as the assembly shows) and recompiling it.

Perhaps another way you can imagine it is that fpdbacksolve takes in 9 numbers (BPP(:,4) was unused), and returns 3 numbers. Because operations within it aren't vectorizable, we want to vectorize it ACROSS loop iterations, not within them. So to facilitate that, we have 9 vectors of contiguous inputs, and 3 vectors of contiguous outputs. Now all the input1s are stored contiguously, as are all the input2s, etc., allowing the inputs to be loaded efficiently into SIMD registers, and each loop iteration to calculate [SIMD vector width] of the outputs at a time.

Of course, it is inconvenient to handle a dozen vectors.
So if they all have the same length, we can just concatenate them together. I'll attach the assembly of both code examples as well. The assembly makes it clear that the "suboptimal" way was vectorized, and the "optimal" way was not. The benchmarks make it resoundingly clear that the vectorized ("suboptimal") version was dramatically faster. As is, this is a missed optimization, and gfortran is severely falling behind in performance versus LLVM-based Flang in the highest performance version of the code.
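To make the layout concrete, here is a minimal C sketch of the structure-of-arrays arrangement described above (my own illustration, with placeholder arithmetic standing in for fpdbacksolve): each input column is contiguous, so the compiler can process a SIMD vector's worth of loop iterations at once.

```c
/* Structure-of-arrays layout: input k of every problem instance is contiguous,
   so iterations i .. i+15 can be handled with plain SIMD loads and stores. */
void process(float *out, const float *in, long N) {
    /* in  holds 9 columns of length N: in[k*N + i]  is input  k+1 of instance i */
    /* out holds 3 columns of length N: out[k*N + i] is output k+1 of instance i */
    for (long i = 0; i < N; i++) {
        float x1 = in[0 * N + i];
        float x2 = in[1 * N + i];   /* ... and so on through in[8*N + i] */
        /* placeholder arithmetic; the real kernel is fpdbacksolve */
        out[0 * N + i] = x1 + x2;
        out[1 * N + i] = x1 - x2;
        out[2 * N + i] = x1 * x2;
    }
}
```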
[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #7 from Chris Elrod --- Created attachment 45357 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45357&action=edit Assembly generated by Flang compiler on the original version of the code. This is the main loop body in the Flang compiled version of the original code (starts line 132): .LBB1_8:# %vector.body # =>This Inner Loop Header: Depth=1 leaq(%rsi,%rbx,4), %r12 vmovups (%rcx,%r12), %zmm2 addq%rcx, %r12 leaq(%r12,%rcx), %rbp vmovups (%r11,%rbp), %zmm3 addq%r11, %rbp leaq(%rcx,%rbp), %r13 leaq(%rcx,%r13), %r8 leaq(%r8,%rcx), %r10 leaq(%r10,%rcx), %r14 vmovups (%rcx,%r14), %zmm4 vrsqrt14ps %zmm4, %zmm5 vmulps %zmm5, %zmm4, %zmm4 vfmadd213ps %zmm0, %zmm5, %zmm4 # zmm4 = (zmm5 * zmm4) + zmm0 vmulps %zmm1, %zmm5, %zmm5 vmulps %zmm4, %zmm5, %zmm4 .Ltmp1: .loc1 31 1 is_stmt 1# vectorization_test.f90:31:1 vmulps (%rcx,%r8), %zmm4, %zmm5 .loc1 32 1 # vectorization_test.f90:32:1 vmulps (%rcx,%r10), %zmm4, %zmm6 vmovups (%rcx,%r13), %zmm7 .loc1 33 1 # vectorization_test.f90:33:1 vfnmadd231ps%zmm6, %zmm6, %zmm7 # zmm7 = -(zmm6 * zmm6) + zmm7 vrsqrt14ps %zmm7, %zmm8 vmulps %zmm8, %zmm7, %zmm7 vfmadd213ps %zmm0, %zmm8, %zmm7 # zmm7 = (zmm8 * zmm7) + zmm0 vmulps %zmm1, %zmm8, %zmm8 vmulps %zmm7, %zmm8, %zmm7 vmovups (%rcx,%rbp), %zmm8 .loc1 35 1 # vectorization_test.f90:35:1 vfnmadd231ps%zmm5, %zmm6, %zmm8 # zmm8 = -(zmm6 * zmm5) + zmm8 vmulps %zmm8, %zmm7, %zmm8 vmulps %zmm5, %zmm5, %zmm9 vfmadd231ps %zmm8, %zmm8, %zmm9 # zmm9 = (zmm8 * zmm8) + zmm9 vsubps %zmm9, %zmm3, %zmm3 vrsqrt14ps %zmm3, %zmm9 vmulps %zmm9, %zmm3, %zmm3 vfmadd213ps %zmm0, %zmm9, %zmm3 # zmm3 = (zmm9 * zmm3) + zmm0 vmulps %zmm1, %zmm9, %zmm9 vmulps %zmm3, %zmm9, %zmm3 .loc1 39 1 # vectorization_test.f90:39:1 vmulps %zmm8, %zmm7, %zmm8 .loc1 40 1 # vectorization_test.f90:40:1 vmulps (%rcx,%r12), %zmm4, %zmm4 .loc1 39 1 # vectorization_test.f90:39:1 vmulps %zmm3, %zmm8, %zmm8 .loc1 41 1 # vectorization_test.f90:41:1 vmulps %zmm8, %zmm2, %zmm9 vfmsub231ps (%rsi,%rbx,4), %zmm3, %zmm9 # zmm9 = (zmm3 * mem) - zmm9 vmulps %zmm5, %zmm3, %zmm3 vfmsub231ps %zmm8, %zmm6, %zmm3 # zmm3 = (zmm6 * zmm8) - zmm3 vfmadd213ps %zmm9, %zmm4, %zmm3 # zmm3 = (zmm4 * zmm3) + zmm9 .loc1 42 1 # vectorization_test.f90:42:1 vmulps %zmm4, %zmm6, %zmm5 vmulps %zmm5, %zmm7, %zmm5 vfmsub231ps %zmm7, %zmm2, %zmm5 # zmm5 = (zmm2 * zmm7) - zmm5 .Ltmp2: .loc1 15 1 # vectorization_test.f90:15:1 vmovups %zmm3, (%rdi,%rbx,4) movq-16(%rsp), %rbp # 8-byte Reload vmovups %zmm5, (%rbp,%rbx,4) vmovups %zmm4, (%rax,%rbx,4) addq$16, %rbx cmpq%rbx, %rdx jne .LBB1_8 zmm registers are 64 byte registers. It vmovups from memory into registers, performs a series of arithmetics and inverse square roots on them, and then vmovups three of these 64 byte registers back into memory. That is the most efficient memory access pattern (as demonstrated empirically via benchmarks).
[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #8 from Chris Elrod --- Created attachment 45358 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45358&action=edit gfortran compiled assembly for the tranposed version of the original code. Here is the assembly for the loop body of the transposed version of the code, compiled by gfortran: .L8: vmovss 36(%rsi), %xmm0 addq$40, %rsi vrsqrtss%xmm0, %xmm2, %xmm2 addq$12, %rdi vmulss %xmm0, %xmm2, %xmm0 vmulss %xmm2, %xmm0, %xmm0 vmulss %xmm7, %xmm2, %xmm2 vaddss %xmm8, %xmm0, %xmm0 vmulss %xmm2, %xmm0, %xmm0 vmulss -8(%rsi), %xmm0, %xmm5 vmulss -12(%rsi), %xmm0, %xmm4 vmulss -32(%rsi), %xmm0, %xmm0 vmovaps %xmm5, %xmm3 vfnmadd213ss-16(%rsi), %xmm5, %xmm3 vmovaps %xmm4, %xmm2 vfnmadd213ss-20(%rsi), %xmm5, %xmm2 vmovss %xmm0, -4(%rdi) vrsqrtss%xmm3, %xmm1, %xmm1 vmulss %xmm3, %xmm1, %xmm3 vmulss %xmm1, %xmm3, %xmm3 vmulss %xmm7, %xmm1, %xmm1 vaddss %xmm8, %xmm3, %xmm3 vmulss %xmm1, %xmm3, %xmm3 vmulss %xmm3, %xmm2, %xmm6 vmovaps %xmm4, %xmm2 vfnmadd213ss-24(%rsi), %xmm4, %xmm2 vfnmadd231ss%xmm6, %xmm6, %xmm2 vrsqrtss%xmm2, %xmm10, %xmm10 vmulss %xmm2, %xmm10, %xmm1 vmulss %xmm10, %xmm1, %xmm1 vmulss %xmm7, %xmm10, %xmm10 vaddss %xmm8, %xmm1, %xmm1 vmulss %xmm10, %xmm1, %xmm1 vmulss %xmm1, %xmm3, %xmm2 vmulss %xmm6, %xmm2, %xmm2 vmovss -36(%rsi), %xmm6 vxorps %xmm9, %xmm2, %xmm2 vmulss %xmm6, %xmm2, %xmm10 vmulss %xmm2, %xmm5, %xmm2 vfmadd231ss -40(%rsi), %xmm1, %xmm10 vfmadd132ss %xmm4, %xmm2, %xmm1 vfnmadd132ss%xmm0, %xmm10, %xmm1 vmulss %xmm0, %xmm5, %xmm0 vmovss %xmm1, -12(%rdi) vsubss %xmm0, %xmm6, %xmm0 vmulss %xmm3, %xmm0, %xmm3 vmovss %xmm3, -8(%rdi) cmpq%rsi, %rax jne .L8 While Flang had a second loop of scalar code (to catch the N mod [SIMD vector width] remainder of the vectorized loop), there are no secondary loops in the gfortran code, meaning these must all be scalar operations (I have a hard time telling apart SSE from scalar code...). It looks similar in the operations it performs to Flang's vectorized loop, except that it is only performing operations on a single number at a time. Because to get efficient vectorization, we need corresponding elements to be contiguous (ie, all the input1s, all the input2s). We do not get any benefit from having all the different elements with the same index (the first input1 next to the first input2, next to the first input3...) being contiguous. The memory layout I used is performance-optimal, but is something that gfortran unfortunately often cannot handle automatically (without manual unrolling). This is why I filed a report on bugzilla.
[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #10 from Chris Elrod ---
(In reply to Thomas Koenig from comment #9)
> Hm.
>
> It would help if your benchmark was complete, so I could run it.
>

I don't suppose you happen to have, and be familiar with, Julia? If you are (or someone else here is), I'll attach the code to generate the fake data (the most important point is that columns 5:10 of BPP are the upper triangle of a 3x3 symmetric positive definite matrix). I have also already written a manually unrolled version that gfortran likes.

But I could write Fortran code to create an executable and run benchmarks. What are best practices? system_clock?

(In reply to Thomas Koenig from comment #9)
>
> However, what happens if you put in
>
> real, dimension(:) :: Uix
> real, dimension(:), intent(in) :: x
> real, dimension(:), intent(in) :: S
>
> ?
>
> gfortran should not pack then.

You're right! I wasn't able to follow this exactly, because it didn't want me to defer shape on Uix. Probably because it needs to compile a version of fpdbacksolve that can be called from the shared library? Interestingly, with that change, Flang failed to vectorize the code, but gfortran did. Compilers are finicky.

Flang, original:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time:     655.827 ns (0.00% GC)
  median time:      665.698 ns (0.00% GC)
  mean time:        689.967 ns (0.00% GC)
  maximum time:     1.061 μs (0.00% GC)
  --
  samples:          1
  evals/sample:     162

Flang, not specifying shape: # assembly shows it is using xmm
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time:     8.086 μs (0.00% GC)
  median time:      8.315 μs (0.00% GC)
  mean time:        8.591 μs (0.00% GC)
  maximum time:     20.299 μs (0.00% GC)
  --
  samples:          1
  evals/sample:     3

gfortran, transposed version (not vectorizable):
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time:     20.643 μs (0.00% GC)
  median time:      20.901 μs (0.00% GC)
  mean time:        21.441 μs (0.00% GC)
  maximum time:     54.103 μs (0.00% GC)
  --
  samples:          1
  evals/sample:     1

gfortran, not specifying shape:
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time:     1.290 μs (0.00% GC)
  median time:      1.316 μs (0.00% GC)
  mean time:        1.347 μs (0.00% GC)
  maximum time:     4.562 μs (0.00% GC)
  --
  samples:          1
  evals/sample:     10

Assembly confirms it is using zmm registers (but this time is much too fast not to be vectorized, anyway).
For why gfortran is still slower than the Flang version, here is the loop body: .L16: vmovups (%r10,%rax), %zmm0 vcmpps $4, %zmm0, %zmm4, %k1 vrsqrt14ps %zmm0, %zmm1{%k1}{z} vmulps %zmm0, %zmm1, %zmm2 vmulps %zmm1, %zmm2, %zmm0 vmulps %zmm5, %zmm2, %zmm2 vaddps %zmm6, %zmm0, %zmm0 vmulps %zmm2, %zmm0, %zmm0 vrcp14ps%zmm0, %zmm8 vmulps %zmm0, %zmm8, %zmm0 vmulps %zmm0, %zmm8, %zmm0 vaddps %zmm8, %zmm8, %zmm8 vsubps %zmm0, %zmm8, %zmm8 vmulps (%r8,%rax), %zmm8, %zmm9 vmulps (%r9,%rax), %zmm8, %zmm10 vmulps (%r12,%rax), %zmm8, %zmm8 vmovaps %zmm9, %zmm3 vfnmadd213ps0(%r13,%rax), %zmm9, %zmm3 vcmpps $4, %zmm3, %zmm4, %k1 vrsqrt14ps %zmm3, %zmm2{%k1}{z} vmulps %zmm3, %zmm2, %zmm3 vmulps %zmm2, %zmm3, %zmm1 vmulps %zmm5, %zmm3, %zmm3 vaddps %zmm6, %zmm1, %zmm1 vmulps %zmm3, %zmm1, %zmm1 vmovaps %zmm9, %zmm3 vfnmadd213ps(%rdx,%rax), %zmm10, %zmm3 vrcp14ps%zmm1, %zmm0 vmulps %zmm1, %zmm0, %zmm1 vmulps %zmm1, %zmm0, %zmm1 vaddps %zmm0, %zmm0, %zmm0 vsubps %zmm1, %zmm0, %zmm11 vmulps %zmm11, %zmm3, %zmm12 vmovaps %zmm10, %zmm3 vfnmadd213ps(%r14,%rax), %zmm10, %zmm3 vfnmadd231ps%zmm12, %zmm12, %zmm3 vcmpps $4, %zmm3, %zmm4, %k1 vrsqrt14ps %zmm3, %zmm1{%k1}{z} vmulps %zmm3, %zmm1, %zmm3 vmulps %zmm1, %zmm3, %zmm0 vmulps %zmm5, %zmm3, %zmm3 vmovups (%rcx,%rax), %zmm1 vaddps %zmm6, %zmm0, %zmm0 vmulps %zmm3, %zmm0, %zmm0 vrcp14ps%zmm0, %zmm2 vmulps %zmm0, %zmm2, %zmm0 vmulps %zmm0, %zmm2, %zmm0 vaddps %zmm2, %zmm2, %zmm2 vsubps %zmm0, %zmm2, %zmm0 vmulps %zmm0, %zmm11, %zmm3 vmulps %zmm12, %zmm3, %zmm3 vxorps %zmm7, %zmm3, %zmm3 vmulps %zmm1, %zmm3, %zmm2 vmulps %zmm3, %zmm9, %zmm3 vfnmadd231ps%zmm8, %zmm9, %zmm1 vfmadd231p
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #12 from Chris Elrod ---
Created attachment 45363 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45363&action=edit
Fortran program for running benchmarks.

Okay, thank you. I attached a Fortran program you can run to benchmark the code. It randomly generates valid inputs, and then times running the code 10^5 times. Finally, it reports the average time in microseconds. The SIMD times are the vectorized version, and the transposed times are the non-vectorized versions. In both cases, Flang produces much faster code. The results seem in line with what I got benchmarking shared libraries from Julia. I linked rt for access to the high resolution clock.

$ gfortran -Ofast -lrt -march=native -mprefer-vector-width=512 vectorization_tests.F90 -o gfortvectests
$ time ./gfortvectests
 Transpose benchmark completed in   22.7799759
 SIMD benchmark completed in   1.34003162
 All are equal: F
 All are approximately equal: F
 Maximum relative error   8.27204276E-05
 First record X:   1.02466011  -0.689792156  -0.404027045
 First record Xt:   1.02465975  -0.689791918  -0.404026985
 Second record X:  -0.546353579   3.37308086E-03   1.15257287
 Second record Xt:  -0.546353400   3.37312138E-03   1.15257275

real    0m2.418s
user    0m2.412s
sys     0m0.003s

$ flang -Ofast -lrt -march=native -mprefer-vector-width=512 vectorization_tests.F90 -o flangvectests
$ time ./flangvectests
 Transpose benchmark completed in    7.232568
 SIMD benchmark completed in   0.6596010
 All are equal:  F
 All are approximately equal:  F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542   1.568364   0.1006735
 First record Xt:   0.5867541   1.568363   0.1006735
 Second record X:   0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt:   0.2894785  -0.1510675  -9.3419187E-02

real    0m0.801s
user    0m0.794s
sys     0m0.005s
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #14 from Chris Elrod ---
It's not really reproducible across runs:

$ time ./gfortvectests
 Transpose benchmark completed in   22.7010765
 SIMD benchmark completed in   1.37529969
 All are equal: F
 All are approximately equal: F
 Maximum relative error   6.20566949E-04
 First record X:   0.188879877   0.377619117  -1.67841911E-02
 First record Xt:   0.10071   0.377619147  -1.67841911E-02
 Second record X:  -8.14126506E-02  -0.421755224  -0.199057430
 Second record Xt:  -8.14126655E-02  -0.421755224  -0.199057430

real    0m2.414s
user    0m2.406s
sys     0m0.005s

$ time ./flangvectests
 Transpose benchmark completed in    7.630980
 SIMD benchmark completed in   0.6455200
 All are equal:  F
 All are approximately equal:  F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542   1.568364   0.1006735
 First record Xt:   0.5867541   1.568363   0.1006735
 Second record X:   0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt:   0.2894785  -0.1510675  -9.3419187E-02

real    0m0.839s
user    0m0.832s
sys     0m0.006s

$ time ./gfortvectests
 Transpose benchmark completed in   22.0195961
 SIMD benchmark completed in   1.36087596
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.49150675E-04
 First record X:  -0.284217566   2.13768221E-02  -0.475293010
 First record Xt:  -0.284217596   2.13767942E-02  -0.475293040
 Second record X:   1.75664220E-02  -9.29893106E-02  -4.37139049E-02
 Second record Xt:   1.75664220E-02  -9.29893106E-02  -4.37139049E-02

real    0m2.344s
user    0m2.338s
sys     0m0.003s

$ time ./flangvectests
 Transpose benchmark completed in    7.881181
 SIMD benchmark completed in   0.6132510
 All are equal:  F
 All are approximately equal:  F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542   1.568364   0.1006735
 First record Xt:   0.5867541   1.568363   0.1006735
 Second record X:   0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt:   0.2894785  -0.1510675  -9.3419187E-02

real    0m0.861s
user    0m0.853s
sys     0m0.006s

It also probably wasn't quite right to call it "error", because it's comparing the values from the scalar and vectorized versions. Although it is unsettling if the differences are high; there should be an exact match, ideally.

Back to Julia, using mpfr (set to 252 bits of precision), and rounding to single precision for an exactly rounded answer...
X32gfort # calculated from gfortran X32flang # calculated from flang Xbf # mpfr, 252-bit precision ("BigFloat" in Julia) julia> Xbf32 = Float32.(Xbf) # correctly rounded result julia> function ULP(x, correct) # calculates ULP error x == correct && return 0 if x < correct error = 1 while nextfloat(x, error) != correct error += 1 end else error = 1 while prevfloat(x, error) != correct error += 1 end end error end ULP (generic function with 1 method) julia> ULP.(X32gfort, Xbf32)' 3×1024 Adjoint{Int64,Array{Int64,2}}: 7 1 1 8 3 2 1 1 1 27 4 1 4 6 0 0 2 0 2 4 0 7 1 1 3 8 4 2 2 … 1 0 2 0 0 1 2 3 1 5 1 1 0 0 0 2 3 2 1 2 3 1 0 1 1 0 2 0 41 4 2 1 1 6 1 0 1 1 2 2 0 0 3 0 1 0 3 1 1 0 1 1 0 0 3 1 0 0 0 1 0 1 0 1 0 1 1 4 1 1 0 2 0 1 0 1 0 0 0 1 2 1 1 1 0 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 1 julia> mean(ans) 1.9462890625 julia> ULP.(X32flang, Xbf32)' 3×1024 Adjoint{Int64,Array{Int64,2}}: 4 1 0 3 0 0 0 1 1 5 2 1 1 6 3 0 1 0 0 1 1 21 0 1 2 8 2 3 0 0 … 1 1 1 15 2 1 1 5 1 1 1 0 0 0 0 0 2 1 3 1 1 1 1 1 1 1 0 11 3 1 1 0 1 0 0 1 0 0 1 0 0 2 1 1 1 6 0 0 0 2 1 0 1 4 1 1 0 3 1 1 1 1 2 1 1 0 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 1 1 0 0 1 1 0 1 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 julia> mean(ans) 1.3388671875 So in that case, gfortran's version had about 1.95 ULP error on average, and Flang about 1.34 ULP error.
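For readers without Julia, the same ULP-distance measurement can be sketched in C (my translation of the ULP() function above, not code from the report):

```c
#include <math.h>

/* Number of single-precision values strictly between x and the correctly
   rounded reference; 0 means the answer is exact. Assumes finite, non-NaN
   inputs. */
int ulp_error(float x, float correct) {
    int n = 0;
    while (x != correct) {
        x = nextafterf(x, correct);  /* step one ULP toward the reference value */
        n++;
    }
    return n;
}
```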
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #18 from Chris Elrod ---
I can confirm that the inlined packing does allow gfortran to vectorize the loop. So allowing packing to inline does seem (to me) like an optimization well worth making. However, performance seems to be about the same as before, still close to 2x slower than Flang. There is definitely something interesting going on in Flang's SLP vectorization, though.

I defined the function:

#ifndef VECTORWIDTH
#define VECTORWIDTH 16
#endif

subroutine vpdbacksolve(Uix, x, S)
  real, dimension(VECTORWIDTH,3)             :: Uix
  real, dimension(VECTORWIDTH,3), intent(in) :: x
  real, dimension(VECTORWIDTH,6), intent(in) :: S
  real, dimension(VECTORWIDTH)               :: U11, U12, U22, U13, U23, U33, &
                                                Ui11, Ui12, Ui22, Ui33

  U33 = sqrt(S(:,6))
  Ui33 = 1 / U33
  U13 = S(:,4) * Ui33
  U23 = S(:,5) * Ui33
  U22 = sqrt(S(:,3) - U23**2)
  Ui22 = 1 / U22
  U12 = (S(:,2) - U13*U23) * Ui22
  U11 = sqrt(S(:,1) - U12**2 - U13**2)

  Ui11 = 1 / U11 ! u11
  Ui12 = - U12 * Ui11 * Ui22 ! u12

  Uix(:,3) = Ui33*x(:,3)
  Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) - (U13 * Ui11 + U23 * Ui12) * Uix(:,3)
  Uix(:,2) = Ui22*x(:,2) - U23 * Ui22 * Uix(:,3)

end subroutine vpdbacksolve

in a .F90 file, so that VECTORWIDTH can be set appropriately while compiling. I wanted to modify the Fortran file to benchmark these, but I'm pretty sure Flang cheated in the benchmarks. So, compiling into a shared library and benchmarking from Julia:

julia> @benchmark flangvtest($Uix, $x, $S)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time:     15.104 ns (0.00% GC)
  median time:      15.563 ns (0.00% GC)
  mean time:        16.017 ns (0.00% GC)
  maximum time:     49.524 ns (0.00% GC)
  --
  samples:          1
  evals/sample:     998

julia> @benchmark gfortvtest($Uix, $x, $S)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time:     24.394 ns (0.00% GC)
  median time:      24.562 ns (0.00% GC)
  mean time:        25.600 ns (0.00% GC)
  maximum time:     58.652 ns (0.00% GC)
  --
  samples:          1
  evals/sample:     996

That is over 60% faster for Flang, which would account for much, but not all, of the runtime difference in the actual loops. For comparison, the vectorized loop in processbpp covers 16 samples per iteration. The benchmarks above were with N = 1024, so 1024/16 = 64 iterations. For the three gfortran benchmarks (that averaged 100,000 runs of the loop), each loop iteration averaged about

1000 * (1.34003162 + 1.37529969 + 1.36087596) / (3*64) = 21.230246197916664

For Flang, that was:

1000 * (0.6596010 + 0.6455200 + 0.6132510) / (3*64) = 9.99152083334

so we have about 21 vs 10 ns for the loop body in gfortran vs Flang, respectively.

Comparing the asm between:
1. Flang processbpp loop body
2. Flang vpdbacksolve
3. gfortran processbpp loop body
4. gfortran vpdbacksolve

Here are a few things I notice.

1. gfortran always uses masked reciprocal square root operations, to make sure it only takes the square root of non-negative (positive?) numbers:

vxorps  %xmm5, %xmm5, %xmm5
...
vmovups (%rsi,%rax), %zmm0
vmovups 0(%r13,%rax), %zmm9
vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps      %zmm0, %zmm1{%k1}{z}

This might be avx512f specific? Either way, Flang does not use masks:

vmovups (%rcx,%r14), %zmm4
vrsqrt14ps      %zmm4, %zmm5

I'm having a hard time finding any information on what the performance impact of this may be. Agner Fog's instruction tables, for example, don't mention mask arguments for vrsqrt14ps.

2. Within the loop body, Flang has 0 unnecessary vmov(u/a)ps.
There are 8 total plus 3 "vmuls" and 1 vfmsub231ps accessing memory, for the 12 expected per loop iteration (fpdbacksolve's arguments are a vector of length 3 and another of length 6; it returns a vector of length 3). gfortran's loop body has 3 unnecessary vmovaps, copying register contents. gfortran's vpdbacksolve subroutine has 4 unnecessary vmovaps, copying register contents. Flang's vpdbacksolve subroutine has 13 unnecessary vmovaps, and a couple unnecessary memory accesses. Ouch! They also moved on/off (the stack?) vmovaps %zmm2, .BSS4+192(%rip) ... vmovaps %zmm5, .BSS4+320(%rip) ... vmovaps .BSS4+192(%rip), %zmm5 ... #zmm5 is overwritten in here, I just mean to show the sort of stuff that goes on vmulps .BSS4+320(%rip), %zmm5, %zmm0 Some of those moves also don't get used again, and some other things are just plain weird: vxorps %xmm3, %xmm3, %xmm3 vfnmsub231ps%zmm2, %zmm0, %zmm3 # zmm3 = -(zmm0 * zmm2) - zmm3 vmovaps %zmm3, .BSS4+576(%rip) Like, why zero out the 128 bit portion of zmm3 ? I verifie
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #19 from Chris Elrod ---
To add a little more: I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in Julia. Without adding a Newton step, the answers are wrong beyond just a couple of significant digits. With the Newton step, the answers are correct. My point is that LLVM-compiled code (Clang/Flang/ispc) is definitely adding the Newton step; it gets the correct answer.

That leaves my best guess for the performance difference as owing to the masked "vrsqrt14ps" that gcc is using:

vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps      %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea? Edit the asm to remove the vcmpps and mask, compile the asm with gcc, and benchmark it?
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #20 from Chris Elrod ---
To add a little more: I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in Julia. Without adding a Newton step, the answers are wrong beyond just a couple of significant digits. With the Newton step, the answers are correct. My point is that LLVM-compiled code (Clang/Flang/ispc) is definitely adding the Newton step; it gets the correct answer.

That leaves my best guess for the performance difference as owing to the masked "vrsqrt14ps" that gcc is using (g++ does this too):

vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps      %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea? Edit the asm to remove the vcmpps and mask, compile the asm with gcc, and benchmark it?

Okay, I just tried playing around with flags and looking at asm. I compiled with:

g++ -O3 -ffinite-math-only -fexcess-precision=fast -fno-math-errno -fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math -fno-rounding-math -fno-signaling-nans -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o libgppvectorization_test.so vectorization_test.cpp

which is basically all the flags implied by "-ffast-math", except "-funsafe-math-optimizations". This does include the flags implied by the unsafe-math optimizations, just not that flag itself. This list can be simplified to (only "-fno-math-errno" is needed):

g++ -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o libgppvectorization_test.so vectorization_test.cpp

or

gfortran -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o libgfortvectorization_test.so vectorization_test.f90

This results in the following:

vsqrtps (%r8,%rax), %zmm0
vdivps  %zmm0, %zmm7, %zmm0

i.e., vsqrt and a division, rather than the masked reciprocal square root. With N = 2827, that speeds gfortran and g++ up from about 4.3 microseconds to 3.5 microseconds. For comparison, Clang takes about 2 microseconds, and Flang/ispc/awful-looking unsafe Rust take 2.3-2.4 microseconds, using vrsqrt14ps (without a mask) and a Newton step instead of vsqrtps followed by a division.

So, "-funsafe-math-optimizations" results in a regression here.
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #22 from Chris Elrod ---
Okay. I did that, and the time went from about 4.25 microseconds down to 4.0 microseconds. So that is an improvement, but it accounts for only a small part of the difference with the LLVM compilers. -O3 -fno-math-errno was about 3.5 microseconds, so -funsafe-math-optimizations still results in a regression in this code. 3.5 microseconds is roughly as fast as you can get with vsqrt and div.

My best guess now is that gcc does a lot more to improve the accuracy of vsqrt. If I understand correctly, these are all the involved instructions:

vmovaps .LC2(%rip), %zmm7
vmovaps .LC3(%rip), %zmm6
# for loop begins
vrsqrt14ps      %zmm1, %zmm2    # comparison and mask removed
vmulps  %zmm1, %zmm2, %zmm0
vmulps  %zmm2, %zmm0, %zmm1
vmulps  %zmm6, %zmm0, %zmm0
vaddps  %zmm7, %zmm1, %zmm1
vmulps  %zmm0, %zmm1, %zmm1
vrcp14ps        %zmm1, %zmm0
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm1
vaddps  %zmm0, %zmm0, %zmm0
vsubps  %zmm1, %zmm0, %zmm0
vfnmadd213ps    (%r10,%rax), %zmm0, %zmm2

If I understand this correctly:

zmm2 = (approx) 1 / sqrt(zmm1)
zmm0 = zmm1 * zmm2 = (approx) sqrt(zmm1)
zmm1 = zmm0 * zmm2 = (approx) 1
zmm0 = zmm6 * zmm0 = (approx) constant6 * sqrt(zmm1)
zmm1 = zmm7 * zmm1 = (approx) constant7
zmm1 = zmm0 * zmm1 = (approx) constant6 * constant7 * sqrt(zmm1)
zmm0 = (approx) 1 / zmm1 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7)
zmm1 = zmm1 * zmm0 = (approx) 1
zmm1 = zmm1 * zmm0 = (approx) 1 / sqrt(zmm1) * 1 / (constant6 * constant7)
zmm0 = 2 * zmm0 = (approx) 2 / sqrt(zmm1) * 1 / (constant6 * constant7)
zmm0 = zmm1 - zmm0 = (approx) -1 / sqrt(zmm1) * 1 / (constant6 * constant7)

which implies that constant6 * constant7 = approximately -1?

LLVM seems to do a much simpler / briefer update of the output of vrsqrt. When I implemented a vrsqrt intrinsic in a Julia library, I just looked at Wikipedia and did (roughly):

constant1 = -0.5
constant2 = 1.5
zmm2 = (approx) 1 / sqrt(zmm1)
zmm3 = constant1 * zmm1
zmm1 = zmm2 * zmm2
zmm3 = zmm3 * zmm1 + constant2
zmm2 = zmm2 * zmm3

I am not a numerical analyst, so I can't comment on the relative validity or accuracy of these approaches. I also don't know what LLVM 7+ does; LLVM 6 doesn't use vrsqrt. I would be interested in reading explanations or discussions, if any are available.
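Written out as C intrinsics, the simple Wikipedia-style refinement above looks roughly like this (a sketch assuming AVX-512F; not code from the bug report):

```c
#include <immintrin.h>

/* One Newton-Raphson refinement of the hardware estimate:
   r1 = r0 * (1.5 - 0.5 * a * r0 * r0), with r0 = vrsqrt14ps(a). */
__m512 rsqrt_nr(__m512 a) {
    const __m512 neg_half     = _mm512_set1_ps(-0.5f);
    const __m512 three_halves = _mm512_set1_ps(1.5f);
    __m512 r0 = _mm512_rsqrt14_ps(a);                         /* ~14-bit estimate    */
    __m512 t  = _mm512_mul_ps(_mm512_mul_ps(a, neg_half),
                              _mm512_mul_ps(r0, r0));         /* -0.5 * a * r0^2     */
    return _mm512_mul_ps(r0, _mm512_add_ps(three_halves, t)); /* r0*(1.5-0.5*a*r0^2) */
}
```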
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #24 from Chris Elrod --- The dump looks like this: vect__67.78_217 = SQRT (vect__213.77_225); vect_ui33_68.79_248 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 } / vect__67.78_217; vect__71.80_249 = vect__246.59_65 * vect_ui33_68.79_248; vect_u13_73.81_250 = vect__187.71_14 * vect_ui33_68.79_248; vect_u23_75.82_251 = vect__200.74_5 * vect_ui33_68.79_248; so the vrsqrt optimization happens later. g++ shows the same problems with weird code generation. However this: /* sqrt(a) = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */ does not match this: vrsqrt14ps %zmm1, %zmm2 # comparison and mask removed vmulps %zmm1, %zmm2, %zmm0 vmulps %zmm2, %zmm0, %zmm1 vmulps %zmm6, %zmm0, %zmm0 vaddps %zmm7, %zmm1, %zmm1 vmulps %zmm0, %zmm1, %zmm1 vrcp14ps%zmm1, %zmm0 vmulps %zmm1, %zmm0, %zmm1 vmulps %zmm1, %zmm0, %zmm1 vaddps %zmm0, %zmm0, %zmm0 vsubps %zmm1, %zmm0, %zmm0 Recommendations on the next place to look for what's going on?
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #26 from Chris Elrod --- > You can try enabling -mrecip to see RSQRT in .optimized - there's > probably late 1/sqrt optimization on RTL. No luck. The full commands I used: gfortran -Ofast -mrecip -S -fdump-tree-optimized -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o gfortvectorizationdump.s vectorization_test.f90 g++ -mrecip -Ofast -fdump-tree-optimized -S -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o gppvectorization_test.s vectorization_test.cpp g++'s output was similar: vect_U33_60.31_372 = SQRT (vect_S33_59.30_371); vect_Ui33_61.32_374 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 } / vect_U33_60.31_372; vect_U13_62.33_375 = vect_S13_47.24_359 * vect_Ui33_61.32_374; vect_U23_63.34_376 = vect_S23_53.27_365 * vect_Ui33_61.32_374; and it has the same assembly as gfortran for the rsqrt: vcmpps $4, %zmm0, %zmm5, %k1 vrsqrt14ps %zmm0, %zmm1{%k1}{z} vmulps %zmm0, %zmm1, %zmm2 vmulps %zmm1, %zmm2, %zmm0 vmulps %zmm6, %zmm2, %zmm2 vaddps %zmm7, %zmm0, %zmm0 vmulps %zmm2, %zmm0, %zmm0 vrcp14ps%zmm0, %zmm10 vmulps %zmm0, %zmm10, %zmm0 vmulps %zmm0, %zmm10, %zmm0 vaddps %zmm10, %zmm10, %zmm10 vsubps %zmm0, %zmm10, %zmm10
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #27 from Chris Elrod --- g++ -mrecip=all -O3 -fno-signed-zeros -fassociative-math -freciprocal-math -fno-math-errno -ffinite-math-only -fno-trapping-math -fdump-tree-optimized -S -march=native -shared -fPIC -mprefer-vector-width=512 -fno-semantic-interposition -o gppvectorization_test.s vectorization_test.cpp is not enough to get vrsqrt. I need -funsafe-math-optimizations for the instruction to appear in the asm.
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #28 from Chris Elrod ---
Created attachment 45501 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45501&action=edit
Minimum working example of the rsqrt problem. Can be compiled with: gcc -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o rsqrt.s

I attached a minimum working example, demonstrating the problem of excessive code generation for the reciprocal square root, in the file rsqrt.c. You can compile it with:

gcc -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o rsqrt.s
clang -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o rsqrt.s

Or compare the asm of both on Godbolt: https://godbolt.org/z/c7Z0En

For gcc:

vmovups (%rsi), %zmm0
vxorps  %xmm1, %xmm1, %xmm1
vcmpps  $4, %zmm0, %zmm1, %k1
vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
vmulps  %zmm0, %zmm1, %zmm2
vmulps  %zmm1, %zmm2, %zmm0
vmulps  .LC1(%rip), %zmm2, %zmm2
vaddps  .LC0(%rip), %zmm0, %zmm0
vmulps  %zmm2, %zmm0, %zmm0
vrcp14ps        %zmm0, %zmm1
vmulps  %zmm0, %zmm1, %zmm0
vmulps  %zmm0, %zmm1, %zmm0
vaddps  %zmm1, %zmm1, %zmm1
vsubps  %zmm0, %zmm1, %zmm0
vmovups %zmm0, (%rdi)

For Clang:

vmovups (%rsi), %zmm0
vrsqrt14ps      %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm0
vfmadd213ps     .LCPI0_0(%rip){1to16}, %zmm1, %zmm0    # zmm0 = (zmm1 * zmm0) + mem
vmulps  .LCPI0_1(%rip){1to16}, %zmm1, %zmm1
vmulps  %zmm0, %zmm1, %zmm0
vmovups %zmm0, (%rdi)

Clang looks like it is doing

/* rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

where .LCPI0_0(%rip) = -3.0 and .LCPI0_1(%rip) = -0.5. gcc is doing much more, and something fairly different.
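The attachment itself isn't reproduced above; judging from the generated asm (one 16-wide load and store) and from the r[i] = sqrtf(a[i]) and r[i] = 1/a[i] loops mentioned in comment #35, it presumably looks something like this sketch (my reconstruction, not the actual rsqrt.c):

```c
#include <math.h>

/* With -Ofast, the 1.0f / sqrtf(a[i]) loop should collapse into a single
   16-wide rsqrt sequence, which is where gcc and clang diverge. */
void rsqrt(float *restrict r, const float *restrict a) {
    for (int i = 0; i < 16; i++)
        r[i] = 1.0f / sqrtf(a[i]);
}
```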
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #30 from Chris Elrod ---
(In reply to Marc Glisse from comment #29)
> The main difference I can see is that clang computes rsqrt directly, while
> gcc first computes sqrt and then computes the inverse. Also gcc seems afraid
> of getting NaN for sqrt(0) so it masks out this value. ix86_emit_swsqrtsf in
> gcc/config/i386/i386.c seems like a good place to look at.

gcc calculates the rsqrt directly with -funsafe-math-optimizations and a couple of other flags (or just -ffast-math):

vmovups (%rsi), %zmm0
vxorps  %xmm1, %xmm1, %xmm1
vcmpps  $4, %zmm0, %zmm1, %k1
vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #32 from Chris Elrod --- (In reply to Marc Glisse from comment #31) > (In reply to Chris Elrod from comment #30) > > gcc caclulates the rsqrt directly > > No, vrsqrt14ps is just the first step in calculating sqrt here (slightly > different formula than rsqrt). vrcp14ps shows that it is computing an > inverse later. What we need to understand is why gcc doesn't try to generate > rsqrt (which would also have vrsqrt14ps, but a slightly different formula > without the comparison with 0 and masking, and without needing an inversion > afterwards). Okay, I think I follow you. You're saying instead of doing this (from rguenther), which we want (also without the comparison to 0 and masking, as you note): /* rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */ it is doing this, which also uses the rsqrt instruction: /* sqrt(a) = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */ and then calculating an inverse approximation of that? The approximate sqrt, and then approximate reciprocal approximations were slower on my computer than just vsqrt followed by div.
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #35 from Chris Elrod --- > rsqrt: > .LFB12: > .cfi_startproc > vrsqrt28ps (%rsi), %zmm0 > vmovups %zmm0, (%rdi) > vzeroupper > ret > > (huh? isn't there a NR step missing?) > I assume because vrsqrt28ps is much more accurate than vrsqrt14ps, it wasn't considered necessary. Unfortunately, march=skylake-avx512 does not have -mavx512er, and therefore should use the less accurate vrsqrt14ps + NR step. I think vrsqrt14pd/s are -mavx512f or -mavx512vl > Without -mavx512er, we do not have an expander for rsqrtv16sf2, and without > that I don't know how the machinery can guess how to use rsqrt (there are > probably ways). Looking at the asm from only r[i] = sqrtf(a[i]): vmovups (%rsi), %zmm1 vxorps %xmm0, %xmm0, %xmm0 vcmpps $4, %zmm1, %zmm0, %k1 vrsqrt14ps %zmm1, %zmm0{%k1}{z} vmulps %zmm1, %zmm0, %zmm1 vmulps %zmm0, %zmm1, %zmm0 vmulps .LC1(%rip), %zmm1, %zmm1 vaddps .LC0(%rip), %zmm0, %zmm0 vmulps %zmm1, %zmm0, %zmm0 vmovups %zmm0, (%rdi) vs the asm from only r[i] = 1 /a[i]: vmovups (%rsi), %zmm1 vrcp14ps%zmm1, %zmm0 vmulps %zmm1, %zmm0, %zmm1 vmulps %zmm1, %zmm0, %zmm1 vaddps %zmm0, %zmm0, %zmm0 vsubps %zmm1, %zmm0, %zmm0 vmovups %zmm0, (%rdi) it looks like the expander is there for sqrt, and for inverse, and we're just getting both one after the other. So it does look like I could benchmark which one is slower than the regular instruction on my platform, if that would be useful.
[Bug tree-optimization/88713] Vectorized code slow vs. flang
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713 --- Comment #54 from Chris Elrod --- I commented elsewhere, but I built trunk a few days ago with H.J.Lu's patches (attached here) and Thomas Koenig's inlining patches. With these patches, g++ and all versions of the Fortran code produced excellent asm, and the code performed excellently in benchmarks. Once those are merged, the problems reported here will be solved. I saw Thomas Koenig's packing changes will wait for gcc-10. What about H.J.Lu's fixes to rsqrt and allowing FMA use in those sections?
[Bug rtl-optimization/86625] New: funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

Bug ID: 86625
Summary: funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling
Product: gcc
Version: 8.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: elrodc at gmail dot com
Target Milestone: ---

I wasn't sure where to put this. I posted in the Fortran gcc mailing list initially, but was redirected to bugzilla. I specified rtl-optimization as the component because the manually unrolled version avoids register spills yet has 13 (unnecessary?) vmovapd instructions between registers, and the loop version is a behemoth of moving data in, out, and between registers. The failure of the loop might also fall under tree optimization?

For that reason, completely unrolling the loop actually results in over 3x less assembly than the loop. Unfortunately, funroll-loops did not completely unroll, making the manual unrolling necessary. Assembly is identical whether or not funroll-loops is used. Adding the directive:

!GCC$ unroll 31

does lead to complete unrolling, but also to the use of xmm registers instead of zmm, and thus massive amounts of spilling (and probably extremely slow code -- did not benchmark).

Here is the code (a 16x32 * 32x14 matrix multiplication kernel for avx-512 [the 32 is arbitrary]), sans directive:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90

I compiled with:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s

resulting in this assembly (without the directive):
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s

The manually unrolled version has 13 vmovapd instructions that look unnecessary (like a vfmadd should have been able to place the answer in the correct location?). 8 of them move from one register to another, and 5 look something like:

vmovapd %zmm20, 136(%rsp)

I suspect there should ideally be 0 of these? If not, I'd be interested in learning more about why. This at least seems like an RTL optimization bug/question. The rest of the generated code looks great to me: repeated blocks of only

2x vmovupd
7x vbroadcastsd
14x vfmadd231pd

In the looped code, however, the `vfmadd231pd` instructions are a rare sight between all the register management. The loop code begins at line 1475 in the assembly file. While the manually unrolled code benchmarked at 135 ns, the looped version took 1.4 microseconds on my computer.

Trying to understand more about what it's doing:
- While the manually unrolled code has the expected 868 = (16/8)*(32-1)*14 vfmadds for the fully unrolled code, the looped version has two blocks of 224 = (16/8)*X*14, where X = 8, indicating it is partially unrolling the loop. One of them is using xmm registers instead of zmm, so it looks like the compiler mistakenly thinks smaller vectors may be needed to clean up something? (Maybe it is trying to vectorize across loop iterations, rather than within, in some weird way? I don't know why it'd be using all those vpermt2pd otherwise.)
[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625 --- Comment #2 from Chris Elrod --- Created attachment 44418 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44418&action=edit Code to reproduce slow vectorization pattern and unnecessary loads & stores (Sorry if this goes to the bottom instead of top, trying to attach a file in place of a link, but I can't edit the old comment.) Attached is sample code to reproduce the problem in gcc 8.1.1 As observed by amonakov, compiling with -O3/-Ofast reproduces the full problem, eg: gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops -S kernels.f90 -o kernels.s Compiling with -O3 -fdisable-tree-cunrolli or -O2 -ftree-vectorize fixes the incorrect vectorization pattern, but leave a lot of unnecessary broadcast loads and stores.
[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625 --- Comment #4 from Chris Elrod --- Created attachment 44423 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44423&action=edit 8x16 * 16x6 kernel for avx2. Here is a scaled down version to reproduce most of the the problem for avx2-capable architectures. I just used march=haswell, but I think most recent architectures fall under this. For some, like zenv1, you may need to add -mprefer-vector-width=256. To get the inefficiently vectorized loop: gfortran -march=haswell -Ofast -shared -fPIC -S kernelsavx2.f90 -o kernelsavx2bad.s To get only the unnecessary loads/stores, use: gfortran -march=haswell -O2 -ftree-vectorize -shared -fPIC -S kernelsavx2.f90 -o kernelsavx2.s This file compiles instantly, while with `O3` the other one can take a couple seconds. However while it does `vmovapd` between registers, it no longer spills into the stack in the manually unrolled version, like the avx512 kernel does.
[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

--- Comment #5 from Chris Elrod ---
Created attachment 44424 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44424&action=edit
Smaller avx512 kernel that still spills into the stack

This generated 18 total `vmovapd` (I think there'd ideally be 0) when compiled with:

gfortran -march=skylake-avx512 -mprefer-vector-width=512 -O2 -ftree-vectorize -shared -fPIC -S kernels16x32x13.f90 -o kernels16x32x13.s

4 of these moved onto the stack, and one moved from the stack back into a register. (The others were transferred from the stack within vfmadd instructions: `vfmadd213pd 72(%rsp), %zmm11, %zmm15`.)

Similar to the larger kernel, using `-O3` instead of `-O2 -ftree-vectorize` eliminated two of the `vmovapd` instructions between registers, but none of the spills.
[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625 --- Comment #6 from Chris Elrod --- (In reply to Richard Biener from comment #3) > If you see spilling on the manually unrolled loop register pressure is > somehow an issue. In the matmul kernel: D = A * X where D is 16x14, A is 16xN, and X is Nx14 (N arbitrarily set to 32) The code holds all of D in registers. 16x14 doubles, and 8 doubles per register mean 28 of the 32 registers. Then, it loads 1 column of A at a time (2 more registers), and broadcasts elements from the corresponding row in each column of X, updating the corresponding column of D with fma instructions. By broadcasting 2 at a time, it should be using exactly 32 registers. For the most part, that is precisely what the manually unrolled code is doing for each column of A. However, for column 23 (2944/128 = 23) with -O3 and column 25 for -O2 of the 32 columns of A, it suddenly spills (all the stack accesses happen for the same column, and none of the others), even though the process is identical for each column. Switching to a smaller 16x13 output, freeing up 2 registers to allow 4 broadcast loads at a time, still resulted in 4 spills (down from 5) for only column #23 or #25. I couldn't reproduce the spills in the avx2 kernel. The smaller kernel has an 8x6 output, taking up 12 registers. Again leaving 4 total registers, 2 for a column of A, and 2 broadcasts from X at a time. So it's the same pattern. The smaller kernel does reproduce the problems with the loops. Both -O3 without `-fdisable-tree-cunrolli` leading to a slow vectorization scheme, and with it or `-O2 -ftree-vectorize` producing repetitive loads and stores within the loop.
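For illustration, the register-blocking scheme described in this comment looks roughly like the following C sketch with AVX-512 intrinsics (my own, not the Fortran kernel from the repository; the names and the column-major indexing are assumptions):

```c
#include <immintrin.h>

/* D (16x14) stays in 28 zmm accumulators; each step loads one 16-element
   column of A (2 more registers) and broadcasts entries of X into FMAs. */
void kernel_16x14(double *restrict D, const double *restrict A,
                  const double *restrict X, long N) {
    __m512d c[2][14];                                 /* all of D: 2 x 14 = 28 registers */
    for (int j = 0; j < 14; j++)
        c[0][j] = c[1][j] = _mm512_setzero_pd();
    for (long k = 0; k < N; k++) {                    /* A is 16xN, X is Nx14, column major */
        __m512d a0 = _mm512_loadu_pd(A + 16 * k);     /* rows 1:8  of column k+1 of A */
        __m512d a1 = _mm512_loadu_pd(A + 16 * k + 8); /* rows 9:16 of column k+1 of A */
        for (int j = 0; j < 14; j++) {
            __m512d x = _mm512_set1_pd(X[k + j * N]); /* broadcast X(k+1, j+1) */
            c[0][j] = _mm512_fmadd_pd(a0, x, c[0][j]);
            c[1][j] = _mm512_fmadd_pd(a1, x, c[1][j]);
        }
    }
    for (int j = 0; j < 14; j++) {                    /* write D back, column major */
        _mm512_storeu_pd(D + 16 * j,     c[0][j]);
        _mm512_storeu_pd(D + 16 * j + 8, c[1][j]);
    }
}
```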
[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625 --- Comment #7 from Chris Elrod --- (In reply to Chris Elrod from comment #6) > However, for column 23 (2944/128 = 23) with -O3 and column 25 for -O2 of the > 32 columns of A Correction: it was the 16x13 version that used stack data after loading column 25 instead of 23 of A.
[Bug fortran/57992] Pointless packing of contiguous arrays for simply contiguous functions results as actual arguments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57992

Chris Elrod changed:
           What    |Removed |Added
                 CC|        |elrodc at gmail dot com

--- Comment #3 from Chris Elrod ---
Created attachment 45014 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45014&action=edit
Code that produces lots of unnecessary and performance-crippling _gfortran_internal_pack@PLT and _gfortran_internal_unpack@PLT

I compiled with:
```
gfortran -S -Ofast -fno-repack-arrays -fdisable-tree-cunrolli -fno-semantic-interposition -march=skylake-avx512 -mprefer-vector-width=512 -mveclibabi=svml -shared -fPIC -finline-limit=8192 gfortran_internal_pack_test.f90 -o gfortran_internal_pack_test.s
```
using

$ gfortran --version
GNU Fortran (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
[Bug fortran/57992] Pointless packing of contiguous arrays for simply contiguous functions results as actual arguments
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57992

--- Comment #4 from Chris Elrod ---
Created attachment 45016 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45016&action=edit
Assembly from compiling gfortran_internal_pack_test.f90

The code takes in sets of 3-length vectors and 3x3 symmetric positive definite matrices (storing only the upper triangle). These are stored across columns. That is, element 1 of the first and second vectors are stored contiguously, while elements 1 and 2 of each vector are a stride apart. The goal is to factor each PD matrix into S = U*U' (not the Cholesky), and then compute U^{-1} * x.

There is a function that operates on one vector and matrix at a time (pdbacksolve). Another function operates on blocks of 16 at a time (vpdbacksolve). Three versions of functions operate on these:

Version 1 simply loops over the inputs, calling the scalar version.
Version 2 loops over blocks of 16 at a time, calling the blocked version.
Version 3 manually inlines the function into the do loop.

I used compiler options to ensure that all the functions were inlined into callers, so that ideally Version 2 and Version 3 would be identical. The attached assembly shows that they are not.

Letting N = 1024 total vectors and matrices, on my computer Version 1 takes 97 microseconds to run, Version 2 takes 35 microseconds, and Version 3 takes 1.4 microseconds. These differences are dramatic!

Version 1 failed to vectorize and was littered with _gfortran_internal_pack@PLT and _gfortran_internal_unpack@PLT. Version 2 vectorized, but also had all the pack/unpacks. Version 3 had neither. Data layout was the same (and optimal for vectorization) in all three cases. [Also worth pointing out that without -fdisable-tree-cunrolli, Version 3 takes 9 microseconds.]

For what it is worth, ifort takes 0.82, 1.5, and 0.88 microseconds respectively. I'd hope it is possible for gfortran's Version 1 and Version 2 to match its Version 3 (1.4 microseconds) rather than being 70x and 25x slower. 1.4 microseconds is a good time, and the best I managed to achieve with explicit vectorization in Julia.

I could file a different bug report, because the failed vectorization of Version 1 is probably a different issue. But this is another example of unnecessary packing/unpacking.
[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929

Chris Elrod changed:
           What    |Removed |Added
                 CC|        |elrodc at gmail dot com

--- Comment #29 from Chris Elrod ---
"RESOLVED FIXED". I haven't tried this with `target`, but avx512bw does not work with target_clones on gcc 11.2, while it does with clang 14.
[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929 --- Comment #30 from Chris Elrod ---
> #if defined(__clang__)
> #define MULTIVERSION                                                        \
>   __attribute__((target_clones("avx512dq", "avx2", "default")))
> #else
> #define MULTIVERSION                                                        \
>   __attribute__((target_clones(                                             \
>       "arch=skylake-avx512,arch=cascadelake,arch=icelake-client,arch="      \
>       "tigerlake,"                                                          \
>       "arch=icelake-server,arch=sapphirerapids,arch=cooperlake",            \
>       "avx2", "default")))
> #endif
For example, I can do something like this, but gcc produces a ton of unnecessary duplicates for each of the avx512dq architectures. There must be a better way.
[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929 --- Comment #32 from Chris Elrod --- Ha, I accidentally misreported my gcc version. I was already using 12.1.1. Using x86-64-v4 worked, excellent! Thanks.
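Presumably the working form is along these lines (a sketch, not the reporter's actual macro; gcc 12 accepts micro-architecture levels such as `arch=x86-64-v4` in target_clones, which avoids enumerating every AVX-512 CPU):
```
// Sketch: one clone per micro-architecture level instead of per -march= CPU.
#if defined(__clang__)
#define MULTIVERSION \
  __attribute__((target_clones("avx512dq", "avx2", "default")))
#else
#define MULTIVERSION \
  __attribute__((target_clones("arch=x86-64-v4", "arch=x86-64-v3", "default")))
#endif

MULTIVERSION double sum(const double *x, long n) {
  double s = 0.0;
  for (long i = 0; i < n; ++i) s += x[i];
  return s;
}
```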
[Bug target/114276] New: Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276 Bug ID: 114276 Summary: Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong` Product: gcc Version: 13.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- Created attachment 57651 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57651&action=edit test file I'm not sure how to categorize the issue, so I picked "target" as it occurs for x86_64 when using aligned moves on 64-byte avx512 vectors. `-std=c++23` also reproduces the problem. I am using: > g++ --version > g++ (GCC) 13.2.1 20231205 (Red Hat 13.2.1-6) > Copyright (C) 2023 Free Software Foundation, Inc. > This is free software; see the source for copying conditions. There is NO > warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. The attached file is: > #include > #include > > template > using Vec [[gnu::vector_size(W * sizeof(T))]] = T; > > auto foo() { > Vec<8, int64_t> ret{}; > return ret; > } > > int main() { > foo(); > return 0; > } I have attached this file. On a skylake-avx512 CPU, I get > g++ -std=gnu++23 -march=skylake-avx512 -fstack-protector-strong -O0 -g > -mprefer-vector-width=512 -fsanitize=address,undefined -fsanitize-trap=all > simdvecalign.cpp && ./a.out AddressSanitizer:DEADLYSIGNAL = ==36238==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0040125c bp 0x7ffdf88a1cb0 sp 0x7ffdf88a1bc0 T0) ==36238==The signal is caused by a READ memory access. ==36238==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used. #0 0x40125c in foo() /home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:8 #1 0x4012d1 in main /home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:13 #2 0x7f296b846149 in __libc_start_call_main (/lib64/libc.so.6+0x28149) (BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756) #3 0x7f296b84620a in __libc_start_main_impl (/lib64/libc.so.6+0x2820a) (BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756) #4 0x4010a4 in _start (/home/chriselrod/Documents/progwork/cxx/experiments/a.out+0x4010a4) (BuildId: 765272b0173968b14f4306c8d4a37fcb18733889) AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV /home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:8 in foo() ==36238==ABORTING fish: Job 1, './a.out' terminated by signal SIGABRT (Abort) However, if I remove any of `-std=gnu++23`, `-fsantize=address`, or `-fstack-protector-strong`, the code runs without a problem. Using 32 byte vectors instead of 64 byte also allows it to work. I also used `-S` to look at the assembly. When I edit the two lines: > vmovdqa64 %zmm0, -128(%rdx) > .loc 1 9 10 > vmovdqa64 -128(%rdx), %zmm0 swapping `vmovdqa64` for `vmovdqu64`, the code runs as intended. > g++ -fsanitize=address simdvecalign.s # using vmovdqu64 > ./a.out > g++ -fsanitize=address simdvecalign.s # reverted back to vmovdqa64 > ./a.out AddressSanitizer:DEADLYSIGNAL = ==40364==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0040125c bp 0x7ffd2e2dc240 sp 0x7ffd2e2dc140 T0) so I am inclined to think that something isn't guaranteeing that `%rdx` is actually 64-byte aligned (but it may be 32-byte aligned, given that I can't reproduce with 32 byte vectors).
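The quoted source lost its include targets and template parameters in transit; a reconstruction of the reproducer (the header names and the template parameter types are assumptions, the rest follows the quoted fragments) looks like:
```
#include <cstddef>
#include <cstdint>

// Assumed shape of the attached test file: a 64-byte (8 x int64_t) vector
// built with the vector_size attribute, returned by value from a function
// whose frame ASan + -fstack-protector-strong instrument.
template <std::ptrdiff_t W, typename T>
using Vec [[gnu::vector_size(W * sizeof(T))]] = T;

auto foo() {
  Vec<8, int64_t> ret{};
  return ret;
}

int main() {
  foo();
  return 0;
}
```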
[Bug target/114276] Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276 --- Comment #1 from Chris Elrod --- Created attachment 57652 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57652&action=edit assembly from adding `-S`
[Bug target/110027] Misaligned vector store on detect_stack_use_after_return
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027 --- Comment #9 from Chris Elrod ---
> Interestingly this seems to be only reproducible on Arch Linux. Other gcc
> 13.1.1 builds, Fedora for instance, seem to behave correctly.
I haven't tried that reproducer on Fedora with gcc 13.2.1, which could have regressed since 13.1.1. However, the dup example in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276 does reproduce on Fedora with gcc 13.2.1 once you add the extra compile flags `-std=c++23 -fstack-protector-strong`. I'll try the original reproducer later; it may be that adding/removing these flags fuzzes the alignment.
[Bug c++/111493] New: [concepts] multidimensional subscript operator inside requires is broken
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111493 Bug ID: 111493 Summary: [concepts] multidimensional subscript operator inside requires is broken Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- Two example programs: > #include > constexpr auto foo(const auto &A, int i, int j) > requires(requires(decltype(A) a, int ii) { a[ii, ii]; }) { >return A[i, j]; > } > constexpr auto foo(const auto &A, int i, int j) { >return A + i + j; > } > static_assert(foo(2,3,4) == 9); > #include > template > concept CartesianIndexable = requires(T t, int i) { >{ t[i, i] } -> std::convertible_to; > }; > static_assert(!CartesianIndexable); These result in errors of the form error: invalid types 'const int[int]' for array subscript Here is godbolt for reference: https://godbolt.org/z/WE66nY8zG The invalid subscript should result in the `requires` failing, not an error.
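The two quoted programs lost their template arguments to formatting; a minimal sketch with the same shape (the concrete types chosen here are illustrative, not the exact attachment) is:
```
#include <concepts>

// A requires-expression using the C++23 multidimensional subscript.
// For a type without a two-argument operator[], the expression should be a
// substitution failure, making the concept false; the report is that gcc
// instead emits a hard error ("invalid types ... for array subscript").
template <typename T, typename S>
concept CartesianIndexable = requires(T t, int i) {
  { t[i, i] } -> std::convertible_to<S>;
};

static_assert(!CartesianIndexable<const int *, int>);
```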
[Bug c++/111493] [concepts] multidimensional subscript operator inside requires is broken
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111493 --- Comment #2 from Chris Elrod --- Note that it also shows up in gcc-13. I put gcc-14 as the version to indicate that I confirmed it is still a problem on the latest trunk. I'm not sure what the policy is on which version to report.
[Bug c++/93008] Need a way to make inlining heuristics ignore whether a function is inline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008 --- Comment #14 from Chris Elrod --- To me, an "inline" function is one that the compiler inlines. It just happens that the `inline` keyword also means comdat semantics and possibly hiding the symbol to make it internal (-fvisibility-inlines-hidden). It also happens that the vast majority of the time I mark a function `inline`, it is for those reasons, not for the compiler hint. `static` of course also specifies internal linkage, but I generally prefer the comdat semantics: I'd rather merge the definitions than duplicate them. If there is to be a new keyword or pragma meaning comdat semantics (and preferably also specifying internal linkage), I would rather its name reference that. I'd rather the name say positively what it does than negatively what it does not: "quasi_inline: like inline, except it does everything inline does except the inline part". Why define it as a set difference -- naming it after the thing it does not do -- when you could define it in the affirmative, by what it does in the first place?
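A small illustration of the two properties being conflated (my example, not from the report); the comdat/ODR-merging behaviour is the part usually wanted, independently of the inlining hint:
```
// inline: external linkage with a comdat definition; every translation unit
// that includes this emits the same symbol and the linker merges them.
inline int merged_once() { return 1; }

// static: internal linkage; every translation unit gets its own private
// copy, so definitions are duplicated rather than merged.
static int duplicated_per_tu() { return 2; }
```
What the comment asks for is, in effect, a spelling that requests the first behaviour (plus, ideally, internal linkage) without also feeding the inliner's heuristics.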
[Bug tree-optimization/112824] New: Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 Bug ID: 112824 Summary: Stack spills and vector splitting with vector builtins Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: elrodc at gmail dot com Target Milestone: --- I am not sure which component to place this under, but selected tree-optimization as I suspect this is some sort of alias analysis failure preventing the removal of stack allocations. Godbolt link, reproduces on GCC trunk and 13.2: https://godbolt.org/z/4TPx17Mbn Clang has similar problems in my actual test case, but they don't show up in this minimal example I made. Although Clang isn't perfect here either: it fails to fuse fmadd + masked vmovapd, while GCC does succeed in fusing them. For reference, code behind the godbolt link is: #include #include #include #include template using Vec [[gnu::vector_size(W * sizeof(T))]] = T; // Omitted: 16 without AVX, 32 without AVX512F, // or for forward compatibility some AVX10 may also mean 32-only static constexpr ptrdiff_t VectorBytes = 64; template static constexpr ptrdiff_t VecWidth = 64 <= sizeof(T) ? 1 : 64/sizeof(T); template struct Vector{ static constexpr ptrdiff_t L = N; T data[L]; static constexpr auto size()->ptrdiff_t{return N;} }; template struct Vector{ static constexpr ptrdiff_t W = N >= VecWidth ? VecWidth : ptrdiff_t(std::bit_ceil(size_t(N))); static constexpr ptrdiff_t L = (N/W) + ((N%W)!=0); using V = Vec; V data[L]; static constexpr auto size()->ptrdiff_t{return N;} }; /// should be trivially copyable /// codegen is worse when passing by value, even though it seems like it should make /// aliasing simpler to analyze? template [[gnu::always_inline]] constexpr auto operator+(Vector x, Vector y) -> Vector { Vector z; for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] + y.data[n]; return z; } template [[gnu::always_inline]] constexpr auto operator*(Vector x, Vector y) -> Vector { Vector z; for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] * y.data[n]; return z; } template [[gnu::always_inline]] constexpr auto operator+(T x, Vector y) -> Vector { Vector z; for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x + y.data[n]; return z; } template [[gnu::always_inline]] constexpr auto operator*(T x, Vector y) -> Vector { Vector z; for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x * y.data[n]; return z; } template struct Dual { T value; Vector partials; }; // Here we have a specialization for non-power-of-2 `N` template requires(std::floating_point && (std::popcount(size_t(N))>1)) struct Dual { Vector data; }; template consteval auto firstoff(){ static_assert(std::same_as, "type not implemented"); if constexpr (W==2) return Vec<2,int64_t>{0,1} != 0; else if constexpr (W == 4) return Vec<4,int64_t>{0,1,2,3} != 0; else if constexpr (W == 8) return Vec<8,int64_t>{0,1,2,3,4,5,6,7} != 0; else static_assert(false, "vector width not implemented"); } template [[gnu::always_inline]] constexpr auto operator+(Dual a, Dual b) -> Dual { if constexpr (std::floating_point && (std::popcount(size_t(N))>1)){ Dual c; for (ptrdiff_t l = 0; l < Vector::L; ++l) c.data.data[l] = a.data.data[l] + b.data.data[l]; return c; } else return {a.value + b.value, a.partials + b.partials}; } template [[gnu::always_inline]] constexpr auto operator*(Dual a, Dual b) -> Dual { if constexpr (std::floating_point && (std::popcount(size_t(N))>1)){ using V = typename Vector::V; V va = V{}+a.data.data[0][0], vb = 
V{}+b.data.data[0][0]; V x = va * b.data.data[0]; Dual c; c.data.data[0] = firstoff::W,T>() ? x + vb*a.data.data[0] : x; for (ptrdiff_t l = 1; l < Vector::L; ++l) c.data.data[l] = va*b.data.data[l] + vb*a.data.data[l]; return c; } else return {a.value * b.value, a.value * b.partials + b.value * a.partials}; } void prod(Dual,2> &c, const Dual,2> &a, const Dual,2>&b){ c = a*b; } void prod(Dual,2> &c, const Dual,2> &a, const Dual,2>&b){ c = a*b; } GCC 13.2 asm, when compiling with -std=gnu++23 -march=skylake-avx512 -mprefer-vector-width=512 -O3 prod(Dual, 2l>&, Dual, 2l> const&, Dual, 2l> const&): pushrbp mov eax, -2 kmovb k1, eax mov rbp, rsp and rsp, -64 sub rsp, 264 vmovdqa ymm4, YMMWORD PTR [rsi+128] vmovapd zmm8, ZMMWORD PTR [rsi]
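The inlined listing above lost most of its template arguments; the structure being discussed is roughly the following reduced sketch (parameter types are assumed from the surviving fragments; the godbolt link has the real code):
```
#include <bit>
#include <cstddef>
#include <cstdint>

template <std::ptrdiff_t W, typename T>
using Vec [[gnu::vector_size(W * sizeof(T))]] = T;

// Elements of T per 64-byte register (1 if T itself is at least 64 bytes).
template <typename T>
constexpr std::ptrdiff_t VecWidth = 64 <= sizeof(T) ? 1 : 64 / sizeof(T);

// A fixed-size "vector of hardware vectors": N logical elements stored in
// L builtin vectors of width W.  The complaint is that structs built from
// these (the Dual wrappers), despite being plain aggregates of 512-bit
// builtin vectors, get split into ymm halves and spilled to the stack.
template <typename T, std::ptrdiff_t N>
struct Vector {
  static constexpr std::ptrdiff_t W =
      N >= VecWidth<T> ? VecWidth<T>
                       : std::ptrdiff_t(std::bit_ceil(std::size_t(N)));
  static constexpr std::ptrdiff_t L = N / W + (N % W != 0);
  Vec<W, T> data[L];
};

template <typename T, std::ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(Vector<T, N> x, Vector<T, N> y)
    -> Vector<T, N> {
  Vector<T, N> z;
  for (std::ptrdiff_t n = 0; n < Vector<T, N>::L; ++n)
    z.data[n] = x.data[n] + y.data[n];
  return z;
}
```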
[Bug tree-optimization/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #1 from Chris Elrod --- Here I have added a godbolt example where I manually unroll the array, and GCC generates excellent code: https://godbolt.org/z/sd4bhGW7e I'm not sure it is 100% optimal, but with an inner Dual size of `7`, on Skylake-X it is 38 uops for the unrolled GCC version with separate struct fields, vs 49 uops for Clang, vs 67 for GCC with arrays. uiCA expects <14 clock cycles for the manually unrolled version vs >23 for the array version. My experience so far with expression templates has borne this out: compilers seem to struggle with peeling away the abstractions.
[Bug middle-end/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #2 from Chris Elrod --- https://godbolt.org/z/3648aMTz8 Perhaps a simpler diff: you can reproduce with the pragma below commented out, while codegen becomes good with it.
template constexpr auto operator*(OuterDualUA2 a, OuterDualUA2 b)->OuterDualUA2{
  //return {a.value*b.value,a.value*b.p[0]+b.value*a.p[0],a.value*b.p[1]+b.value*a.p[1]};
  OuterDualUA2 c;
  c.value = a.value*b.value;
#pragma GCC unroll 16
  for (ptrdiff_t i = 0; i < 2; ++i)
    c.p[i] = a.value*b.p[i] + b.value*a.p[i];
  //c.p[0] = a.value*b.p[0] + b.value*a.p[0];
  //c.p[1] = a.value*b.p[1] + b.value*a.p[1];
  return c;
}
It's not great to have to add pragmas everywhere in my actual codebase. I thought I hit the important cases, but my non-minimal example still gets unnecessary register splits and stack spills, so maybe I missed places, or perhaps there's another issue. Given that GCC unrolls the above code even without the pragma, it seems like a definite bug that the pragma is needed for the resulting code generation to actually be good. Not knowing the compiler pipeline, my naive guess is that the pragma causes earlier unrolling than whatever optimization pass does it without the pragma, and that some important analysis/optimization runs between those two points.
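For reference, a cleaned-up version of that operator with the lost template arguments restored by assumption (`OuterDualUA2` is taken to be a value plus two partials; the real definition is in the godbolt link):
```
#include <cstddef>

template <typename T>
struct OuterDualUA2 {
  T value;
  T p[2];
};

template <typename T>
constexpr auto operator*(OuterDualUA2<T> a, OuterDualUA2<T> b)
    -> OuterDualUA2<T> {
  OuterDualUA2<T> c;
  c.value = a.value * b.value;
  // The pragma below is the workaround discussed above: without it, gcc
  // still unrolls the loop, but the surrounding codegen regresses.
#pragma GCC unroll 16
  for (std::ptrdiff_t i = 0; i < 2; ++i)
    c.p[i] = a.value * b.p[i] + b.value * a.p[i];
  return c;
}
```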
[Bug middle-end/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #3 from Chris Elrod --- > I thought I hit the important cases, but my non-minimal example still gets > unnecessary register splits and stack spills, so maybe I missed places, or > perhaps there's another issue. Adding the unroll pragma to the `Vector`'s operator + and *: template [[gnu::always_inline]] constexpr auto operator+(Vector x, Vector y) -> Vector { Vector z; #pragma GCC unroll 16 for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] + y.data[n]; return z; } template [[gnu::always_inline]] constexpr auto operator*(Vector x, Vector y) -> Vector { Vector z; #pragma GCC unroll 16 for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x.data[n] * y.data[n]; return z; } template [[gnu::always_inline]] constexpr auto operator+(T x, Vector y) -> Vector { Vector z; #pragma GCC unroll 16 for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x + y.data[n]; return z; } template [[gnu::always_inline]] constexpr auto operator*(T x, Vector y) -> Vector { Vector z; #pragma GCC unroll 16 for (ptrdiff_t n = 0; n < Vector::L; ++n) z.data[n] = x * y.data[n]; return z; } does not improve code generation (still get the same problem), so that's a reproducer for such an issue.
[Bug middle-end/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #6 from Chris Elrod --- Hongtao Liu, I do think that one should ideally be able to get optimal codegen when using 512-bit builtin vectors or vector intrinsics, without needing to set `-mprefer-vector-width=512` (and, currently, also setting `-mtune-ctrl=avx512_move_by_pieces`). For example, if I remove `-mprefer-vector-width=512`, I get prod(Dual, 2l>&, Dual, 2l> const&, Dual, 2l> const&): pushrbp mov eax, -2 kmovb k1, eax mov rbp, rsp and rsp, -64 sub rsp, 264 vmovdqa ymm4, YMMWORD PTR [rsi+128] vmovapd zmm8, ZMMWORD PTR [rsi] vmovapd zmm9, ZMMWORD PTR [rdx] vmovdqa ymm6, YMMWORD PTR [rsi+64] vmovdqa YMMWORD PTR [rsp+8], ymm4 vmovdqa ymm4, YMMWORD PTR [rdx+96] vbroadcastsdzmm0, xmm8 vmovdqa ymm7, YMMWORD PTR [rsi+96] vbroadcastsdzmm1, xmm9 vmovdqa YMMWORD PTR [rsp-56], ymm6 vmovdqa ymm5, YMMWORD PTR [rdx+128] vmovdqa ymm6, YMMWORD PTR [rsi+160] vmovdqa YMMWORD PTR [rsp+168], ymm4 vxorpd xmm4, xmm4, xmm4 vaddpd zmm0, zmm0, zmm4 vaddpd zmm1, zmm1, zmm4 vmovdqa YMMWORD PTR [rsp-24], ymm7 vmovdqa ymm7, YMMWORD PTR [rdx+64] vmovapd zmm3, ZMMWORD PTR [rsp-56] vmovdqa YMMWORD PTR [rsp+40], ymm6 vmovdqa ymm6, YMMWORD PTR [rdx+160] vmovdqa YMMWORD PTR [rsp+200], ymm5 vmulpd zmm2, zmm0, zmm9 vmovdqa YMMWORD PTR [rsp+136], ymm7 vmulpd zmm5, zmm1, zmm3 vbroadcastsdzmm3, xmm3 vmovdqa YMMWORD PTR [rsp+232], ymm6 vaddpd zmm3, zmm3, zmm4 vmovapd zmm7, zmm2 vmovapd zmm2, ZMMWORD PTR [rsp+8] vfmadd231pd zmm7{k1}, zmm8, zmm1 vmovapd zmm6, zmm5 vmovapd zmm5, ZMMWORD PTR [rsp+136] vmulpd zmm1, zmm1, zmm2 vfmadd231pd zmm6{k1}, zmm9, zmm3 vbroadcastsdzmm2, xmm2 vmovapd zmm3, ZMMWORD PTR [rsp+200] vaddpd zmm2, zmm2, zmm4 vmovapd ZMMWORD PTR [rdi], zmm7 vfmadd231pd zmm1{k1}, zmm9, zmm2 vmulpd zmm2, zmm0, zmm5 vbroadcastsdzmm5, xmm5 vmulpd zmm0, zmm0, zmm3 vbroadcastsdzmm3, xmm3 vaddpd zmm5, zmm5, zmm4 vaddpd zmm3, zmm3, zmm4 vfmadd231pd zmm2{k1}, zmm8, zmm5 vfmadd231pd zmm0{k1}, zmm8, zmm3 vaddpd zmm2, zmm2, zmm6 vaddpd zmm0, zmm0, zmm1 vmovapd ZMMWORD PTR [rdi+64], zmm2 vmovapd ZMMWORD PTR [rdi+128], zmm0 vzeroupper leave ret prod(Dual, 2l>&, Dual, 2l> const&, Dual, 2l> const&): pushrbp mov rbp, rsp and rsp, -64 sub rsp, 648 vmovdqa ymm5, YMMWORD PTR [rsi+224] vmovdqa ymm3, YMMWORD PTR [rsi+352] vmovapd zmm0, ZMMWORD PTR [rdx+64] vmovdqa ymm2, YMMWORD PTR [rsi+320] vmovdqa YMMWORD PTR [rsp+104], ymm5 vmovdqa ymm5, YMMWORD PTR [rdx+224] vmovdqa ymm7, YMMWORD PTR [rsi+128] vmovdqa YMMWORD PTR [rsp+232], ymm3 vmovsd xmm3, QWORD PTR [rsi] vmovdqa ymm6, YMMWORD PTR [rsi+192] vmovdqa YMMWORD PTR [rsp+488], ymm5 vmovdqa ymm4, YMMWORD PTR [rdx+192] vmovapd zmm1, ZMMWORD PTR [rsi+64] vbroadcastsdzmm5, xmm3 vmovdqa YMMWORD PTR [rsp+200], ymm2 vmovdqa ymm2, YMMWORD PTR [rdx+320] vmulpd zmm8, zmm5, zmm0 vmovdqa YMMWORD PTR [rsp+8], ymm7 vmovdqa ymm7, YMMWORD PTR [rsi+256] vmovdqa YMMWORD PTR [rsp+72], ymm6 vmovdqa ymm6, YMMWORD PTR [rdx+128] vmovdqa YMMWORD PTR [rsp+584], ymm2 vmovsd xmm2, QWORD PTR [rdx] vmovdqa YMMWORD PTR [rsp+136], ymm7 vmovdqa ymm7, YMMWORD PTR [rdx+256] vmovdqa YMMWORD PTR [rsp+392], ymm6 vmovdqa ymm6, YMMWORD PTR [rdx+352] vmulsd xmm10, xmm3, xmm2 vmovdqa YMMWORD PTR [rsp+456], ymm4 vbroadcastsdzmm4, xmm2 vfmadd231pd zmm8, zmm4, zmm1 vmovdqa YMMWORD PTR [rsp+520], ymm7 vmovdqa YMMWORD PTR [rsp+616], ymm6 vmulpd zmm9, zmm4, ZMMWORD PTR [rsp+72] vmovsd xmm6, QWORD PTR [rsp+520] vmulpd zmm4, zmm4, ZMMWORD PTR [rsp+200] vmulpd zmm11, zmm5, ZMMWORD PTR [rsp+456] vmovsd QWORD PTR [rdi], xmm10 vmulpd zmm5, zmm5, ZMMWORD PTR [rsp+584] vmovapd 
ZMMWORD PTR [rdi+64], zmm8 vfmadd231pd zmm9, zmm0, QWORD PTR [rsp+8]{1to8} vfmadd231pd zmm4, zmm0, QWORD PTR [rsp+136]{1to8} vmovsd xmm0, QWORD PTR [rsp+392] vmulsd xmm7, xmm3, xmm0 vbroadcastsdzmm0, xmm0 vmulsd xmm3, xmm3, xmm6 vfmadd132pd zmm0, zmm11, zmm1 vbroadcastsdzmm6, xmm6 vfmadd132pd zmm1, zmm5, zmm6 vfmadd231sd xmm7, xmm2, QWORD PTR [rsp+8] vfmadd132sd
[Bug middle-end/112824] Stack spills and vector splitting with vector builtins
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824 --- Comment #8 from Chris Elrod --- > If it's designed the way you want it to be, another issue would be like, > should we lower 512-bit vector builtins/intrinsic to ymm/xmm when > -mprefer-vector-width=256, the answer is we'd rather not. To be clear, what I meant by > it would be great to respect > `-mprefer-vector-width=512`, it should ideally also be able to respect > vector builtins/intrinsics is that when someone uses 512 bit vector builtins, that codegen should generate 512 bit code regardless of `mprefer-vector-width` settings. That is, as a developer, I would want 512 bit builtins to mean we get 512-bit vector code generation. > If user explicitly use 512-bit vector type, builtins or intrinsics, gcc will > generate zmm no matter -mprefer-vector-width=. This is what I would want, and I'd also want it to apply to movement of `struct`s holding vector builtin objects, instead of the `ymm` usage as we see here. > And yes, there could be some mismatches between 512-bit intrinsic and > architecture tuning when you're using 512-bit intrinsic, and also rely on > compiler autogen to handle struct > For such case, an explicit -mprefer-vector-width=512 is needed. Note the template partial specialization template struct Vector{ static constexpr ptrdiff_t W = N >= VecWidth ? VecWidth : ptrdiff_t(std::bit_ceil(size_t(N))); static constexpr ptrdiff_t L = (N/W) + ((N%W)!=0); using V = Vec; V data[L]; static constexpr auto size()->ptrdiff_t{return N;} }; Thus, `Vector`s in this example may explicitly be structs containing arrays of vector builtins. I would expect these structs to not need an `mprefer-vector-width=512` setting for producing 512 bit code handling this struct. Given small `L`, I would also expect passing this struct as an argument by value to a non-inlined function to be done in `zmm` registers when possible, for example.
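As a concrete illustration of that expectation (my own sketch, not code from the PR): with an explicitly 64-byte vector member, one would want the copy below, and by-value passing of `Wrapped`, to use zmm moves even under -mprefer-vector-width=256, since the width was spelled out by the user rather than chosen by the auto-vectorizer.
```
// Explicit 64-byte builtin vector type.
typedef double Vec8d __attribute__((vector_size(64)));

// An aggregate of explicit 512-bit vectors, analogous to Vector/Dual above.
struct Wrapped {
  Vec8d v[2];
};

// Expectation stated in the comment: because the member type is explicitly
// 512 bits wide, this copy should be done with zmm loads/stores regardless
// of the -mprefer-vector-width= tuning.
void copy(Wrapped &dst, const Wrapped &src) { dst = src; }
```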