[Bug middle-end/95899] New: -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains

2020-06-25 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899

Bug ID: 95899
   Summary: -funroll-loops does not duplicate accumulators when
calculating reductions, failing to break up dependency
chains
   Product: gcc
   Version: 10.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: elrodc at gmail dot com
  Target Milestone: ---

Created attachment 48784
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48784&action=edit
cc -march=skylake-avx512 -mprefer-vector-width=512 -Ofast -funroll-loops -S
dot.c -o dot.s

Sample code:

```
double dot(double* a, double* b, long N){
  double s = 0.0;
  for (long n = 0; n < N; n++){
    s += a[n] * b[n];
  }
  return s;
}
```

Relevant part of the asm:
```
.L4:
vmovupd (%rdi,%r11), %zmm8
vmovupd 64(%rdi,%r11), %zmm9
vfmadd231pd (%rsi,%r11), %zmm8, %zmm0
vmovupd 128(%rdi,%r11), %zmm10
vmovupd 192(%rdi,%r11), %zmm11
vmovupd 256(%rdi,%r11), %zmm12
vmovupd 320(%rdi,%r11), %zmm13
vfmadd231pd 64(%rsi,%r11), %zmm9, %zmm0
vmovupd 384(%rdi,%r11), %zmm14
vmovupd 448(%rdi,%r11), %zmm15
vfmadd231pd 128(%rsi,%r11), %zmm10, %zmm0
vfmadd231pd 192(%rsi,%r11), %zmm11, %zmm0
vfmadd231pd 256(%rsi,%r11), %zmm12, %zmm0
vfmadd231pd 320(%rsi,%r11), %zmm13, %zmm0
vfmadd231pd 384(%rsi,%r11), %zmm14, %zmm0
vfmadd231pd 448(%rsi,%r11), %zmm15, %zmm0
addq    $512, %r11
cmpq    %r8, %r11
jne .L4

```

Skylake-AVX512's vfmadd should have a throughput of 2/cycle, but a latency of
4 cycles.

Because each unrolled instance accumulates into `%zmm0`, we are limited by the
dependency chain to 1 fma every 4 cycles.

It should use separate accumulators.

Additionally, if the loads are aligned, the core has a throughput of 2
loads/cycle. Because we need 2 loads per fma, that limits us to only 1 fma per
cycle. If the dependency chain were the primary motivation for unrolling, we'd
only want to unroll by 4, not 8: 4 cycles of latency at 1 fma per cycle -> 4
simultaneous / out-of-order fmas.

Something like a sum (1 load per add) would perform better with the 8x
unrolling seen here (at least, from 100 or so elements until it becomes memory
bound).
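
For illustration, here is a minimal C sketch (mine, not part of the attached
dot.c) of what duplicating the accumulator looks like at the source level; with
four independent partial sums, the fmas from different iterations no longer wait
on each other:

```
/* Hypothetical manual version of dot() with 4 accumulators.
   Each of s0..s3 carries its own dependency chain, so up to 4 fmas
   can be in flight at once on a 4-cycle-latency FMA unit. */
double dot4(const double* a, const double* b, long N){
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  long n = 0;
  for (; n + 4 <= N; n += 4){
    s0 += a[n+0] * b[n+0];
    s1 += a[n+1] * b[n+1];
    s2 += a[n+2] * b[n+2];
    s3 += a[n+3] * b[n+3];
  }
  double s = (s0 + s1) + (s2 + s3);
  for (; n < N; n++)   /* remainder */
    s += a[n] * b[n];
  return s;
}
```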

[Bug middle-end/95899] -funroll-loops does not duplicate accumulators when calculating reductions, failing to break up dependency chains

2020-06-25 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95899

--- Comment #2 from Chris Elrod  ---
Interesting. Compiling with:

gcc -march=native -fvariable-expansion-in-unroller -Ofast -funroll-loops -S
dot.c -o dot.s

Yields:

```
.L4:
vmovupd (%rdi,%r11), %zmm9
vmovupd 64(%rdi,%r11), %zmm10
vfmadd231pd (%rsi,%r11), %zmm9, %zmm0
vfmadd231pd 64(%rsi,%r11), %zmm10, %zmm1
vmovupd 128(%rdi,%r11), %zmm11
vmovupd 192(%rdi,%r11), %zmm12
vmovupd 256(%rdi,%r11), %zmm13
vfmadd231pd 128(%rsi,%r11), %zmm11, %zmm0
vfmadd231pd 192(%rsi,%r11), %zmm12, %zmm1
vmovupd 320(%rdi,%r11), %zmm14
vmovupd 384(%rdi,%r11), %zmm15
vmovupd 448(%rdi,%r11), %zmm4
vfmadd231pd 256(%rsi,%r11), %zmm13, %zmm0
vfmadd231pd 320(%rsi,%r11), %zmm14, %zmm1
vfmadd231pd 384(%rsi,%r11), %zmm15, %zmm0
vfmadd231pd 448(%rsi,%r11), %zmm4, %zmm1
addq    $512, %r11
cmpq    %r8, %r11
jne .L4
```

So the dependency chain has now been split in two; four accumulators would be
ideal. I'll try running benchmarks later to see how it does.
FWIW, the original ran at between 20 and 25 GFLOPS from roughly N = 80 through
N = 1024.
The fastest versions I benchmarked climbed from around 20 to 50 GFLOPS over
this range. So perhaps just splitting the dependency once can get it much of
the way there.

Out of curiosity, what's the reason for this being off by default for
everything but ppc?
It seems like it should be turned on with `-funroll-loops`, given that breaking
dependency chains is one of the primary ways unrolling can actually help
performance.

[Bug fortran/88713] New: _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

Bug ID: 88713
   Summary: _gfortran_internal_pack@PLT prevents vectorization
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: elrodc at gmail dot com
  Target Milestone: ---

Created attachment 45350
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45350&action=edit
Fortran version of vectorization test.

I am attaching Fortran and C++ translations of a simple working example.

The C++ version is vectorized, while the Fortran version is not.

The code consists of two functions. One simply runs a for loop, calling the
other function.
The function is vectorizable across loop iterations. g++ does this
successfully.

However, gfortran does not, because it repacks the data with a call to
_gfortran_internal_pack@PLT,
so that it can no longer be vectorized across iterations.


I compiled with:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512
-fno-semantic-interposition -shared -fPIC -S vectorization_test.f90 -o
gfortvectorization_test.s

g++ -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC -S
vectorization_test.cpp -o gppvectorization_test.s


LLVM (via flang and clang) successfully vectorizes both versions.

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #1 from Chris Elrod  ---
Created attachment 45351
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45351&action=edit
C++ version of the vectorization test case.

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #2 from Chris Elrod  ---
Created attachment 45352
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45352&action=edit
gfortran assembly output

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-05 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #3 from Chris Elrod  ---
Created attachment 45353
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45353&action=edit
g++ assembly output

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #6 from Chris Elrod  ---
Created attachment 45356
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45356&action=edit
Code to demonstrate that transposing makes things slower.

Thomas Koenig, I am well aware that Fortran is column major. That is precisely
why I chose the memory layout I did.

Benchmark of the "optimal" corrected code:

@benchmark gforttest($X32t, $BPP32t, $N)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time: 20.647 μs (0.00% GC)
  median time:  20.860 μs (0.00% GC)
  mean time:    21.751 μs (0.00% GC)
  maximum time: 47.760 μs (0.00% GC)
  --
  samples:  10000
  evals/sample: 1


Here is a benchmark (compiling with Flang) of my code, exactly as written
(suboptimal) in the attachments:

@benchmark flangtest($X32,  $BPP32,  $N)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time: 658.329 ns (0.00% GC)
  median time:  668.012 ns (0.00% GC)
  mean time:    692.384 ns (0.00% GC)
  maximum time: 1.192 μs (0.00% GC)
  --
  samples:  10000
  evals/sample: 161


That is 20 microseconds, vs 670 nanoseconds.

N was 1024, and the exact same data was used in both cases (but pretransposed,
so I do not benchmark transposing).
Benchmarking was done by compiling shared libraries, and using `ccall` and
BenchmarkTools from Julia. As indicated by the reports, the benchmark was run
10,000 times for gfortran, and 1.61 million times for Flang, to get accurate
timings.

I compiled with (march=native is equivalent to march=skylake-avx512):
gfortran -Ofast -march=native -mprefer-vector-width=512
-fno-semantic-interposition -shared -fPIC vectorization_test_transposed.f90 -o
libgfortvectorization_test.so
flang -Ofast -march=native -mprefer-vector-width=512 -shared -fPIC
vectorization_test.f90 -o libflangvectorization_test.so

Flang was built with LLVM 7.0.1.


The "suboptimal" code was close to 32 times faster than the "optimal" code.
I was expecting it to be closer to 16 times faster, given the vector width.


To go into more detail:

"
Fortran lays out the memory for that array as

BPP(1,1), BPP(2,1), BPP(3,1), BPP(4,1), ..., BPP(1,2)

so you are accessing your memory with a stride of n in the
expressions BPP(i,1:3) and BPP(i,5:10). This is very inefficient
anyway, vectorization would not really help in this case.
"

Yes, each call to fpdbacksolve is accessing memory across strides.
But fpdbacksolve itself cannot be vectorized well at all.

What does work, however, is vectorizing across loop iterations.

For example, imagine calling fpdbacksolve on this:

BPP(1:16,1), BPP(1:16,2), BPP(1:16,3), BPP(1:16,5), ..., BPP(1:16,10)

and then performing every single scalar operation defined in fpdbacksolve on an
entire SIMD vector of floats (that is, on 16 floats) at a time.

That would of course require inlining fpdbacksolve (which was achieved with
-fno-semantic-interposition, as the assembly shows), and recompiling it.

Perhaps another way you can imagine it is that fpdbacksolve takes in 9 numbers
(BPP(:,4) was unused), and returns 3 numbers.
Because operations within it aren't vectorizable, we want to vectorize it
ACROSS loop iterations, not within them.
So to facilitate that, we have 9 vectors of contiguous inputs, and 3 vectors of
contiguous outputs. Now, all inputs1 are stored contiguously, as are all
inputs2, etc..., allowing the inputs to efficiently be loaded into SIMD
registers, and each loop iteration to calculate [SIMD vector width] of the
outputs at a time.

Of course, it is inconvenient to handle a dozen vectors. So if they all have
the same length, we can just concatenate them together.
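
To make the layout concrete, here is a small C sketch (an illustration of the
idea only, not the attached Fortran code; the arithmetic is a placeholder):

```
/* Hypothetical stand-in for the loop over fpdbacksolve calls: `in` is N x 9
   and `out` is N x 3, both column-major, so input k of sample i lives at
   in[i + k*N].  Consecutive samples are contiguous, which lets the compiler
   run the whole scalar body on 16 samples at a time in zmm registers. */
void process_soa(float* restrict out, const float* restrict in, long N){
  for (long i = 0; i < N; i++){
    float a = in[i + 0*N], b = in[i + 1*N], c = in[i + 8*N];
    /* placeholder arithmetic standing in for the real scalar body */
    out[i + 0*N] = a*b + c;
    out[i + 1*N] = b - c;
    out[i + 2*N] = a/(b*b + 1.0f);
  }
}
```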


I'll attach the assembly of both code examples as well.
The assembly makes it clear that the "suboptimal" way was vectorized, and the
"optimal" way was not.

The benchmarks make it resoundingly clear that the vectorized ("suboptimal")
version was dramatically faster.

As is, this is a missed optimization, and gfortran falls severely behind
LLVM-based Flang in performance on the highest-performance version of the code.

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #7 from Chris Elrod  ---
Created attachment 45357
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45357&action=edit
Assembly generated by Flang compiler on the original version of the code.

This is the main loop body in the Flang compiled version of the original code
(starts line 132):

.LBB1_8:# %vector.body
# =>This Inner Loop Header: Depth=1
leaq    (%rsi,%rbx,4), %r12
vmovups (%rcx,%r12), %zmm2
addq    %rcx, %r12
leaq    (%r12,%rcx), %rbp
vmovups (%r11,%rbp), %zmm3
addq    %r11, %rbp
leaq    (%rcx,%rbp), %r13
leaq    (%rcx,%r13), %r8
leaq    (%r8,%rcx), %r10
leaq    (%r10,%rcx), %r14
vmovups (%rcx,%r14), %zmm4
vrsqrt14ps  %zmm4, %zmm5
vmulps  %zmm5, %zmm4, %zmm4
vfmadd213ps %zmm0, %zmm5, %zmm4 # zmm4 = (zmm5 * zmm4) + zmm0
vmulps  %zmm1, %zmm5, %zmm5
vmulps  %zmm4, %zmm5, %zmm4
.Ltmp1:
.loc1 31 1 is_stmt 1# vectorization_test.f90:31:1
vmulps  (%rcx,%r8), %zmm4, %zmm5
.loc1 32 1  # vectorization_test.f90:32:1
vmulps  (%rcx,%r10), %zmm4, %zmm6
vmovups (%rcx,%r13), %zmm7
.loc1 33 1  # vectorization_test.f90:33:1
vfnmadd231ps    %zmm6, %zmm6, %zmm7 # zmm7 = -(zmm6 * zmm6) + zmm7
vrsqrt14ps  %zmm7, %zmm8
vmulps  %zmm8, %zmm7, %zmm7
vfmadd213ps %zmm0, %zmm8, %zmm7 # zmm7 = (zmm8 * zmm7) + zmm0
vmulps  %zmm1, %zmm8, %zmm8
vmulps  %zmm7, %zmm8, %zmm7
vmovups (%rcx,%rbp), %zmm8
.loc1 35 1  # vectorization_test.f90:35:1
vfnmadd231ps    %zmm5, %zmm6, %zmm8 # zmm8 = -(zmm6 * zmm5) + zmm8
vmulps  %zmm8, %zmm7, %zmm8
vmulps  %zmm5, %zmm5, %zmm9
vfmadd231ps %zmm8, %zmm8, %zmm9 # zmm9 = (zmm8 * zmm8) + zmm9
vsubps  %zmm9, %zmm3, %zmm3
vrsqrt14ps  %zmm3, %zmm9
vmulps  %zmm9, %zmm3, %zmm3
vfmadd213ps %zmm0, %zmm9, %zmm3 # zmm3 = (zmm9 * zmm3) + zmm0
vmulps  %zmm1, %zmm9, %zmm9
vmulps  %zmm3, %zmm9, %zmm3
.loc1 39 1  # vectorization_test.f90:39:1
vmulps  %zmm8, %zmm7, %zmm8
.loc1 40 1  # vectorization_test.f90:40:1
vmulps  (%rcx,%r12), %zmm4, %zmm4
.loc1 39 1  # vectorization_test.f90:39:1
vmulps  %zmm3, %zmm8, %zmm8
.loc1 41 1  # vectorization_test.f90:41:1
vmulps  %zmm8, %zmm2, %zmm9
vfmsub231ps (%rsi,%rbx,4), %zmm3, %zmm9 # zmm9 = (zmm3 * mem) - zmm9
vmulps  %zmm5, %zmm3, %zmm3
vfmsub231ps %zmm8, %zmm6, %zmm3 # zmm3 = (zmm6 * zmm8) - zmm3
vfmadd213ps %zmm9, %zmm4, %zmm3 # zmm3 = (zmm4 * zmm3) + zmm9
.loc1 42 1  # vectorization_test.f90:42:1
vmulps  %zmm4, %zmm6, %zmm5
vmulps  %zmm5, %zmm7, %zmm5
vfmsub231ps %zmm7, %zmm2, %zmm5 # zmm5 = (zmm2 * zmm7) - zmm5
.Ltmp2:
.loc1 15 1  # vectorization_test.f90:15:1
vmovups %zmm3, (%rdi,%rbx,4)
movq    -16(%rsp), %rbp # 8-byte Reload
vmovups %zmm5, (%rbp,%rbx,4)
vmovups %zmm4, (%rax,%rbx,4)
addq    $16, %rbx
cmpq    %rbx, %rdx
jne .LBB1_8



zmm registers are 64-byte registers. It vmovups data from memory into registers,
performs a series of arithmetic operations and inverse square roots on it, and
then vmovups three of these 64-byte registers back into memory.

That is the most efficient memory access pattern (as demonstrated empirically
via benchmarks).

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #8 from Chris Elrod  ---
Created attachment 45358
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45358&action=edit
gfortran compiled assembly for the transposed version of the original code.

Here is the assembly for the loop body of the transposed version of the code,
compiled by gfortran:


.L8:
vmovss  36(%rsi), %xmm0
addq$40, %rsi
vrsqrtss    %xmm0, %xmm2, %xmm2
addq    $12, %rdi
vmulss  %xmm0, %xmm2, %xmm0
vmulss  %xmm2, %xmm0, %xmm0
vmulss  %xmm7, %xmm2, %xmm2
vaddss  %xmm8, %xmm0, %xmm0
vmulss  %xmm2, %xmm0, %xmm0
vmulss  -8(%rsi), %xmm0, %xmm5
vmulss  -12(%rsi), %xmm0, %xmm4
vmulss  -32(%rsi), %xmm0, %xmm0
vmovaps %xmm5, %xmm3
vfnmadd213ss    -16(%rsi), %xmm5, %xmm3
vmovaps %xmm4, %xmm2
vfnmadd213ss    -20(%rsi), %xmm5, %xmm2
vmovss  %xmm0, -4(%rdi)
vrsqrtss%xmm3, %xmm1, %xmm1
vmulss  %xmm3, %xmm1, %xmm3
vmulss  %xmm1, %xmm3, %xmm3
vmulss  %xmm7, %xmm1, %xmm1
vaddss  %xmm8, %xmm3, %xmm3
vmulss  %xmm1, %xmm3, %xmm3
vmulss  %xmm3, %xmm2, %xmm6
vmovaps %xmm4, %xmm2
vfnmadd213ss    -24(%rsi), %xmm4, %xmm2
vfnmadd231ss    %xmm6, %xmm6, %xmm2
vrsqrtss    %xmm2, %xmm10, %xmm10
vmulss  %xmm2, %xmm10, %xmm1
vmulss  %xmm10, %xmm1, %xmm1
vmulss  %xmm7, %xmm10, %xmm10
vaddss  %xmm8, %xmm1, %xmm1
vmulss  %xmm10, %xmm1, %xmm1
vmulss  %xmm1, %xmm3, %xmm2
vmulss  %xmm6, %xmm2, %xmm2
vmovss  -36(%rsi), %xmm6
vxorps  %xmm9, %xmm2, %xmm2
vmulss  %xmm6, %xmm2, %xmm10
vmulss  %xmm2, %xmm5, %xmm2
vfmadd231ss -40(%rsi), %xmm1, %xmm10
vfmadd132ss %xmm4, %xmm2, %xmm1
vfnmadd132ss    %xmm0, %xmm10, %xmm1
vmulss  %xmm0, %xmm5, %xmm0
vmovss  %xmm1, -12(%rdi)
vsubss  %xmm0, %xmm6, %xmm0
vmulss  %xmm3, %xmm0, %xmm3
vmovss  %xmm3, -8(%rdi)
cmpq    %rsi, %rax
jne .L8


While Flang had a second loop of scalar code (to catch the N mod [SIMD vector
width] remainder of the vectorized loop), there are no secondary loops in the
gfortran code, meaning these must all be scalar operations (I have a hard time
telling apart SSE from scalar code...).

It looks similar, in the operations it performs, to Flang's vectorized loop,
except that it only operates on a single number at a time.
That is because efficient vectorization needs corresponding elements to be
contiguous (i.e., all the input1s together, all the input2s together).
We get no benefit from having all the different elements with the same
index (the first input1 next to the first input2, next to the first input3...)
stored contiguously.


The memory layout I used is performance-optimal, but is something that gfortran
unfortunately often cannot handle automatically (without manual unrolling).
This is why I filed a report on bugzilla.

[Bug fortran/88713] _gfortran_internal_pack@PLT prevents vectorization

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #10 from Chris Elrod  ---
(In reply to Thomas Koenig from comment #9)
> Hm.
> 
> It would help if your benchmark was complete, so I could run it.
> 

I don't suppose you happen to have, and be familiar with, Julia? If you do (or
someone else here does), I'll attach the code to generate the fake data (the
most important point is that columns 5:10 of BPP are the upper triangle of a 3x3
symmetric positive definite matrix).

I have also already written a manually unrolled version that gfortran likes.

But I could write Fortran code to create an executable and run benchmarks.
What are best practices? system_clock?

(In reply to Thomas Koenig from comment #9)
> 
> However, what happens if you put int
> 
> real, dimension(:) ::  Uix
> real, dimension(:), intent(in)  ::  x
> real, dimension(:), intent(in)  ::  S
> 
> ?
> 
> gfortran should not pack then.

You're right! I wasn't able to follow this exactly, because it didn't want me
to defer shape on Uix. Probably because it needs to compile a version of
fpdbacksolve that can be called from the shared library?

Interestingly, with that change, Flang failed to vectorize the code, but
gfortran did. Compilers are finicky.

Flang, original:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time: 655.827 ns (0.00% GC)
  median time:  665.698 ns (0.00% GC)
  mean time:    689.967 ns (0.00% GC)
  maximum time: 1.061 μs (0.00% GC)
  --
  samples:  1
  evals/sample: 162

Flang, not specifying shape: # assembly shows it is using xmm

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time: 8.086 μs (0.00% GC)
  median time:  8.315 μs (0.00% GC)
  mean time:    8.591 μs (0.00% GC)
  maximum time: 20.299 μs (0.00% GC)
  --
  samples:  1
  evals/sample: 3

gfortran, transposed version (not vectorizable): 

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time: 20.643 μs (0.00% GC)
  median time:  20.901 μs (0.00% GC)
  mean time:    21.441 μs (0.00% GC)
  maximum time: 54.103 μs (0.00% GC)
  --
  samples:  1
  evals/sample: 1

gfortran, not specifying shape:

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time: 1.290 μs (0.00% GC)
  median time:  1.316 μs (0.00% GC)
  mean time:    1.347 μs (0.00% GC)
  maximum time: 4.562 μs (0.00% GC)
  --
  samples:  1
  evals/sample: 10


Assembly confirms it is using zmm registers (and the time is in any case much
too fast for it not to be vectorized).


For why gfortran is still slower than the Flang version, here is the loop body:

.L16:
vmovups (%r10,%rax), %zmm0
vcmpps  $4, %zmm0, %zmm4, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}
vmulps  %zmm0, %zmm1, %zmm2
vmulps  %zmm1, %zmm2, %zmm0
vmulps  %zmm5, %zmm2, %zmm2
vaddps  %zmm6, %zmm0, %zmm0
vmulps  %zmm2, %zmm0, %zmm0
vrcp14ps    %zmm0, %zmm8
vmulps  %zmm0, %zmm8, %zmm0
vmulps  %zmm0, %zmm8, %zmm0
vaddps  %zmm8, %zmm8, %zmm8
vsubps  %zmm0, %zmm8, %zmm8
vmulps  (%r8,%rax), %zmm8, %zmm9
vmulps  (%r9,%rax), %zmm8, %zmm10
vmulps  (%r12,%rax), %zmm8, %zmm8
vmovaps %zmm9, %zmm3
vfnmadd213ps    0(%r13,%rax), %zmm9, %zmm3
vcmpps  $4, %zmm3, %zmm4, %k1
vrsqrt14ps  %zmm3, %zmm2{%k1}{z}
vmulps  %zmm3, %zmm2, %zmm3
vmulps  %zmm2, %zmm3, %zmm1
vmulps  %zmm5, %zmm3, %zmm3
vaddps  %zmm6, %zmm1, %zmm1
vmulps  %zmm3, %zmm1, %zmm1
vmovaps %zmm9, %zmm3
vfnmadd213ps    (%rdx,%rax), %zmm10, %zmm3
vrcp14ps    %zmm1, %zmm0
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm1
vaddps  %zmm0, %zmm0, %zmm0
vsubps  %zmm1, %zmm0, %zmm11
vmulps  %zmm11, %zmm3, %zmm12
vmovaps %zmm10, %zmm3
vfnmadd213ps    (%r14,%rax), %zmm10, %zmm3
vfnmadd231ps    %zmm12, %zmm12, %zmm3
vcmpps  $4, %zmm3, %zmm4, %k1
vrsqrt14ps  %zmm3, %zmm1{%k1}{z}
vmulps  %zmm3, %zmm1, %zmm3
vmulps  %zmm1, %zmm3, %zmm0
vmulps  %zmm5, %zmm3, %zmm3
vmovups (%rcx,%rax), %zmm1
vaddps  %zmm6, %zmm0, %zmm0
vmulps  %zmm3, %zmm0, %zmm0
vrcp14ps    %zmm0, %zmm2
vmulps  %zmm0, %zmm2, %zmm0
vmulps  %zmm0, %zmm2, %zmm0
vaddps  %zmm2, %zmm2, %zmm2
vsubps  %zmm0, %zmm2, %zmm0
vmulps  %zmm0, %zmm11, %zmm3
vmulps  %zmm12, %zmm3, %zmm3
vxorps  %zmm7, %zmm3, %zmm3
vmulps  %zmm1, %zmm3, %zmm2
vmulps  %zmm3, %zmm9, %zmm3
vfnmadd231ps    %zmm8, %zmm9, %zmm1
vfmadd231p

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #12 from Chris Elrod  ---
Created attachment 45363
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45363&action=edit
Fortran program for running benchmarks.

Okay, thank you.

I attached a Fortran program you can run to benchmark the code.
It randomly generates valid inputs, and then times running the code 10^5 times.
Finally, it reports the average time in microseconds.

The SIMD times are the vectorized version, and the transposed times are the
non-vectorized versions. In both cases, Flang produces much faster code.

The results seem in line with what I got benchmarking shared libraries from
Julia.
I linked rt for access to the high resolution clock.
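
For reference, the timing scheme is roughly the following (a C sketch of my
understanding; the attached program does the equivalent in Fortran with
system_clock and the librt high-resolution clock, and the function name here is
hypothetical):

```
#include <stdio.h>
#include <time.h>

/* Time `reps` calls of a kernel with a monotonic clock and report the
   mean in microseconds.  Link with -lrt on older glibc versions. */
void benchmark(void (*kernel)(void), long reps){
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < reps; i++)
    kernel();                      /* code under test */
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double us = (t1.tv_sec - t0.tv_sec)*1e6 + (t1.tv_nsec - t0.tv_nsec)*1e-3;
  printf("completed in %g microseconds (mean of %ld runs)\n", us/reps, reps);
}
```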


$ gfortran -Ofast -lrt -march=native -mprefer-vector-width=512
vectorization_tests.F90 -o gfortvectests

$ time ./gfortvectests 
 Transpose benchmark completed in   22.7799759
 SIMD benchmark completed in   1.34003162
 All are equal: F
 All are approximately equal: F
 Maximum relative error   8.27204276E-05
 First record X:   1.02466011 -0.689792156 -0.404027045
 First record Xt:   1.02465975 -0.689791918 -0.404026985
 Second record X: -0.546353579   3.37308086E-03   1.15257287
 Second record Xt: -0.546353400   3.37312138E-03   1.15257275

real0m2.418s
user0m2.412s
sys 0m0.003s

$ flang -Ofast -lrt -march=native -mprefer-vector-width=512
vectorization_tests.F90 -o flangvectests

$ time ./flangvectests 
 Transpose benchmark completed in   7.232568
 SIMD benchmark completed in   0.6596010
 All are equal:  F
 All are approximately equal:  F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542   1.568364   0.1006735
 First record Xt:   0.5867541   1.568363   0.1006735
 Second record X:   0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt:   0.2894785  -0.1510675  -9.3419187E-02

real0m0.801s
user0m0.794s
sys 0m0.005s

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-06 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #14 from Chris Elrod  ---
It's not really reproducible across runs:

$ time ./gfortvectests 
 Transpose benchmark completed in   22.7010765
 SIMD benchmark completed in   1.37529969
 All are equal: F
 All are approximately equal: F
 Maximum relative error   6.20566949E-04
 First record X:  0.188879877  0.377619117  -1.67841911E-02
 First record Xt:  0.10071  0.377619147  -1.67841911E-02
 Second record X:  -8.14126506E-02 -0.421755224 -0.199057430
 Second record Xt:  -8.14126655E-02 -0.421755224 -0.199057430

real0m2.414s
user0m2.406s
sys 0m0.005s

$ time ./flangvectests 
 Transpose benchmark completed in   7.630980
 SIMD benchmark completed in   0.6455200
 All are equal:  F
 All are approximately equal:  F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542   1.568364   0.1006735
 First record Xt:   0.5867541   1.568363   0.1006735
 Second record X:   0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt:   0.2894785  -0.1510675  -9.3419187E-02

real0m0.839s
user0m0.832s
sys 0m0.006s

$ time ./gfortvectests 
 Transpose benchmark completed in   22.0195961
 SIMD benchmark completed in   1.36087596
 All are equal: F
 All are approximately equal: F
 Maximum relative error   2.49150675E-04
 First record X: -0.284217566   2.13768221E-02 -0.475293010
 First record Xt: -0.284217596   2.13767942E-02 -0.475293040
 Second record X:   1.75664220E-02  -9.29893106E-02  -4.37139049E-02
 Second record Xt:   1.75664220E-02  -9.29893106E-02  -4.37139049E-02

real0m2.344s
user0m2.338s
sys 0m0.003s

$ time ./flangvectests 
 Transpose benchmark completed in   7.881181
 SIMD benchmark completed in   0.6132510
 All are equal:  F
 All are approximately equal:  F
 Maximum relative error   2.0917827E-04
 First record X:   0.5867542   1.568364   0.1006735
 First record Xt:   0.5867541   1.568363   0.1006735
 Second record X:   0.2894785  -0.1510675  -9.3419194E-02
 Second record Xt:   0.2894785  -0.1510675  -9.3419187E-02

real0m0.861s
user0m0.853s
sys 0m0.006s


It also probably wasn't quite right to call it "error", because it's
comparing the values from the scalar and vectorized versions. Although it is
unsettling if the differences are high; there should ideally be an exact match.

Back to Julia, using mpfr (set to 252 bits of precision), and rounding to
single precision for an exactly rounded answer...

X32gfort # calculated from gfortran
X32flang # calculated from flang
Xbf  # mpfr, 252-bit precision ("BigFloat" in Julia)

julia> Xbf32 = Float32.(Xbf) # correctly rounded result

julia> function ULP(x, correct) # calculates ULP error
           x == correct && return 0
           if x < correct
               error = 1
               while nextfloat(x, error) != correct
                   error += 1
               end
           else
               error = 1
               while prevfloat(x, error) != correct
                   error += 1
               end
           end
           error
       end
ULP (generic function with 1 method)

julia> ULP.(X32gfort, Xbf32)'
3×1024 Adjoint{Int64,Array{Int64,2}}:
 7  1  1  8  3  2  1  1  1  27  4  1  4  6  0  0  2  0  2  4  0  7  1  1  3  8 
4  2  2  …  1  0  2  0  0  1  2  3  1  5  1  1  0  0  0  2  3  2  1  2  3  1  0
 1  1  0  2  0  41
 4  2  1  1  6  1  0  1  1   2  2  0  0  3  0  1  0  3  1  1  0  1  1  0  0  3 
1  0  0 0  1  0  1  0  1  0  1  1  4  1  1  0  2  0  1  0  1  0  0  0  1  2
 1  1  1  0  0   1
 1  1  0  1  1  0  0  0  0   1  1  0  0  1  0  1  1  1  0  1  1  0  0  1  0  1 
0  0  0 0  0  1  0  0  0  0  0  1  0  0  1  1  1  0  0  1  0  1  1  0  1  1
 0  0  0  0  0   1

julia> mean(ans)
1.9462890625

julia> ULP.(X32flang, Xbf32)'
3×1024 Adjoint{Int64,Array{Int64,2}}:
 4  1  0  3  0  0  0  1  1  5  2  1  1  6  3  0  1  0  0  1  1  21  0  1  2  8 
2  3  0  0  …  1  1  1  15  2  1  1  5  1  1  1  0  0  0  0  0  2  1  3  1  1 
1  1  1  1  1  0  11
 3  1  1  0  1  0  0  1  0  0  1  0  0  2  1  1  1  6  0  0  0   2  1  0  1  4 
1  1  0  3 1  1  1   1  2  1  1  0  1  1  0  0  1  0  1  0  0  1  0  0  1 
1  1  0  1  0  0   0
 1  0  1  0  0  0  1  1  0  1  0  0  0  1  1  0  0  1  1  0  1   1  0  1  0  1 
0  0  1  0 0  0  1   0  0  0  0  0  0  2  0  0  0  0  0  1  1  1  1  0  1 
0  0  0  0  0  0   1

julia> mean(ans)
1.3388671875


So in that case, gfortran's version had about 1.95 ULP error on average, and
Flang about 1.34 ULP error.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-07 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #18 from Chris Elrod  ---
I can confirm that the inlined packing does allow gfortran to vectorize the
loop. So allowing packing to inline does seem (to me) like an optimization well
worth making.




However, performance seems to be about the same as before, still close to 2x
slower than Flang.


There is definitely something interesting going on in Flang's SLP
vectorization, though.

I defined the function:

#ifndef VECTORWIDTH
#define VECTORWIDTH 16
#endif

subroutine vpdbacksolve(Uix, x, S)

real, dimension(VECTORWIDTH,3)  ::  Uix
real, dimension(VECTORWIDTH,3), intent(in)  ::  x
real, dimension(VECTORWIDTH,6), intent(in)  ::  S

real, dimension(VECTORWIDTH)    ::  U11,  U12,  U22,  U13,  U23,  U33, &
                                    Ui11, Ui12, Ui22, Ui33

U33 = sqrt(S(:,6))

Ui33 = 1 / U33
U13 = S(:,4) * Ui33
U23 = S(:,5) * Ui33
U22 = sqrt(S(:,3) - U23**2)
Ui22 = 1 / U22
U12 = (S(:,2) - U13*U23) * Ui22
U11 = sqrt(S(:,1) - U12**2 - U13**2)

Ui11 = 1 / U11 ! u11
Ui12 = - U12 * Ui11 * Ui22 ! u12
Uix(:,3) = Ui33*x(:,3)
Uix(:,1) = Ui11*x(:,1) + Ui12*x(:,2) - (U13 * Ui11 + U23 * Ui12) * Uix(:,3)
Uix(:,2) = Ui22*x(:,2) - U23 * Ui22 * Uix(:,3)

end subroutine vpdbacksolve


in a .F90 file, so that VECTORWIDTH can be set appropriately while compiling.

I wanted to modify the Fortran file to benchmark these, but I'm pretty sure
Flang cheated in the benchmarks. So compiling into a shared library, and
benchmarking from Julia:

julia> @benchmark flangvtest($Uix, $x, $S)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time: 15.104 ns (0.00% GC)
  median time:  15.563 ns (0.00% GC)
  mean time:    16.017 ns (0.00% GC)
  maximum time: 49.524 ns (0.00% GC)
  --
  samples:  1
  evals/sample: 998

julia> @benchmark gfortvtest($Uix, $x, $S)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --
  minimum time: 24.394 ns (0.00% GC)
  median time:  24.562 ns (0.00% GC)
  mean time:    25.600 ns (0.00% GC)
  maximum time: 58.652 ns (0.00% GC)
  --
  samples:  1
  evals/sample: 996

That is over 60% faster for Flang, which would account for much, but not all,
of the runtime difference in the actual for loops.

For comparison, the vectorized loop in processbpp covers 16 samples per
iteration. The benchmarks above were with N = 1024, so 1024/16 = 64 iterations.

For the three gfortran benchmarks (each reporting the average of 100,000 runs of
the loop), that means each loop iteration averaged about
1000 * (1.34003162 + 1.37529969 + 1.36087596) / (3*64) = 21.230246197916664 ns.

For Flang, that was:
1000 * (0.6596010 + 0.6455200 + 0.6132510) / (3*64) = 9.99152083334 ns.

so we have about 21 vs 10 ns for the loop body in gfortran vs Flang,
respectively.


Comparing the asm between:
1. Flang processbpp loop body
2. Flang vpdbacksolve
3. gfortran processbpp loop body
4. gfortran vpdbacksolve

Here are a few things I notice.
1. gfortran always uses masked reciprocal square root operations, to make sure
it only takes the square root of non-negative (positive?) numbers:
vxorps  %xmm5, %xmm5, %xmm5
...
vmovups (%rsi,%rax), %zmm0
vmovups 0(%r13,%rax), %zmm9
vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}

This might be avx512f specific? 
Either way, Flang does not use masks:

vmovups (%rcx,%r14), %zmm4
vrsqrt14ps  %zmm4, %zmm5

I'm having a hard time finding any information on what the performance impact
of this may be.
Agner Fog's instruction tables, for example, don't mention mask arguments for
vrsqrt14ps.
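
In intrinsics terms, the two shapes being compared look roughly like this (my
sketch using AVX-512F intrinsics, compiled with -mavx512f or
-march=skylake-avx512; predicate 4 in vcmpps is _CMP_NEQ_UQ, i.e. keep lanes
where the input is not equal to zero):

```
#include <immintrin.h>

/* What the gfortran asm above does: zero-masked rsqrt, so lanes where
   the input is 0 produce 0 instead of +Inf. */
__m512 rsqrt14_masked(__m512 a){
  __mmask16 nonzero = _mm512_cmp_ps_mask(_mm512_setzero_ps(), a, _CMP_NEQ_UQ);
  return _mm512_maskz_rsqrt14_ps(nonzero, a);
}

/* What the Flang asm does: the plain estimate, no compare or mask. */
__m512 rsqrt14_plain(__m512 a){
  return _mm512_rsqrt14_ps(a);
}
```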

2. Within the loop body, Flang has 0 unnecessary vmov(u/a)ps. There are 8 total
plus 3 "vmuls" and 1 vfmsub231ps accessing memory, for the 12 expected per loop
iteration (fpdbacksolve's arguments are a vector of length 3 and another of
length 6; it returns a vector of length 3).

gfortran's loop body has 3 unnecessary vmovaps, copying register contents.

gfortran's vpdbacksolve subroutine has 4 unnecessary vmovaps, copying register
contents.

Flang's vpdbacksolve subroutine has 13 unnecessary vmovaps, and a couple of
unnecessary memory accesses. Ouch!
It also moves values to and from memory (the stack?):

vmovaps %zmm2, .BSS4+192(%rip)
...
vmovaps %zmm5, .BSS4+320(%rip)
...
vmovaps .BSS4+192(%rip), %zmm5
... #zmm5 is overwritten in here, I just mean to show the sort of stuff that
goes on
vmulps  .BSS4+320(%rip), %zmm5, %zmm0

Some of those moves also don't get used again, and some other things are just
plain weird:
vxorps  %xmm3, %xmm3, %xmm3
vfnmsub231ps    %zmm2, %zmm0, %zmm3 # zmm3 = -(zmm0 * zmm2) - zmm3
vmovaps %zmm3, .BSS4+576(%rip)

Like, why zero out the 128 bit portion of zmm3 ?
I verifie

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-21 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #19 from Chris Elrod  ---
To add a little more:
I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in
Julia. Without adding a Newton step, the answers are wrong beyond just a couple
significant digits.
With the Newton step, the answers are correct.

My point is that LLVM-compiled code (Clang/Flang/ispc) are definitely adding
the Newton step. They get the correct answer.

That leaves my best guess for the performance difference as owing to the masked
"vrsqrt14ps" that gcc is using:

vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea?
Edit the asm to remove the vcmpps and mask, compile the asm with gcc, and
benchmark it?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-21 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #20 from Chris Elrod  ---
To add a little more:
I used inline asm for direct access to the rsqrt instruction "vrsqrt14ps" in
Julia. Without adding a Newton step, the answers are wrong beyond just a couple
significant digits.
With the Newton step, the answers are correct.

My point is that LLVM-compiled code (Clang/Flang/ispc) are definitely adding
the Newton step. They get the correct answer.

That leaves my best guess for the performance difference as owing to the masked
"vrsqrt14ps" that gcc is using (g++ does this too):

vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}

Is there any way for me to test that idea?
Edit the asm to remove the vcmpps and mask, compile the asm with gcc, and
benchmark it?


Okay, I just tried playing around with flags and looking at asm.
I compiled with:

g++ -O3 -ffinite-math-only -fexcess-precision=fast -fno-math-errno
-fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math
-fno-rounding-math -fno-signaling-nans -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition -o
libgppvectorization_test.so  vectorization_test.cpp

which is basically all flags implied by "-ffast-math", except
"-funsafe-math-optimizations". This does include the flags implied by the
unsafe-math optimizations, just not that flag itself.

This list can be simplified to (only "-fno-math-errno" is needed):

g++ -O3 -fno-math-errno -march=native -shared -fPIC -mprefer-vector-width=512
-fno-semantic-interposition -o libgppvectorization_test.so 
vectorization_test.cpp

or

gfortran -O3 -fno-math-errno -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition -o
libgfortvectorization_test.so  vectorization_test.f90

This results in the following:

vsqrtps (%r8,%rax), %zmm0
vdivps  %zmm0, %zmm7, %zmm0

ie, vsqrt and a division, rather than the masked reciprocal square root.

With N = 2827, that speeds gfortran and g++ up from about 4.3 microseconds to
3.5 microseconds.
For comparison, Clang takes about 2 microseconds, and Flang, ispc, and some
awful-looking unsafe Rust take 2.3-2.4 microseconds, using vrsqrt14ps (without a
mask) and a Newton step instead of vsqrtps followed by a division.


So, "-funsafe-math-optimizations" results in a regression here.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #22 from Chris Elrod  ---
Okay. I did that, and the time went from about 4.25 microseconds down to 4.0
microseconds. So that is an improvement, but accounts for only a small part of
the difference with the LLVM-compilers.

-O3 -fno-math-errno

was about 3.5 microseconds, so -funsafe-math-optimizations still results in a
regression in this code.

3.5 microseconds is roughly as fast as you can get with vsqrt and div.

My best guess now is that gcc does a lot more to improve the accuracy of vsqrt.
If I understand correctly, these are all the involved instructions:

vmovaps .LC2(%rip), %zmm7
vmovaps .LC3(%rip), %zmm6
# for loop begins
vrsqrt14ps  %zmm1, %zmm2 # comparison and mask removed
vmulps  %zmm1, %zmm2, %zmm0
vmulps  %zmm2, %zmm0, %zmm1
vmulps  %zmm6, %zmm0, %zmm0
vaddps  %zmm7, %zmm1, %zmm1
vmulps  %zmm0, %zmm1, %zmm1
vrcp14ps    %zmm1, %zmm0
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm1
vaddps  %zmm0, %zmm0, %zmm0
vsubps  %zmm1, %zmm0, %zmm0
vfnmadd213ps    (%r10,%rax), %zmm0, %zmm2

If I understand this correctly (writing a for the input value and y for the
rsqrt14 estimate):

zmm2 = y = (approx) 1 / sqrt(a)
zmm0 = a * y = (approx) sqrt(a)
zmm1 = y * (a * y) = a * y * y = (approx) 1
zmm0 = constant6 * (a * y)
zmm1 = (a * y * y) + constant7   (the vaddps)
zmm1 = zmm0 * zmm1 = constant6 * a * y * (a * y * y + constant7)
     = (approx) sqrt(a), if constant6 = -0.5 and constant7 = -3.0
zmm0 = rcp14(zmm1) = (approx) 1 / sqrt(a)
zmm1 = zmm1 * zmm0; zmm1 = zmm1 * zmm0; zmm0 = 2 * zmm0 - zmm1
     = a Newton step refining that reciprocal

which would mean gcc is computing an approximate sqrt (with one Newton step) and
then an approximate reciprocal of it (with another Newton step), rather than
rsqrt directly?


LLVM seems to do a much simpler / briefer update of the output of vrsqrt.

When I implemented a vrsqrt intrinsic in a Julia library, I just looked at
Wikipedia and did (roughly):

constant1 = -0.5
constant2 = 1.5

zmm2 = (approx) 1 / sqrt(zmm1)
zmm3 = constant1 * zmm1
zmm1 = zmm2 * zmm2
zmm3 = zmm3 * zmm1 + constant2
zmm2 = zmm2 * zmm3
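
For reference, that same refinement written with AVX-512F intrinsics would look
roughly like this (my sketch of the textbook y*(1.5 - 0.5*a*y*y) step, not code
taken from either compiler):

```
#include <immintrin.h>

/* One Newton-Raphson step on the 14-bit estimate:
   y1 = y0 * (1.5 + (-0.5 * a) * y0 * y0)  ~=  1/sqrt(a) */
__m512 rsqrt14_nr(__m512 a){
  __m512 y          = _mm512_rsqrt14_ps(a);            /* ~14-bit estimate */
  __m512 neg_half_a = _mm512_mul_ps(_mm512_set1_ps(-0.5f), a);
  __m512 t          = _mm512_fmadd_ps(_mm512_mul_ps(y, y), neg_half_a,
                                      _mm512_set1_ps(1.5f)); /* 1.5 - 0.5*a*y*y */
  return _mm512_mul_ps(y, t);
}
```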


I am not a numerical analyst, so I can't comment on relative validities or
accuracies of these approaches.
I also don't know what LLVM 7+ does. LLVM 6 doesn't use vrsqrt.

I would be interested in reading explanations or discussions, if any are
available.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #24 from Chris Elrod  ---
The dump looks like this:

  vect__67.78_217 = SQRT (vect__213.77_225);
  vect_ui33_68.79_248 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0,
1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0
} / vect__67.78_217;
  vect__71.80_249 = vect__246.59_65 * vect_ui33_68.79_248;
  vect_u13_73.81_250 = vect__187.71_14 * vect_ui33_68.79_248;
  vect_u23_75.82_251 = vect__200.74_5 * vect_ui33_68.79_248;

so the vrsqrt optimization happens later. g++ shows the same problems with
weird code generation. However this:

 /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

does not match this:

vrsqrt14ps  %zmm1, %zmm2 # comparison and mask removed
vmulps  %zmm1, %zmm2, %zmm0
vmulps  %zmm2, %zmm0, %zmm1
vmulps  %zmm6, %zmm0, %zmm0
vaddps  %zmm7, %zmm1, %zmm1
vmulps  %zmm0, %zmm1, %zmm1
vrcp14ps    %zmm1, %zmm0
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm1
vaddps  %zmm0, %zmm0, %zmm0
vsubps  %zmm1, %zmm0, %zmm0

Recommendations on the next place to look for what's going on?

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #26 from Chris Elrod  ---
> You can try enabling -mrecip to see RSQRT in .optimized - there's
> probably late 1/sqrt optimization on RTL.

No luck. The full commands I used:

gfortran -Ofast -mrecip -S -fdump-tree-optimized -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition -o
gfortvectorizationdump.s  vectorization_test.f90

g++ -mrecip -Ofast -fdump-tree-optimized -S -march=native -shared -fPIC
-mprefer-vector-width=512 -fno-semantic-interposition -o
gppvectorization_test.s  vectorization_test.cpp

g++'s output was similar:

  vect_U33_60.31_372 = SQRT (vect_S33_59.30_371);
  vect_Ui33_61.32_374 = { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0,
1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0
} / vect_U33_60.31_372;
  vect_U13_62.33_375 = vect_S13_47.24_359 * vect_Ui33_61.32_374;
  vect_U23_63.34_376 = vect_S23_53.27_365 * vect_Ui33_61.32_374;

and it has the same assembly as gfortran for the rsqrt:

vcmpps  $4, %zmm0, %zmm5, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}
vmulps  %zmm0, %zmm1, %zmm2
vmulps  %zmm1, %zmm2, %zmm0
vmulps  %zmm6, %zmm2, %zmm2
vaddps  %zmm7, %zmm0, %zmm0
vmulps  %zmm2, %zmm0, %zmm0
vrcp14ps    %zmm0, %zmm10
vmulps  %zmm0, %zmm10, %zmm0
vmulps  %zmm0, %zmm10, %zmm0
vaddps  %zmm10, %zmm10, %zmm10
vsubps  %zmm0, %zmm10, %zmm10

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #27 from Chris Elrod  ---
g++ -mrecip=all -O3  -fno-signed-zeros -fassociative-math -freciprocal-math
-fno-math-errno -ffinite-math-only -fno-trapping-math -fdump-tree-optimized -S
-march=native -shared -fPIC -mprefer-vector-width=512
-fno-semantic-interposition -o gppvectorization_test.s  vectorization_test.cpp

is not enough to get vrsqrt. I need -funsafe-math-optimizations for the
instruction to appear in the asm.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #28 from Chris Elrod  ---
Created attachment 45501
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45501&action=edit
Minimum working example of the rsqrt problem. Can be compiled with: gcc -Ofast
-S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC rsqrt.c -o
rsqrt.s

I attached a minimum working example, demonstrating the problem of excessive
code generation for reciprocal square root, in the file rsqrt.c.
You can compile with:

gcc -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC
rsqrt.c -o rsqrt.s

clang -Ofast -S -march=skylake-avx512 -mprefer-vector-width=512 -shared -fPIC
rsqrt.c -o rsqrt.s

Or compare the asm of both on Godbolt: https://godbolt.org/z/c7Z0En

For gcc:

vmovups (%rsi), %zmm0
vxorps  %xmm1, %xmm1, %xmm1
vcmpps  $4, %zmm0, %zmm1, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}
vmulps  %zmm0, %zmm1, %zmm2
vmulps  %zmm1, %zmm2, %zmm0
vmulps  .LC1(%rip), %zmm2, %zmm2
vaddps  .LC0(%rip), %zmm0, %zmm0
vmulps  %zmm2, %zmm0, %zmm0
vrcp14ps    %zmm0, %zmm1
vmulps  %zmm0, %zmm1, %zmm0
vmulps  %zmm0, %zmm1, %zmm0
vaddps  %zmm1, %zmm1, %zmm1
vsubps  %zmm0, %zmm1, %zmm0
vmovups %zmm0, (%rdi)

for Clang:

vmovups (%rsi), %zmm0
vrsqrt14ps  %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm0
vfmadd213ps .LCPI0_0(%rip){1to16}, %zmm1, %zmm0 # zmm0 = (zmm1 * zmm0) + mem
vmulps  .LCPI0_1(%rip){1to16}, %zmm1, %zmm1
vmulps  %zmm0, %zmm1, %zmm0
vmovups %zmm0, (%rdi)

Clang looks like it is doing
 /* rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0)
*/

where .LCPI0_0(%rip) = -3.0 and .LCPI0_1(%rip) = -0.5.
gcc is doing much more, and something fairly different.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #30 from Chris Elrod  ---
(In reply to Marc Glisse from comment #29)
> The main difference I can see is that clang computes rsqrt directly, while
> gcc first computes sqrt and then computes the inverse. Also gcc seems afraid
> of getting NaN for sqrt(0) so it masks out this value. ix86_emit_swsqrtsf in
> gcc/config/i386/i386.c seems like a good place to look at.

gcc calculates the rsqrt directly with -funsafe-math-optimizations and a couple
of other flags (or just -ffast-math):

vmovups (%rsi), %zmm0
vxorps  %xmm1, %xmm1, %xmm1
vcmpps  $4, %zmm0, %zmm1, %k1
vrsqrt14ps  %zmm0, %zmm1{%k1}{z}

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #32 from Chris Elrod  ---
(In reply to Marc Glisse from comment #31)
> (In reply to Chris Elrod from comment #30)
> > gcc caclulates the rsqrt directly
> 
> No, vrsqrt14ps is just the first step in calculating sqrt here (slightly
> different formula than rsqrt). vrcp14ps shows that it is computing an
> inverse later. What we need to understand is why gcc doesn't try to generate
> rsqrt (which would also have vrsqrt14ps, but a slightly different formula
> without the comparison with 0 and masking, and without needing an inversion
> afterwards).

Okay, I think I follow you. You're saying instead of doing this (from
rguenther), which we want (also without the comparison to 0 and masking, as you
note):

 /* rsqrt(a) = -0.5 * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

it is doing this, which also uses the rsqrt instruction:

 /* sqrt(a)  = -0.5 * a * rsqrtss(a) * (a * rsqrtss(a) * rsqrtss(a) - 3.0) */

and then calculating an inverse approximation of that?

The approximate sqrt followed by the approximate reciprocal was slower on my
computer than just vsqrt followed by a div.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-01-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #35 from Chris Elrod  ---
> rsqrt:
> .LFB12:
> .cfi_startproc
> vrsqrt28ps  (%rsi), %zmm0
> vmovups %zmm0, (%rdi)
> vzeroupper
> ret
> 
> (huh?  isn't there a NR step missing?)
> 


I assume because vrsqrt28ps is much more accurate than vrsqrt14ps, it wasn't
considered necessary. Unfortunately, march=skylake-avx512 does not have
-mavx512er, and therefore should use the less accurate vrsqrt14ps + NR step.

I think vrsqrt14pd/s are -mavx512f or -mavx512vl

> Without -mavx512er, we do not have an expander for rsqrtv16sf2, and without 
> that I don't know how the machinery can guess how to use rsqrt (there are 
> probably ways).

Looking at the asm from only r[i] = sqrtf(a[i]):

vmovups (%rsi), %zmm1
vxorps  %xmm0, %xmm0, %xmm0
vcmpps  $4, %zmm1, %zmm0, %k1
vrsqrt14ps  %zmm1, %zmm0{%k1}{z}
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm0, %zmm1, %zmm0
vmulps  .LC1(%rip), %zmm1, %zmm1
vaddps  .LC0(%rip), %zmm0, %zmm0
vmulps  %zmm1, %zmm0, %zmm0
vmovups %zmm0, (%rdi)

vs the asm from only r[i] = 1 /a[i]:

vmovups (%rsi), %zmm1
vrcp14ps    %zmm1, %zmm0
vmulps  %zmm1, %zmm0, %zmm1
vmulps  %zmm1, %zmm0, %zmm1
vaddps  %zmm0, %zmm0, %zmm0
vsubps  %zmm1, %zmm0, %zmm0
vmovups %zmm0, (%rdi)

it looks like the expander is there for sqrt, and for inverse, and we're just
getting both one after the other. So it does look like I could benchmark which
one is slower than the regular instruction on my platform, if that would be
useful.

[Bug tree-optimization/88713] Vectorized code slow vs. flang

2019-02-12 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713

--- Comment #54 from Chris Elrod  ---
I commented elsewhere, but I built trunk a few days ago with H.J.Lu's patches
(attached here) and Thomas Koenig's inlining patches.
With these patches, g++ and all versions of the Fortran code produced excellent
asm, and the code performed excellently in benchmarks.

Once those are merged, the problems reported here will be solved.

I saw Thomas Koenig's packing changes will wait for gcc-10.
What about H.J.Lu's fixes to rsqrt and allowing FMA use in those sections?

[Bug rtl-optimization/86625] New: funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-21 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

Bug ID: 86625
   Summary: funroll-loops doesn't unroll, producing >3x assembly
and running 10x slower than manual complete unrolling
   Product: gcc
   Version: 8.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: elrodc at gmail dot com
  Target Milestone: ---

I wasn't sure where to put this.
I posted in the Fortran gcc mailing list initially, but was redirected to
bugzilla.
I specified RTL-optimization as the component, because the manually unrolled
version avoids register spills yet has 13 (unnecessary?) vmovapd instructions
between registers, and the loop version is a behemoth of moving data in, out,
and between registers.

The failure of the loop might also fall under tree optimization?

Because of all that register shuffling, completely unrolling the loop actually
results in over 3x less assembly than the loop. Unfortunately, -funroll-loops
did not completely unroll it, making the manual unrolling necessary.
The assembly is identical whether or not -funroll-loops is used.
Adding the directive: 
   !GCC$ unroll 31
does lead to complete unrolling, but also use of xmm registers instead of zmm,
and thus massive amounts of spilling (and probably extremely slow code -- did
not benchmark).

Here is the code (a 16x32 * 32x14 matrix multiplication kernel for avx-512 [the
32 is arbitrary]), sans directive:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90

I compiled with:
gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops
-S -shared -fPIC kernels.f90 -o kernels.s

resulting in this assembly (without the directive):
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s



The manually unrolled version has 13 vmovapd instructions that look unnecessary
(like a vfmadd should've been able to place the answer in the correct
location?). 8 of them move from one register to another, and 5 look something
like:
vmovapd %zmm20, 136(%rsp)


I suspect there should ideally be 0 of these?
If not, I'd be interested in learning more about why.
This at least seems like an RTL optimization bug/question.

The rest of the generated code looks great to me. Repeated blocks of only:
2x vmovupd
7x vbroadcastsd
14x vfmadd231pd



In the looped code, however, the `vfmadd231pd` instructions are a rare sight
between all the register management. The loop code begins at line 1475 in the
assembly file.

While the manually unrolled code benchmarked at 135ns, the looped version took
1.4 microseconds on my computer.

Trying to understand more about what it's doing:
- While the manually unrolled code has the expected 868 = (16/8)*(32-1)*14
vfmadds for the fully unrolled code, the looped version has two blocks of 224 =
(16/8)*X*14, where X = 8, indicating it is partially unrolling the loop.
One of them is using xmm registers instead of zmm, so it looks like the
compiler mistakenly thinks smaller vectors may be needed to clean up something?

(Maybe it is trying to vectorize across loop iterations, rather than within, in
some weird way? I don't know why it'd be using all those vpermt2pd, otherwise.)

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-22 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

--- Comment #2 from Chris Elrod  ---
Created attachment 44418
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44418&action=edit
Code to reproduce slow vectorization pattern and unnecessary loads & stores

(Sorry if this goes to the bottom instead of top, trying to attach a file in
place of a link, but I can't edit the old comment.)

Attached is sample code to reproduce the problem in gcc 8.1.1
As observed by amonakov, compiling with -O3/-Ofast reproduces the full problem,
eg:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512 -funroll-loops
-S kernels.f90 -o kernels.s

Compiling with -O3 -fdisable-tree-cunrolli or -O2 -ftree-vectorize fixes the
incorrect vectorization pattern, but leaves a lot of unnecessary broadcast loads
and stores.

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

--- Comment #4 from Chris Elrod  ---
Created attachment 44423
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44423&action=edit
8x16 * 16x6 kernel for avx2.

Here is a scaled-down version to reproduce most of the problem for
avx2-capable architectures.
I just used -march=haswell, but I think most recent architectures fall under
this.
For some, like znver1, you may need to add -mprefer-vector-width=256.


To get the inefficiently vectorized loop:

gfortran -march=haswell -Ofast -shared -fPIC -S kernelsavx2.f90 -o
kernelsavx2bad.s

To get only the unnecessary loads/stores, use:

gfortran -march=haswell -O2 -ftree-vectorize -shared -fPIC -S kernelsavx2.f90
-o kernelsavx2.s

This file compiles instantly, while with `-O3` the other one can take a couple
of seconds.
However, while it does `vmovapd` between registers, the manually unrolled
version no longer spills onto the stack the way the avx512 kernel does.

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

--- Comment #5 from Chris Elrod  ---
Created attachment 44424
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44424&action=edit
Smaller avx512 kernel that still spills into the stack

This generated 18 total `vmovapd` (I think there'd ideally be 0) when compiled
with:

gfortran -march=skylake-avx512 -mprefer-vector-width=512 -O2 -ftree-vectorize
-shared -fPIC -S kernels16x32x13.f90 -o kernels16x32x13.s

4 of which moved onto the stack, and one moved from the stack back into a
register.
(The others were transferred from the stack within vfmadd instructions:
`vfmadd213pd 72(%rsp), %zmm11, %zmm15`
)


Similar to the larger kernel, using `-O3` instead of `-O2 -ftree-vectorize`
eliminated two of the `vmovapd` instructions between registers, but none of the
spills.

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

--- Comment #6 from Chris Elrod  ---
(In reply to Richard Biener from comment #3)
> If you see spilling on the manually unrolled loop register pressure is
> somehow an issue.

In the matmul kernel:
D = A * X
where D is 16x14, A is 16xN, and X is Nx14 (N arbitrarily set to 32)

The code holds all of D in registers.
16x14 doubles, and 8 doubles per register mean 28 of the 32 registers.

Then, it loads 1 column of A at a time (2 more registers), and broadcasts
elements from the corresponding row in each column of X, updating the
corresponding column of D with fma instructions.

By broadcasting 2 at a time, it should be using exactly 32 registers.
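
As a rough C sketch of that register-blocking scheme (dimensions hard-coded for
illustration; this mirrors the structure of the attached Fortran kernel, not its
exact code, and ignores the 2-at-a-time broadcasting):

```
enum { M = 16, K = 32, P = 14 };

/* D = A * X, all column-major: D is MxP, A is MxK, X is KxP.
   The accumulator block acc[][] is what should live in 28 zmm registers;
   each k-iteration streams one column of A and broadcasts X(k, p). */
void kernel(double* restrict D, const double* restrict A,
            const double* restrict X){
  double acc[M][P] = {{0.0}};
  for (int k = 0; k < K; k++){
    const double* Acol = &A[k * M];      /* one 16-element column of A */
    for (int p = 0; p < P; p++){
      double xb = X[k + p * K];          /* broadcast X(k, p) */
      for (int m = 0; m < M; m++)
        acc[m][p] += Acol[m] * xb;       /* fma into the register block */
    }
  }
  for (int p = 0; p < P; p++)
    for (int m = 0; m < M; m++)
      D[m + p * M] = acc[m][p];
}
```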

For the most part, that is precisely what the manually unrolled code is doing
for each column of A.
However, for column 23 (2944/128 = 23) with -O3 and column 25 for -O2 of the 32
columns of A, it suddenly spills (all the stack accesses happen for the same
column, and none of the others), even though the process is identical for each
column.
Switching to a smaller 16x13 output, freeing up 2 registers to allow 4
broadcast loads at a time, still resulted in 4 spills (down from 5) for only
column #23 or #25.

I couldn't reproduce the spills in the avx2 kernel.
The smaller kernel has an 8x6 output, taking up 12 registers. Again leaving 4
total registers, 2 for a column of A, and 2 broadcasts from X at a time. So
it's the same pattern.


The smaller kernel does reproduce the problems with the loops. Both -O3 without
`-fdisable-tree-cunrolli` leading to a slow vectorization scheme, and with it
or `-O2 -ftree-vectorize` producing repetitive loads and stores within the
loop.

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

2018-07-23 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625

--- Comment #7 from Chris Elrod  ---
(In reply to Chris Elrod from comment #6)
> However, for column 23 (2944/128 = 23) with -O3 and column 25 for -O2 of the
> 32 columns of A

Correction: it was the 16x13 version that used stack data after loading column
25 instead of 23 of A.

[Bug fortran/57992] Pointless packing of contiguous arrays for simply contiguous functions results as actual arguments

2018-11-15 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57992

Chris Elrod  changed:

   What|Removed |Added

 CC||elrodc at gmail dot com

--- Comment #3 from Chris Elrod  ---
Created attachment 45014
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45014&action=edit
Code that produces lots of unnecessary and performance-crippling
_gfortran_internal_pack@PLT and _gfortran_internal_unpack@PLT

Code that produces lots of unnecessary and performance-crippling
_gfortran_internal_pack@PLT and _gfortran_internal_unpack@PLT:

I compiled with:

```
gfortran -S -Ofast -fno-repack-arrays -fdisable-tree-cunrolli
-fno-semantic-interposition -march=skylake-avx512 -mprefer-vector-width=512
-mveclibabi=svml -shared -fPIC -finline-limit=8192
gfortran_internal_pack_test.f90 -o gfortran_internal_pack_test.s
```

using

$ gfortran --version
GNU Fortran (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)

[Bug fortran/57992] Pointless packing of contiguous arrays for simply contiguous functions results as actual arguments

2018-11-15 Thread elrodc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57992

--- Comment #4 from Chris Elrod  ---
Created attachment 45016
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45016&action=edit
Assembly from compiling gfortran_internal_pack_test.f90

The code takes in sets of length-3 vectors and 3x3 symmetric positive definite
matrices (storing only the upper triangle). These are stored across columns.
That is, element 1 of the first and second vectors are stored contiguously,
while elements 1 and 2 of each vector are a stride apart.

The goal is to factor each PD matrix into S = U*U' (not the Cholesky), and then
compute U^{-1} * x.

There is a function that operates on one vector and matrix at a time
(pdbacksolve).
Another function operates on blocks of 16 at a time (vpdbacksolve).

Three versions of functions operate on these:
Version 1 simply loops over the inputs, calling the scalar version.

Version 2 loops over blocks of 16 at a time, calling the blocked version.

Version 3 manually inlines the function into the do loop.

I used compiler options to ensure that all the functions were inlined into
callers, so that ideally Version 2 and Version 3 would be identical.
Attached assembly shows that they are not.

Letting N = 1024 total vectors and matrices, on my computer Version 1 takes 97
microseconds, Version 2 takes 35 microseconds, and Version 3 takes 1.4
microseconds.
These differences are dramatic!
Version 1 failed to vectorize and was littered with _gfortran_internal_pack@PLT
and _gfortran_internal_unpack@PLT. Version 2 vectorized, but also had all the
pack/unpacks. Version 3 had neither.
Data layout was the same (and optimal for vectorization) in all three cases.

[Also worth pointing out that without -fdisable-tree-cunrolli, version 3 takes
9 microseconds.]

For what it is worth, ifort takes 0.82, 1.5, and 0.88 microseconds
respectively. 

I'd hope it is possible for gfortran's Version 1 and Version 2 to match its
Version 3 (1.4 microseconds) rather than being 70x and 25x slower. 1.4
microseconds is a good time, and the best I managed to achieve with explicit
vectorization in Julia.
I could file a separate bug report, since the failed vectorization of Version 1
is probably a different issue, but this is another example of unnecessary
packing/unpacking.

[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems

2022-05-30 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929

Chris Elrod  changed:

   What|Removed |Added

 CC||elrodc at gmail dot com

--- Comment #29 from Chris Elrod  ---
"RESOLVED FIXED". I haven't tried this with `target`, but avx512bw does not
work with target_clones with gcc 11.2, but it does with clang 14.

[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems

2022-05-30 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929

--- Comment #30 from Chris Elrod  ---
> #if defined(__clang__)
> #define MULTIVERSION                                                        \
>   __attribute__((target_clones("avx512dq", "avx2", "default")))
> #else
> #define MULTIVERSION                                                        \
>   __attribute__((target_clones(                                             \
>       "arch=skylake-avx512,arch=cascadelake,arch=icelake-client,arch="      \
>       "tigerlake,"                                                          \
>       "arch=icelake-server,arch=sapphirerapids,arch=cooperlake",            \
>       "avx2", "default")))
> #endif

For example, I can do something like this, but gcc produces a ton of
unnecessary duplicates for each of the avx512dq architectures. There must be a
better way.

[Bug target/89929] __attribute__((target("avx512bw"))) doesn't work on non avx512bw systems

2022-05-30 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89929

--- Comment #32 from Chris Elrod  ---
Ha, I accidentally misreported my gcc version. I was already using 12.1.1.

Using x86-64-v4 worked, excellent! Thanks.
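
For anyone else landing here, the shape of the simplification is as follows (a
sketch based on this thread, not a quote from it; the function and the extra
x86-64-v3 clone are my own additions):

```
// One clone per micro-architecture level instead of one per -march=.
// x86-64-v4 implies the AVX-512 F/BW/DQ/VL baseline, so it stands in for the
// long arch=skylake-avx512,... list in the macro above.
__attribute__((target_clones("arch=x86-64-v4", "arch=x86-64-v3", "default")))
double sum(const double *x, long n) {
  double s = 0.0;
  for (long i = 0; i < n; ++i) s += x[i];
  return s;
}
```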

[Bug target/114276] New: Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong`

2024-03-07 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276

Bug ID: 114276
   Summary: Trapping on aligned operations when using vector
builtins + `-std=gnu++23 -fsanitize=address
-fstack-protector-strong`
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: elrodc at gmail dot com
  Target Milestone: ---

Created attachment 57651
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57651&action=edit
test file

I'm not sure how to categorize the issue, so I picked "target" as it occurs for
x86_64 when using aligned moves on 64-byte avx512 vectors.

`-std=c++23` also reproduces the problem.
I am using:

> g++ --version
> g++ (GCC) 13.2.1 20231205 (Red Hat 13.2.1-6)
> Copyright (C) 2023 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The attached file is:

> #include <cstddef>
> #include <cstdint>
> 
> template <ptrdiff_t W, typename T>
> using Vec [[gnu::vector_size(W * sizeof(T))]] = T;
> 
> auto foo() {
>   Vec<8, int64_t> ret{};
>   return ret;
> }
> 
> int main() {
>   foo();
>   return 0;
> }

I have attached this file.

On a skylake-avx512 CPU, I get

> g++ -std=gnu++23 -march=skylake-avx512 -fstack-protector-strong -O0 -g 
> -mprefer-vector-width=512 -fsanitize=address,undefined -fsanitize-trap=all 
> simdvecalign.cpp && ./a.out
AddressSanitizer:DEADLYSIGNAL
=
==36238==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0040125c bp
0x7ffdf88a1cb0 sp 0x7ffdf88a1bc0 T0)
==36238==The signal is caused by a READ memory access.
==36238==Hint: this fault was caused by a dereference of a high value address
(see register values below).  Disassemble the provided pc to learn which
register was used.
#0 0x40125c in foo()
/home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:8
#1 0x4012d1 in main
/home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:13
#2 0x7f296b846149 in __libc_start_call_main (/lib64/libc.so.6+0x28149)
(BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
#3 0x7f296b84620a in __libc_start_main_impl (/lib64/libc.so.6+0x2820a)
(BuildId: 7ea8d85df0e89b90c63ac7ed2b3578b2e7728756)
#4 0x4010a4 in _start
(/home/chriselrod/Documents/progwork/cxx/experiments/a.out+0x4010a4) (BuildId:
765272b0173968b14f4306c8d4a37fcb18733889)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV
/home/chriselrod/Documents/progwork/cxx/experiments/simdvecalign.cpp:8 in foo()
==36238==ABORTING
fish: Job 1, './a.out' terminated by signal SIGABRT (Abort)

However, if I remove any of `-std=gnu++23`, `-fsanitize=address`, or
`-fstack-protector-strong`, the code runs without a problem.

Using 32 byte vectors instead of 64 byte also allows it to work.

I also used `-S` to look at the assembly.

When I edit the two lines:
>   vmovdqa64   %zmm0, -128(%rdx)
>   .loc 1 9 10
>   vmovdqa64   -128(%rdx), %zmm0

swapping `vmovdqa64` for `vmovdqu64`, the code runs as intended.

> g++ -fsanitize=address simdvecalign.s # using vmovdqu64
> ./a.out
> g++ -fsanitize=address simdvecalign.s # reverted back to vmovdqa64
> ./a.out
AddressSanitizer:DEADLYSIGNAL
=
==40364==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0040125c bp
0x7ffd2e2dc240 sp 0x7ffd2e2dc140 T0)

so I am inclined to think that something isn't guaranteeing that `%rdx` is
actually 64-byte aligned (but it may be 32-byte aligned, given that I can't
reproduce with 32 byte vectors).
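
One way to check that hypothesis (a diagnostic sketch of mine, not part of the
reproducer; `report_alignment` is a made-up helper) is to print the alignment
of the object's address:

```
#include <cstdint>
#include <cstdio>

// Print how an object's address is aligned; addr % 64 == 0 is what the
// vmovdqa64 accesses require.
template <typename T> void report_alignment(const T &obj, const char *name) {
  auto addr = reinterpret_cast<std::uintptr_t>(&obj);
  std::printf("%s: addr %% 64 = %u\n", name, unsigned(addr % 64));
}
```

Running this on `ret` inside `foo` in the configurations that do not crash
would show whether the 64-byte alignment of the stack slot is what changes when
`-fstack-protector-strong` and `-fsanitize=address` are combined.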

[Bug target/114276] Trapping on aligned operations when using vector builtins + `-std=gnu++23 -fsanitize=address -fstack-protector-strong`

2024-03-07 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276

--- Comment #1 from Chris Elrod  ---
Created attachment 57652
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57652&action=edit
assembly from adding `-S`

[Bug target/110027] Misaligned vector store on detect_stack_use_after_return

2024-03-08 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110027

--- Comment #9 from Chris Elrod  ---
> Interestingly this seems to be only reproducible on Arch Linux. Other gcc 
> 13.1.1 builds, Fedora for instance, seem to behave correctly. 

I haven't tried that reproducer on Fedora with gcc 13.2.1, which could have
regressed since 13.1.1.
However, the dup example in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114276
does reproduce on Fedora with gcc-13.2.1 once you add extra compile flags
`-std=c++23 -fstack-protector-strong`.
I'll try the original reproducer later; it may be the case that adding/removing
these flags fuzzes the alignment.

[Bug c++/111493] New: [concepts] multidimensional subscript operator inside requires is broken

2023-09-20 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111493

Bug ID: 111493
   Summary: [concepts] multidimensional subscript operator inside
requires is broken
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: elrodc at gmail dot com
  Target Milestone: ---

Two example programs:

>  #include 
>  constexpr auto foo(const auto &A, int i, int j)
>    requires(requires(decltype(A) a, int ii) { a[ii, ii]; }) {
>    return A[i, j];
>  }
>  constexpr auto foo(const auto &A, int i, int j) {
>    return A + i + j;
>  }
>  static_assert(foo(2,3,4) == 9);


>  #include <concepts>
>  template <typename T>
>  concept CartesianIndexable = requires(T t, int i) {
>    { t[i, i] } -> std::convertible_to<T>;
>  };
>  static_assert(!CartesianIndexable<int>);

These result in errors of the form

  error: invalid types 'const int[int]' for array subscript

Here is godbolt for reference: https://godbolt.org/z/WE66nY8zG

The invalid subscript should simply make the `requires` constraint unsatisfied,
not produce a hard error.
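
For contrast, the one-dimensional analogue shows the intended semantics (a
minimal sketch of what "failing" should look like, written for this report
rather than taken from it):

```
#include <concepts>

template <typename T>
concept Indexable = requires(T t, int i) {
  { t[i] } -> std::convertible_to<int>;
};

// int has no subscript operator, so the requirement is simply unsatisfied and
// the concept evaluates to false -- no hard error.
static_assert(!Indexable<int>);
```

The multidimensional form `t[i, i]` should be treated the same way.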

[Bug c++/111493] [concepts] multidimensional subscript operator inside requires is broken

2023-09-20 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111493

--- Comment #2 from Chris Elrod  ---
Note that it also shows up in gcc-13. I put gcc-14 as the version to indicate
that I confirmed it is still a problem on latest trunk. Not sure what the
policy is on which version we should report.

[Bug c++/93008] Need a way to make inlining heuristics ignore whether a function is inline

2024-05-05 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93008

--- Comment #14 from Chris Elrod  ---
To me, an "inline" function is one that the compiler inlines.
It just happens that the `inline` keyword also brings comdat semantics, and
possibly hiding of the symbol to make it internal (-fvisibility-inlines-hidden).
It also just happens to be the case that the vast majority of the time I mark a
function `inline`, it is because of this, not because of the compiler hint.
`static` of course also specifies internal linkage, but I generally prefer the
comdat semantics: I'd rather merge than duplicate the definitions.

If there is a new keyword or pragma meaning comdat semantics (and preferably
also specifying internal linkage), I would rather have the name reference that.

I'd rather have a name that says positively what it does than one that says
what it does not: "quasi_inline: like inline, except it does everything inline
does except the inline part".
Why define it as a set difference -- naming it after the thing it does not do!
-- when you could define it in the affirmative, based on what it does in the
first place?
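
To make the distinction concrete (my own minimal sketch, not something from
this report):

```
// hypothetical header included from several translation units

// `inline`: the optimizer hint plus comdat semantics; each TU may emit a
// copy, and the linker merges them into a single definition.
inline int add_one(int x) { return x + 1; }

// `static`: internal linkage; every including TU gets its own private copy,
// duplicating the definition instead of merging it.
static int add_two(int x) { return x + 2; }
```

(`-fvisibility-inlines-hidden` additionally gives inline member functions
hidden visibility, keeping them out of the dynamic symbol table.)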

[Bug tree-optimization/112824] New: Stack spills and vector splitting with vector builtins

2023-12-02 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

Bug ID: 112824
   Summary: Stack spills and vector splitting with vector builtins
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: elrodc at gmail dot com
  Target Milestone: ---

I am not sure which component to place this under, but selected
tree-optimization as I suspect this is some sort of alias analysis failure
preventing the removal of stack allocations.

Godbolt link, reproduces on GCC trunk and 13.2:
https://godbolt.org/z/4TPx17Mbn
Clang has similar problems in my actual test case, but they don't show up in
this minimal example I made. Although Clang isn't perfect here either: it fails
to fuse fmadd + masked vmovapd, while GCC does succeed in fusing them.

For reference, code behind the godbolt link is:

#include <bit>
#include <concepts>
#include <cstddef>
#include <cstdint>

template <ptrdiff_t W, typename T>
using Vec [[gnu::vector_size(W * sizeof(T))]] = T;


// Omitted: 16 without AVX, 32 without AVX512F,
// or for forward compatibility some AVX10 may also mean 32-only
static constexpr ptrdiff_t VectorBytes = 64;
template <typename T>
static constexpr ptrdiff_t VecWidth = 64 <= sizeof(T) ? 1 : 64/sizeof(T);

template <typename T, ptrdiff_t N> struct Vector{
  static constexpr ptrdiff_t L = N;
  T data[L];
  static constexpr auto size()->ptrdiff_t{return N;}
};
template <std::floating_point T, ptrdiff_t N> struct Vector<T, N>{
  static constexpr ptrdiff_t W = N >= VecWidth<T> ? VecWidth<T> :
    ptrdiff_t(std::bit_ceil(size_t(N)));
  static constexpr ptrdiff_t L = (N/W) + ((N%W)!=0);
  using V = Vec<W, T>;
  V data[L];
  static constexpr auto size()->ptrdiff_t{return N;}
};
/// should be trivially copyable
/// codegen is worse when passing by value, even though it seems like it should
/// make aliasing simpler to analyze?
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(Vector<T, N> x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n)
    z.data[n] = x.data[n] + y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator*(Vector<T, N> x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n)
    z.data[n] = x.data[n] * y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(T x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x + y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator*(T x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x * y.data[n];
  return z;
}



template <typename T, ptrdiff_t N> struct Dual {
  T value;
  Vector<T, N> partials;
};
// Here we have a specialization for non-power-of-2 `N`
template <typename T, ptrdiff_t N>
requires(std::floating_point<T> && (std::popcount(size_t(N)) > 1))
struct Dual<T, N> {
  Vector<T, N + 1> data;
};


template <ptrdiff_t W, typename T>
consteval auto firstoff(){
  static_assert(std::same_as<T, double>, "type not implemented");
  if constexpr (W==2) return Vec<2,int64_t>{0,1} != 0;
  else if constexpr (W == 4) return Vec<4,int64_t>{0,1,2,3} != 0;
  else if constexpr (W == 8) return Vec<8,int64_t>{0,1,2,3,4,5,6,7} != 0;
  else static_assert(false, "vector width not implemented");
}

template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(Dual<T, N> a, Dual<T, N> b)
  -> Dual<T, N> {
  if constexpr (std::floating_point<T> && (std::popcount(size_t(N)) > 1)){
    Dual<T, N> c;
    for (ptrdiff_t l = 0; l < Vector<T, N + 1>::L; ++l)
      c.data.data[l] = a.data.data[l] + b.data.data[l];
    return c;
  } else return {a.value + b.value, a.partials + b.partials};
}

template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator*(Dual<T, N> a, Dual<T, N> b)
  -> Dual<T, N> {
  if constexpr (std::floating_point<T> && (std::popcount(size_t(N)) > 1)){
    using V = typename Vector<T, N + 1>::V;
    V va = V{} + a.data.data[0][0], vb = V{} + b.data.data[0][0];
    V x = va * b.data.data[0];
    Dual<T, N> c;
    c.data.data[0] = firstoff<Vector<T, N + 1>::W, T>() ? x + vb*a.data.data[0] : x;
    for (ptrdiff_t l = 1; l < Vector<T, N + 1>::L; ++l)
      c.data.data[l] = va*b.data.data[l] + vb*a.data.data[l];
    return c;
  } else return {a.value * b.value, a.value * b.partials + b.value * a.partials};
}

void prod(Dual<Dual<double, 7>, 2> &c, const Dual<Dual<double, 7>, 2> &a,
          const Dual<Dual<double, 7>, 2> &b){
  c = a*b;
}
void prod(Dual<Dual<double, 8>, 2> &c, const Dual<Dual<double, 8>, 2> &a,
          const Dual<Dual<double, 8>, 2> &b){
  c = a*b;
}


GCC 13.2 asm, when compiling with
-std=gnu++23 -march=skylake-avx512 -mprefer-vector-width=512 -O3


prod(Dual<Dual<double, 7l>, 2l>&, Dual<Dual<double, 7l>, 2l> const&,
Dual<Dual<double, 7l>, 2l> const&):
push rbp
mov eax, -2
kmovb   k1, eax
mov rbp, rsp
and rsp, -64
sub rsp, 264
vmovdqa ymm4, YMMWORD PTR [rsi+128]
vmovapd zmm8, ZMMWORD PTR [rsi]
   

[Bug tree-optimization/112824] Stack spills and vector splitting with vector builtins

2023-12-02 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #1 from Chris Elrod  ---
Here I have added a godbolt example where I manually unroll the array, and GCC
generates excellent code: https://godbolt.org/z/sd4bhGW7e
I'm not sure it is 100% optimal, but with an inner Dual size of `7`, on
Skylake-X it is 38 uops for unrolled GCC with separate struct fields, vs 49
uops for Clang, vs 67 for GCC with arrays.
uica expects <14 clock cycles for the manually unrolled vs >23 for the array
version.

My experience so far with expression templates has borne this out: compilers
seem to struggle with peeling away abstractions.

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-03 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #2 from Chris Elrod  ---
https://godbolt.org/z/3648aMTz8

Perhaps a simpler diff: you can reproduce the problem by commenting out the
pragma below, while codegen becomes good with it left in.

template <typename T>
constexpr auto operator*(OuterDualUA2<T> a, OuterDualUA2<T> b)
  -> OuterDualUA2<T> {
  //return {a.value*b.value,
  //        a.value*b.p[0]+b.value*a.p[0],
  //        a.value*b.p[1]+b.value*a.p[1]};
  OuterDualUA2<T> c;
  c.value = a.value*b.value;
#pragma GCC unroll 16
  for (ptrdiff_t i = 0; i < 2; ++i)
    c.p[i] = a.value*b.p[i] + b.value*a.p[i];
  //c.p[0] = a.value*b.p[0] + b.value*a.p[0];
  //c.p[1] = a.value*b.p[1] + b.value*a.p[1];
  return c;
}


It's not great to have to add pragmas everywhere to my actual codebase. I
thought I hit the important cases, but my non-minimal example still gets
unnecessary register splits and stack spills, so maybe I missed places, or
perhaps there's another issue.

Given that GCC unrolls the above code even without the pragma, it seems like a
definite bug that the pragma is needed for the resulting code generation to
actually be good.
Not knowing the compiler pipeline, my naive guess is that the pragma causes
earlier unrolling than whatever optimization pass does it sans pragma, and that
some important analysis/optimization gets run between those two times.

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-03 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #3 from Chris Elrod  ---
> I thought I hit the important cases, but my non-minimal example still gets 
> unnecessary register splits and stack spills, so maybe I missed places, or 
> perhaps there's another issue.

Adding the unroll pragma to the `Vector`'s operator + and *:

template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(Vector<T, N> x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n)
    z.data[n] = x.data[n] + y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator*(Vector<T, N> x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n)
    z.data[n] = x.data[n] * y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator+(T x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x + y.data[n];
  return z;
}
template <typename T, ptrdiff_t N>
[[gnu::always_inline]] constexpr auto operator*(T x, Vector<T, N> y)
  -> Vector<T, N> {
  Vector<T, N> z;
#pragma GCC unroll 16
  for (ptrdiff_t n = 0; n < Vector<T, N>::L; ++n) z.data[n] = x * y.data[n];
  return z;
}


does not improve code generation (I still get the same problem), so that is a
reproducer for such an issue.

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-04 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #6 from Chris Elrod  ---
Hongtao Liu, I do think that one should ideally be able to get optimal codegen
when using 512-bit builtin vectors or vector intrinsics, without needing to set
`-mprefer-vector-width=512` (and, currently, also setting
`-mtune-ctrl=avx512_move_by_pieces`).

For example, if I remove `-mprefer-vector-width=512`, I get

prod(Dual<Dual<double, 7l>, 2l>&, Dual<Dual<double, 7l>, 2l> const&,
Dual<Dual<double, 7l>, 2l> const&):
push rbp
mov eax, -2
kmovb   k1, eax
mov rbp, rsp
and rsp, -64
sub rsp, 264
vmovdqa ymm4, YMMWORD PTR [rsi+128]
vmovapd zmm8, ZMMWORD PTR [rsi]
vmovapd zmm9, ZMMWORD PTR [rdx]
vmovdqa ymm6, YMMWORD PTR [rsi+64]
vmovdqa YMMWORD PTR [rsp+8], ymm4
vmovdqa ymm4, YMMWORD PTR [rdx+96]
vbroadcastsd zmm0, xmm8
vmovdqa ymm7, YMMWORD PTR [rsi+96]
vbroadcastsd zmm1, xmm9
vmovdqa YMMWORD PTR [rsp-56], ymm6
vmovdqa ymm5, YMMWORD PTR [rdx+128]
vmovdqa ymm6, YMMWORD PTR [rsi+160]
vmovdqa YMMWORD PTR [rsp+168], ymm4
vxorpd  xmm4, xmm4, xmm4
vaddpd  zmm0, zmm0, zmm4
vaddpd  zmm1, zmm1, zmm4
vmovdqa YMMWORD PTR [rsp-24], ymm7
vmovdqa ymm7, YMMWORD PTR [rdx+64]
vmovapd zmm3, ZMMWORD PTR [rsp-56]
vmovdqa YMMWORD PTR [rsp+40], ymm6
vmovdqa ymm6, YMMWORD PTR [rdx+160]
vmovdqa YMMWORD PTR [rsp+200], ymm5
vmulpd  zmm2, zmm0, zmm9
vmovdqa YMMWORD PTR [rsp+136], ymm7
vmulpd  zmm5, zmm1, zmm3
vbroadcastsd zmm3, xmm3
vmovdqa YMMWORD PTR [rsp+232], ymm6
vaddpd  zmm3, zmm3, zmm4
vmovapd zmm7, zmm2
vmovapd zmm2, ZMMWORD PTR [rsp+8]
vfmadd231pd zmm7{k1}, zmm8, zmm1
vmovapd zmm6, zmm5
vmovapd zmm5, ZMMWORD PTR [rsp+136]
vmulpd  zmm1, zmm1, zmm2
vfmadd231pd zmm6{k1}, zmm9, zmm3
vbroadcastsd zmm2, xmm2
vmovapd zmm3, ZMMWORD PTR [rsp+200]
vaddpd  zmm2, zmm2, zmm4
vmovapd ZMMWORD PTR [rdi], zmm7
vfmadd231pd zmm1{k1}, zmm9, zmm2
vmulpd  zmm2, zmm0, zmm5
vbroadcastsd zmm5, xmm5
vmulpd  zmm0, zmm0, zmm3
vbroadcastsd zmm3, xmm3
vaddpd  zmm5, zmm5, zmm4
vaddpd  zmm3, zmm3, zmm4
vfmadd231pd zmm2{k1}, zmm8, zmm5
vfmadd231pd zmm0{k1}, zmm8, zmm3
vaddpd  zmm2, zmm2, zmm6
vaddpd  zmm0, zmm0, zmm1
vmovapd ZMMWORD PTR [rdi+64], zmm2
vmovapd ZMMWORD PTR [rdi+128], zmm0
vzeroupper
leave
ret
prod(Dual<Dual<double, 8l>, 2l>&, Dual<Dual<double, 8l>, 2l> const&,
Dual<Dual<double, 8l>, 2l> const&):
push rbp
mov rbp, rsp
and rsp, -64
sub rsp, 648
vmovdqa ymm5, YMMWORD PTR [rsi+224]
vmovdqa ymm3, YMMWORD PTR [rsi+352]
vmovapd zmm0, ZMMWORD PTR [rdx+64]
vmovdqa ymm2, YMMWORD PTR [rsi+320]
vmovdqa YMMWORD PTR [rsp+104], ymm5
vmovdqa ymm5, YMMWORD PTR [rdx+224]
vmovdqa ymm7, YMMWORD PTR [rsi+128]
vmovdqa YMMWORD PTR [rsp+232], ymm3
vmovsd  xmm3, QWORD PTR [rsi]
vmovdqa ymm6, YMMWORD PTR [rsi+192]
vmovdqa YMMWORD PTR [rsp+488], ymm5
vmovdqa ymm4, YMMWORD PTR [rdx+192]
vmovapd zmm1, ZMMWORD PTR [rsi+64]
vbroadcastsd zmm5, xmm3
vmovdqa YMMWORD PTR [rsp+200], ymm2
vmovdqa ymm2, YMMWORD PTR [rdx+320]
vmulpd  zmm8, zmm5, zmm0
vmovdqa YMMWORD PTR [rsp+8], ymm7
vmovdqa ymm7, YMMWORD PTR [rsi+256]
vmovdqa YMMWORD PTR [rsp+72], ymm6
vmovdqa ymm6, YMMWORD PTR [rdx+128]
vmovdqa YMMWORD PTR [rsp+584], ymm2
vmovsd  xmm2, QWORD PTR [rdx]
vmovdqa YMMWORD PTR [rsp+136], ymm7
vmovdqa ymm7, YMMWORD PTR [rdx+256]
vmovdqa YMMWORD PTR [rsp+392], ymm6
vmovdqa ymm6, YMMWORD PTR [rdx+352]
vmulsd  xmm10, xmm3, xmm2
vmovdqa YMMWORD PTR [rsp+456], ymm4
vbroadcastsd zmm4, xmm2
vfmadd231pd zmm8, zmm4, zmm1
vmovdqa YMMWORD PTR [rsp+520], ymm7
vmovdqa YMMWORD PTR [rsp+616], ymm6
vmulpd  zmm9, zmm4, ZMMWORD PTR [rsp+72]
vmovsd  xmm6, QWORD PTR [rsp+520]
vmulpd  zmm4, zmm4, ZMMWORD PTR [rsp+200]
vmulpd  zmm11, zmm5, ZMMWORD PTR [rsp+456]
vmovsd  QWORD PTR [rdi], xmm10
vmulpd  zmm5, zmm5, ZMMWORD PTR [rsp+584]
vmovapd ZMMWORD PTR [rdi+64], zmm8
vfmadd231pd zmm9, zmm0, QWORD PTR [rsp+8]{1to8}
vfmadd231pd zmm4, zmm0, QWORD PTR [rsp+136]{1to8}
vmovsd  xmm0, QWORD PTR [rsp+392]
vmulsd  xmm7, xmm3, xmm0
vbroadcastsd zmm0, xmm0
vmulsd  xmm3, xmm3, xmm6
vfmadd132pd zmm0, zmm11, zmm1
vbroadcastsd zmm6, xmm6
vfmadd132pd zmm1, zmm5, zmm6
vfmadd231sd xmm7, xmm2, QWORD PTR [rsp+8]
vfmadd132sd  

[Bug middle-end/112824] Stack spills and vector splitting with vector builtins

2023-12-04 Thread elrodc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112824

--- Comment #8 from Chris Elrod  ---
> If it's designed the way you want it to be, another issue would be like, 
> should we lower 512-bit vector builtins/intrinsic to ymm/xmm when 
> -mprefer-vector-width=256, the answer is we'd rather not. 

To be clear, what I meant by

>  it would be great to respect
> `-mprefer-vector-width=512`, it should ideally also be able to respect
> vector builtins/intrinsics

is that when someone uses 512-bit vector builtins, codegen should generate
512-bit code regardless of the `-mprefer-vector-width` setting.
That is, as a developer, I would want 512-bit builtins to mean we get 512-bit
vector code generation.

>  If user explicitly use 512-bit vector type, builtins or intrinsics, gcc will 
> generate zmm no matter -mprefer-vector-width=.

This is what I would want, and I'd also want it to apply to movement of
`struct`s holding vector builtin objects, instead of the `ymm` usage as we see
here.

> And yes, there could be some mismatches between 512-bit intrinsic and 
> architecture tuning when you're using 512-bit intrinsic, and also rely on 
> compiler autogen to handle struct
> For such case, an explicit -mprefer-vector-width=512 is needed.

Note the template partial specialization

template <std::floating_point T, ptrdiff_t N> struct Vector<T, N>{
  static constexpr ptrdiff_t W = N >= VecWidth<T> ? VecWidth<T> :
    ptrdiff_t(std::bit_ceil(size_t(N)));
  static constexpr ptrdiff_t L = (N/W) + ((N%W)!=0);
  using V = Vec<W, T>;
  V data[L];
  static constexpr auto size()->ptrdiff_t{return N;}
};

Thus, `Vector`s in this example may explicitly be structs containing arrays of
vector builtins. I would expect such structs not to need an
`-mprefer-vector-width=512` setting for 512-bit code to be generated when
handling them.
Given small `L`, I would also expect passing such a struct as an argument by
value to a non-inlined function to be done in `zmm` registers when possible,
for example.
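
As a concrete illustration of that expectation (a sketch I am adding here, not
code from the report):

```
typedef double v8d __attribute__((vector_size(64)));

// A struct explicitly built from 512-bit vector builtins, like the
// specialization above with W = 8, L = 2.
struct V16 {
  v8d data[2];
};

// The expectation: even without -mprefer-vector-width=512, copying a V16 and
// moving it to/from the stack for argument passing would ideally use zmm
// moves rather than being split into ymm halves.
V16 axpy(double a, V16 x, V16 y) {
  return {a * x.data[0] + y.data[0], a * x.data[1] + y.data[1]};
}
```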