https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #10 from Chris Elrod <elrodc at gmail dot com> ---
(In reply to Thomas Koenig from comment #9)
> Hm.
>
> It would help if your benchmark was complete, so I could run it.

I don't suppose you happen to have Julia and are familiar with it? If you are
(or someone else here is), I'll attach the code that generates the fake data
(the most important point is that columns 5:10 of BPP are the upper triangle
of a 3x3 symmetric positive definite matrix). I have also already written a
manually unrolled version that gfortran likes.

Otherwise, I could write Fortran code that builds an executable and runs the
benchmarks. What are best practices? system_clock? (A minimal sketch of what
I have in mind follows the benchmark numbers below.)

(In reply to Thomas Koenig from comment #9)
> However, what happens if you put in
>
> real, dimension(:) :: Uix
> real, dimension(:), intent(in) :: x
> real, dimension(:), intent(in) :: S
>
> ?
>
> gfortran should not pack then.

You're right! I wasn't able to follow this exactly, because it wouldn't let me
defer the shape of Uix. Probably because it needs to compile a version of
fpdbacksolve that can be called from the shared library? Interestingly, with
that change Flang failed to vectorize the code, but gfortran did. Compilers
are finicky. (A sketch of the declaration setup, with the packing rationale
spelled out, also follows the numbers below.)

Flang, original:

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     655.827 ns (0.00% GC)
  median time:      665.698 ns (0.00% GC)
  mean time:        689.967 ns (0.00% GC)
  maximum time:     1.061 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     162

Flang, not specifying shape (assembly shows it is using xmm):

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.086 μs (0.00% GC)
  median time:      8.315 μs (0.00% GC)
  mean time:        8.591 μs (0.00% GC)
  maximum time:     20.299 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     3

gfortran, transposed version (not vectorizable):

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     20.643 μs (0.00% GC)
  median time:      20.901 μs (0.00% GC)
  mean time:        21.441 μs (0.00% GC)
  maximum time:     54.103 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

gfortran, not specifying shape:

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.290 μs (0.00% GC)
  median time:      1.316 μs (0.00% GC)
  mean time:        1.347 μs (0.00% GC)
  maximum time:     4.562 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

The assembly confirms it is using zmm registers (and in any case this time is
much too fast for the code not to be vectorized).
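For illustration, here is a sketch of the declaration setup being discussed.
Only a sketch: the module wrapper and the placeholder body are mine, not the
actual fpdbacksolve from this report. The point is that assumed-shape dummies
receive an array descriptor, so a strided actual argument does not have to be
packed into a contiguous temporary, at the price of requiring an explicit
interface (hence the module):

    module backsolve_mod          ! module name chosen for the sketch only
      implicit none
    contains
      subroutine fpdbacksolve(Uix, x, S)
        ! Assumed-shape dummies: the caller passes a descriptor, so a strided
        ! section is not copied (packed) into a contiguous temporary.  Callers
        ! need an explicit interface, which the module provides.
        real, dimension(:)             :: Uix
        real, dimension(:), intent(in) :: x
        real, dimension(:), intent(in) :: S
        Uix = 0.0                 ! placeholder body; the real solve is omitted
      end subroutine fpdbacksolve
    end module backsolve_mod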
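And on the benchmarking question: a system_clock harness along these lines is
what I have in mind. process_bpp, the array shapes, and the repetition count
are placeholders, not the actual benchmark:

    program bench
      implicit none
      integer, parameter :: n = 1024, nrep = 10000
      integer(kind=8)    :: t0, t1, rate
      real               :: BPP(n, 10), X(n, 3)
      integer            :: i

      call random_number(BPP)           ! placeholder data, not the real test set

      call system_clock(t0, rate)
      do i = 1, nrep
         call process_bpp(X, BPP)       ! stand-in for the loop over fpdbacksolve
      end do
      call system_clock(t1)

      print *, 'mean time per call (ns):', &
           1.0e9 * real(t1 - t0) / real(rate) / real(nrep)

    contains

      ! Placeholder kernel so the sketch is self-contained.
      subroutine process_bpp(X, BPP)
        real, intent(out) :: X(:, :)
        real, intent(in)  :: BPP(:, :)
        X = BPP(:, 1:3)
      end subroutine process_bpp

    end program bench

With integer(kind=8) arguments, gfortran's system_clock reports a nanosecond
count rate, and looping nrep times amortizes the timer overhead; in a real run
I would also print a checksum of X so the work can't be optimized away.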
As for why gfortran is still slower than the Flang version, here is the loop
body:

.L16:
        vmovups (%r10,%rax), %zmm0
        vcmpps  $4, %zmm0, %zmm4, %k1
        vrsqrt14ps      %zmm0, %zmm1{%k1}{z}
        vmulps  %zmm0, %zmm1, %zmm2
        vmulps  %zmm1, %zmm2, %zmm0
        vmulps  %zmm5, %zmm2, %zmm2
        vaddps  %zmm6, %zmm0, %zmm0
        vmulps  %zmm2, %zmm0, %zmm0
        vrcp14ps        %zmm0, %zmm8
        vmulps  %zmm0, %zmm8, %zmm0
        vmulps  %zmm0, %zmm8, %zmm0
        vaddps  %zmm8, %zmm8, %zmm8
        vsubps  %zmm0, %zmm8, %zmm8
        vmulps  (%r8,%rax), %zmm8, %zmm9
        vmulps  (%r9,%rax), %zmm8, %zmm10
        vmulps  (%r12,%rax), %zmm8, %zmm8
        vmovaps %zmm9, %zmm3
        vfnmadd213ps    0(%r13,%rax), %zmm9, %zmm3
        vcmpps  $4, %zmm3, %zmm4, %k1
        vrsqrt14ps      %zmm3, %zmm2{%k1}{z}
        vmulps  %zmm3, %zmm2, %zmm3
        vmulps  %zmm2, %zmm3, %zmm1
        vmulps  %zmm5, %zmm3, %zmm3
        vaddps  %zmm6, %zmm1, %zmm1
        vmulps  %zmm3, %zmm1, %zmm1
        vmovaps %zmm9, %zmm3
        vfnmadd213ps    (%rdx,%rax), %zmm10, %zmm3
        vrcp14ps        %zmm1, %zmm0
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm1, %zmm0, %zmm1
        vaddps  %zmm0, %zmm0, %zmm0
        vsubps  %zmm1, %zmm0, %zmm11
        vmulps  %zmm11, %zmm3, %zmm12
        vmovaps %zmm10, %zmm3
        vfnmadd213ps    (%r14,%rax), %zmm10, %zmm3
        vfnmadd231ps    %zmm12, %zmm12, %zmm3
        vcmpps  $4, %zmm3, %zmm4, %k1
        vrsqrt14ps      %zmm3, %zmm1{%k1}{z}
        vmulps  %zmm3, %zmm1, %zmm3
        vmulps  %zmm1, %zmm3, %zmm0
        vmulps  %zmm5, %zmm3, %zmm3
        vmovups (%rcx,%rax), %zmm1
        vaddps  %zmm6, %zmm0, %zmm0
        vmulps  %zmm3, %zmm0, %zmm0
        vrcp14ps        %zmm0, %zmm2
        vmulps  %zmm0, %zmm2, %zmm0
        vmulps  %zmm0, %zmm2, %zmm0
        vaddps  %zmm2, %zmm2, %zmm2
        vsubps  %zmm0, %zmm2, %zmm0
        vmulps  %zmm0, %zmm11, %zmm3
        vmulps  %zmm12, %zmm3, %zmm3
        vxorps  %zmm7, %zmm3, %zmm3
        vmulps  %zmm1, %zmm3, %zmm2
        vmulps  %zmm3, %zmm9, %zmm3
        vfnmadd231ps    %zmm8, %zmm9, %zmm1
        vfmadd231ps     (%r11,%rax), %zmm0, %zmm2
        vfmadd132ps     %zmm10, %zmm3, %zmm0
        vmulps  %zmm11, %zmm1, %zmm1
        vfnmadd231ps    %zmm0, %zmm8, %zmm2
        vmovups %zmm2, (%rdi,%rax)
        vmovups %zmm1, (%rbx,%rax)
        vmovups %zmm8, (%r15,%rax)
        addq    $64, %rax
        cmpq    %rax, %rsi
        jne     .L16

I see far more arithmetic instructions here. Is that because gcc is adding
Newton-Raphson steps for the reciprocal square roots, and Flang is not? (A
scalar sketch of what I think the refinement sequence amounts to is at the end
of this comment.)

Comparing both against mpfr, they seem about equally accurate. Extreme errors
in X with gfortran:

-2.676151882353759158425593894760401386764929650751873229109107488336451373232463e-06
1.396013166812755065773272342567265482854011149035436404435394035182092107481168e-05

and with Flang:

-3.086256120296619934226727657432734517145988850964563595564605964817907892518832e-05
2.28181026645836083851985181914739792956608114291961973078603672383927518755594e-06

on the data set I benchmarked with.

Anyway, thanks for the prompt responses. My issue was that gfortran didn't
vectorize, and your second change fixed that. It would of course be nice if
code written one way were optimized well across all compilers and versions,
but compilers are finicky. Simply reordering operations and adding or removing
temporary declarations in fpdbacksolve would sometimes cause Flang to fail to
vectorize! Maybe I'll use #ifdefs around the declarations and save the files
as .F90...
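Regarding the Newton-Raphson question above, my reading of the loop body (an
assumption on my part, not confirmed gcc internals) is that each
vrsqrt14ps/vrcp14ps estimate is followed by one refinement step, roughly as in
this scalar sketch:

    ! Scalar sketch of the refinement pattern the vector loop appears to do.
    ! y0 stands for the ~14-bit vrsqrt14ps estimate and z0 for the vrcp14ps
    ! estimate; crude stand-ins are used so the sketch is self-contained.
    elemental function nr_refined_rsqrt(x) result(z)
      real, intent(in) :: x
      real :: y0, s, z0, z
      y0 = 1.0 / sqrt(x)                    ! stand-in for vrsqrt14ps
      s  = (x*y0) * 0.5 * (3.0 - x*y0*y0)   ! one Newton step -> refined sqrt(x)
      z0 = 1.0 / s                          ! stand-in for vrcp14ps
      z  = z0 * (2.0 - s*z0)                ! one Newton step -> refined 1/sqrt(x)
    end function nr_refined_rsqrt

If Flang just uses the ~14-bit estimates without the refinement, that would
explain both its shorter loop and perhaps the slightly larger extreme error it
shows above.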