[Bug tree-optimization/88713] Vectorized code slow vs. flang

elrodc at gmail dot com Wed, 23 Jan 2019 02:12:03 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713


--- Comment #35 from Chris Elrod <elrodc at gmail dot com> ---
> rsqrt:
> .LFB12:
>         .cfi_startproc
>         vrsqrt28ps      (%rsi), %zmm0
>         vmovups %zmm0, (%rdi)
>         vzeroupper
>         ret
> 
> (huh?  isn't there a NR step missing?)
> 


I assume the NR step was considered unnecessary because vrsqrt28ps is much more
accurate than vrsqrt14ps. Unfortunately, -march=skylake-avx512 does not enable
-mavx512er, so that target should use the less accurate vrsqrt14ps + NR step.

I think vrsqrt14ps/pd only need -mavx512f (plus -mavx512vl for the 128/256-bit
forms).
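
For reference, a minimal scalar sketch of the NR step in question. This is
not GCC's expander, only the textbook refinement for one lane; the function
name rsqrt_refine is made up for illustration, and the estimate y is passed
in by the caller where the hardware would supply it via vrsqrt14ps.

```c
#include <assert.h>

/* One Newton-Raphson step for y ~= 1/sqrt(a), derived from
   f(y) = 1/y^2 - a:  y' = y * (1.5 - 0.5 * a * y * y).
   `y` stands in for a low-precision hardware estimate
   (e.g. what vrsqrt14ps would return for one lane). */
float rsqrt_refine(float a, float y)
{
    assert(a > 0.0f && y > 0.0f);
    return y * (1.5f - 0.5f * a * y * y);
}
```

Each step roughly squares the relative error, which is why an estimate good
to about 14 bits needs one step for single precision while a 28-bit estimate
might be used as is.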

> Without -mavx512er, we do not have an expander for rsqrtv16sf2, and without 
> that I don't know how the machinery can guess how to use rsqrt (there are 
> probably ways).

Looking at the asm from only r[i] = sqrtf(a[i]):

        vmovups (%rsi), %zmm1
        vxorps  %xmm0, %xmm0, %xmm0
        vcmpps  $4, %zmm1, %zmm0, %k1
        vrsqrt14ps      %zmm1, %zmm0{%k1}{z}
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm0, %zmm1, %zmm0
        vmulps  .LC1(%rip), %zmm1, %zmm1
        vaddps  .LC0(%rip), %zmm0, %zmm0
        vmulps  %zmm1, %zmm0, %zmm0
        vmovups %zmm0, (%rdi)

vs. the asm from only r[i] = 1 / a[i]:

        vmovups (%rsi), %zmm1
        vrcp14ps        %zmm1, %zmm0
        vmulps  %zmm1, %zmm0, %zmm1
        vmulps  %zmm1, %zmm0, %zmm1
        vaddps  %zmm0, %zmm0, %zmm0
        vsubps  %zmm1, %zmm0, %zmm0
        vmovups %zmm0, (%rdi)

It looks like the expanders are there for sqrt and for the reciprocal, and in
the rsqrt case we're just getting both, one after the other. So I could
benchmark whether either sequence is slower than the regular instruction on my
platform, if that would be useful.
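
To make the two listings easier to compare, here is my reading of them as
scalar C, one lane at a time. This is a sketch, not what GCC emits internally:
the function names are made up, the estimate y is passed in where the listing
uses vrsqrt14ps or vrcp14ps, and the values of .LC0 and .LC1 are not shown in
the listing, so -3.0 and -0.5 are assumptions.

```c
#include <assert.h>

/* Reciprocal refinement, following the 1 / a[i] listing:
   y = estimate of 1/a, result = (y + y) - (a * y) * y = y * (2 - a * y). */
float recip_refine(float a, float y)
{
    assert(a != 0.0f);
    float t = a * y;          /* vmulps: a * y          */
    t = t * y;                /* vmulps: a * y * y      */
    float two_y = y + y;      /* vaddps: 2 * y          */
    return two_y - t;         /* vsubps: 2y - a*y*y     */
}

/* Square-root refinement, following the sqrtf(a[i]) listing:
   y = estimate of 1/sqrt(a), result = (-0.5 * a * y) * (a * y * y - 3).
   In the listing the estimate is zero-masked for a == 0, so the caller
   should pass y = 0 in that case to get 0 back. */
float sqrt_refine(float a, float y)
{
    const float lc0 = -3.0f;  /* assumed value of .LC0 */
    const float lc1 = -0.5f;  /* assumed value of .LC1 */
    assert(a >= 0.0f);
    float ay = a * y;         /* vmulps: a * y          */
    float ayy = y * ay;       /* vmulps: a * y * y      */
    ay = ay * lc1;            /* vmulps: -0.5 * a * y   */
    ayy = ayy + lc0;          /* vaddps: a * y * y - 3  */
    return ay * ayy;          /* vmulps: final product  */
}
```

Under those assumptions, both are a single Newton-Raphson step on top of a
14-bit estimate, so chaining them costs two estimates and two refinements
where one rsqrt estimate plus one refinement would do.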
